- earx presents -
| |
-+------------------------------+-
--| some more optimisation crap! |--
-+------------------------------+-
| |
- introduction -------------------------------------------------------------
I hope you have some 68k and general coding knowledge before you start!
Also, a lot of the examples probably won't work on a plain 68000, because
of word accesses on odd addresses.
- unrolling and using chunks -----------------------------------------------
Definitely a must on the older 68k's, and not entirely pointless even on
the newer 68060: unrolling a loop is still something that every coder
should know. As you can see from the following example, even a trivial
task such as copying bytes repeatedly takes overhead. We use bytes here.
Yes, it could have been words as well, but it's just to give an impression.
; d0.w=number of bytes to move -1
loop: move.b (a0)+,(a1)+ ; Copy a byte.
dbf d0,loop
Of course the 'dbf' instruction takes some cycles. This is not as much
as the 'move' costs, but it's still considerable. To make things worse there
is a second and more implicit form of overhead. You see, a byte 'move' takes
up as much time as an aligned word move. (*) So generally speaking we're
twice as slow as we'd want, or even slower as we will see later on.
(*): this hasn't been verified for 040 or 060!
The obvious way to speed up this situation is using word moves. In this case
we have to take care of the one remaining byte move, like so:
; d0.w=number of bytes to move
 lsr.w #1,d0 ; d0.w=number of words to move
bcc.s amount_even ; Number was even -> skip byte!
move.b (a0)+,(a1)+ ; Copy a byte.
amount_even:
; d0.w=number of words to move
subq.w #1,d0 ; d0.w=number of words to move -1
bmi.s end ; No words -> end!
loop: move.w (a0)+,(a1)+ ; Copy a word.
dbf d0,loop
end:
This is faster: moving two bytes in one go is good. And we can take it one
step further still, with longword moves:
; d0.w=number of bytes to move
 lsr.w #1,d0 ; d0.w=number of words to move
bcc.s amount_even ; Number was even -> skip byte!
move.b (a0)+,(a1)+ ; Copy a byte.
amount_even:
; d0.w=number of words to move
 lsr.w #1,d0 ; d0.w=number of longs to move
bcc.s amount_longeven ; Number was even -> skip word!
move.w (a0)+,(a1)+ ; Copy a word.
amount_longeven:
; d0.w=number of longs to move
subq.w #1,d0 ; d0.w=number of longs to move -1
 bmi.s end ; No longs -> end!
loop: move.l (a0)+,(a1)+ ; Copy a long.
dbf d0,loop
end:
Better still. This method has a drawback however. If we look closer at the
68k data bus we see a hazard: the bus is 16 or 32 bits wide, which means the
processor works on 16-bit (even) or 32-bit (longword) aligned addresses. So
accessing a word/longword at an odd address actually implies two separate
accesses. This is all transparent to the coder. And it's also slower...
Let's see how we can avoid this bummer. There are four situations:
a) source even, destination even
b) source even, destination odd
c) source odd, destination even
d) source odd, destination odd
Situation a is trivial. Situation d needs an extra byte move to compensate
for the oddness. For situations b and c there are remedies as well. Let's
assume situation b here:
; d7.w=number of words to copy -1
clr.l d0
move.w (a0)+,d0 ; d0.l=$0000aabb
lsl.l #8,d0 ; d0.l=$00aabb00
move.l d0,d1 ; d1.l=$00aabb00
 swap d0 ; d0.l=$bb0000aa
 move.b d0,(a1)+ ; Store $aa (the first byte).
loop: lsl.l #8,d1 ; d1.l=$aabb0000
 move.w (a0)+,d1 ; d1.l=$aabbccdd
 ror.l #8,d1 ; d1.l=$ddaabbcc
 move.w d1,(a1)+ ; Store word $bbcc.
 swap d1 ; Leftover $dd back into bits 8-15 for the next pass.
 dbf d7,loop
; Note: the last leftover byte stays in d1; store it separately if the
; total byte count is odd.
As you can see this requires shifts, and therefore it's near useless on a
68000, which has no barrel shifter. The 68020 and up have one, though. Also,
we need a lot of combining to account for all of these particularities.
No one said it would be easy ;)
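If 68k isn't your daily bread, here is the same combine trick as a C
sketch. The function name and layout are mine, not from the article: the
destination gets one lone byte first, and after that every stored word
pairs the leftover byte of the previous source word with the first byte
of the next one, just like the d1 shift/rotate register does.

```c
#include <stddef.h>
#include <stdint.h>

/* C sketch of situation b: src is word-aligned, dst starts on an odd
   address. Store one byte, then build each destination word from the
   leftover byte and the next source word. */
static void copy_shifted(const uint8_t *src, uint8_t *dst, size_t nwords)
{
    /* First source word, big-endian like the 68k sees it. */
    uint16_t w = (uint16_t)((src[0] << 8) | src[1]);
    src += 2;
    *dst++ = (uint8_t)(w >> 8);              /* first byte on its own */
    for (size_t i = 0; i < nwords; i++) {
        uint16_t next = (uint16_t)((src[0] << 8) | src[1]);
        src += 2;
        uint16_t out = (uint16_t)((w << 8) | (next >> 8)); /* old|new */
        *dst++ = (uint8_t)(out >> 8);
        *dst++ = (uint8_t)out;
        w = next;                            /* keep the leftover byte */
    }
}
```

As in the asm version, one byte of the last source word stays unwritten
(it lives on in `w`, like in d1).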
Let's neglect the technical details for a while and continue. Now we only
have one branch instruction per two bytes on average. It can be far, far
less. We could unroll the whole word loop, but besides taking up a possibly
very large amount of instruction space, code that big no longer fits the
instruction cache on machines that have one...
The best alternative is to unroll into a modestly sized loop that fits in
your cpu's instruction cache. The jumptree loop method springs to mind. For
convenience we use a situation where we copy words only:
; d0.w=number of words to copy (no -1 here: the jump handles the
; d0 mod 32 leftover words first, the dbf then runs d0/32 full passes)
 move.w d0,d1
 andi.w #32-1,d0 ; d0.w=d0.w mod 32
 lsr.w #5,d1 ; d1.w=d1.w/32=number of full loops
neg.w d0
jmp endjumptree(pc,d0.w*2) ; Jump into loop.
loop: REPT 32
move.w (a0)+,(a1)+
ENDR
endjumptree:
dbf d1,loop
This will fit into an 030's instruction cache at least (32 word moves are
only 64 bytes). And that about covers the byte copy issue.
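The C world knows this jump-into-the-loop construct as Duff's device. A
sketch with an unroll factor of 8, where the switch plays the role of the
jmp into the jumptree (names are mine):

```c
#include <stddef.h>
#include <stdint.h>

/* Duff's device: the switch jumps into the middle of the unrolled loop
   to handle n % 8 words first; every pass after that copies a full 8. */
static void copy_words(const uint16_t *src, uint16_t *dst, size_t n)
{
    if (n == 0)
        return;
    size_t loops = (n + 7) / 8;      /* total passes, first one partial */
    switch (n % 8) {
    case 0: do { *dst++ = *src++;    /* fallthrough is the whole point */
    case 7:      *dst++ = *src++;
    case 6:      *dst++ = *src++;
    case 5:      *dst++ = *src++;
    case 4:      *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--loops > 0);
    }
}
```

Same trade-off as on the 68k: the loop body stays small enough to be
cache-friendly while the per-element branch overhead drops by a factor
of 8.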
Unrolling techniques are of course useful for a lot of other situations. And
it seems that for every operation there is a good unrolling solution. Let's
take a linear interpolation example (used for gouraud shading for instance).
This illustrates that even operations with a size different from
{2,4,8,16,..} bytes are easy to handle:
; d0.l,d1.l=fixed point index, step. Both are stored "swapped": integer
; part in the low word, fraction in the high word. The fraction's carry
; falls out of the register into X, and the next addx adds it into the
; integer part -- this works because move doesn't touch the X flag.
; d7.w=number of words to process
; a0: destination array
; a1: source shading array
lea jumpCountTable,a2
move.l (a2,d7.w*4),d7 ; Get offset and loops.
move.w d7,d2 ; d2.w=number of loops
swap d7 ; d7.w=offset in loop
jmp endjumptree(pc,d7.w) ; Jump into loop.
loop: REPT 16
move.w (a1,d0.w*2),(a0)+
 addx.l d1,d0 ; Step; X adds the previous fraction carry.
ENDR
endjumptree:
dbf d2,loop
Of course we need some additional code to prepare the lookup table:
lea jumpCountTable,a0
clr.l d0
move.w #MAX_WORDS-1,d7
loop: move.l d0,d1
divu.w #16,d1 ; Divide by unroll-amount.
; d1.l=remainder|quotient
move.l d1,d2
swap d2 ; d2.w=remainder
 mulu.w #6,d2 ; d2.w=remainder*opsize (move.w+addx.l = 6 bytes)
 neg.w d2 ; Offset must be negative: we jump backwards into the loop.
 swap d2
 move.w d1,d2 ; d2.l=offset|quotient
move.l d2,(a0)+
addq.w #1,d0
dbf d7,loop
This is the fastest way on a 020/030/040. On a 68000 it might be faster to
unroll completely, if that is possible. On a 68060 it might not be required
to unroll.
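For reference, here is what the interpolation walk computes, as a C
sketch. It uses a plain 16.16 fixed-point layout with the integer part in
the high word; only the asm needs the swapped layout for the addx trick.
Names are illustrative:

```c
#include <stddef.h>
#include <stdint.h>

/* Step a 16.16 fixed-point index through a shading table; the integer
   part selects the entry to copy to the destination. */
static void shade_walk(const uint16_t *shade, uint16_t *dst,
                       uint32_t index, uint32_t step, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = shade[index >> 16]; /* integer part = table entry */
        index += step;               /* fraction carries up naturally */
    }
}
```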
One last note: of course the fastest way to copy longwords is still a movem.l:
movem.l (a5),d0-a4
movem.l d0-a4,(a6)
From the 020 and on it's not a must, because of the fast instruction cache.
That about wraps it up for this part.
- conditional branches -----------------------------------------------------
Selection is a process that has its ugly sides. Think of pipeline stalls
in the 040/060 when a conditional branch happens. Or of using a lot of
cases, which can reduce code to a bowl of spaghetti. Of course conditional
branches are inevitable, but are they required everywhere we use the
process of selection?
The answer is: no! Often there is a solution that takes less work, gives
a better overview and is usually faster as well. A good start is a clipping
problem.
A straightforward example would be:
 tst.b d0
 bpl.s no_clip ; d0>=0 -> no clip
 clr.b d0 ; Clip to zero.
no_clip:
The core of the code being the Bcc instruction of course. Now this might seem
like a detour, but a reasonable approach is the following:
tst.b d0
spl d1 ; d0>=0 -> d1.b=$FF else d1.b=0
and.b d1,d0 ; Mask out d0, if d0 was negative.
Seems like a waste of registers, but think about being freed from pipeline
stalls. Also, you don't need labels. Labels invite bugs: copy a piece of
code around, forget to update the branch target, and you have one.
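The same branchless clamp-to-zero works in C, assuming the usual
arithmetic right shift on signed values (which in practice every 68k
compiler does; strictly it's implementation-defined). The mask plays the
role of the spl result:

```c
#include <stdint.h>

/* tst/spl/and in C: smear the sign bit into a mask, invert it, AND it in.
   v >= 0 gives mask 0xFF (keep v), v < 0 gives mask 0x00 (clamp to 0). */
static int8_t clamp_to_zero(int8_t v)
{
    int8_t mask = (int8_t)~(v >> 7); /* arithmetic shift assumed */
    return (int8_t)(v & mask);
}
```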
Another example is using the carry/extend bit instructions. Instead of
working clumsily like so...
; a := a + b
; d0.l=b.lo, d1.l=a.lo, d2.l=b.hi, d3.l=a.hi
add.l d0,d1
bcc.s carry_done
addq.l #1,d3 ; Perform carry-add.
carry_done:
add.l d2,d3
... we do it like this:
add.l d0,d1
addx.l d2,d3
A well known and often used application no doubt. Instructions like subx
and roxr/roxl also have their uses.
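In C terms the add/addx pair is a 64-bit add built from 32-bit halves,
with the carry of the low half detected by the unsigned wrap-around
(function name and hi/lo split are mine):

```c
#include <stdint.h>

/* add.l + addx.l: add b to a, both split into hi:lo 32-bit halves. The
   carry out of the low add plays the role of the X bit. */
static void add64(uint32_t *ahi, uint32_t *alo, uint32_t bhi, uint32_t blo)
{
    uint32_t sum = *alo + blo;       /* add.l  */
    *ahi += bhi + (sum < blo);       /* addx.l: high halves plus carry */
    *alo = sum;
}
```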
Routines that handle cases mostly suffer from a huge amount of compares and
branch combinations. We know how this looks, but just for the purpose of
illustration:
cmpi.w #0,d0
beq case0
cmpi.w #1,d0
beq case1
cmpi.w #2,d0
beq case2
cmpi.w #3,d0
beq case3
cmpi.w #4,d0
beq case4
...
As we know, there is a more sensible approach:
jsr jumptable(pc,d0.w*4)
jumptable:
bra.w case0
bra.w case1
bra.w case2
bra.w case3
bra.w case4
bra.w ..
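The same constant-time dispatch in C is just an array of function
pointers (handler names and return values are made up for illustration):

```c
#include <stddef.h>

/* The jumptable in C: one indexed call replaces the compare/branch
   chain. The caller must guarantee n is in range, like the asm version. */
static int case0(void) { return 100; }
static int case1(void) { return 101; }
static int case2(void) { return 102; }

static int (*const handlers[])(void) = { case0, case1, case2 };

static int dispatch(size_t n)
{
    return handlers[n]();
}
```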
So a constant time solution is possible. This works nicely on integers, but
what about more complex datatypes? Strings are a good example. They consist
of a variable number of primitive datatypes (bytes). Let's say we just use
such a case-handler on a small number of strings:
cmpi.l #"ASSW",(a0)
bne.s no_asswipe_found
move.l 4(a0),d0
move.b #" ",d0
cmpi.l #"IPE ",d0
beq handle_asswipe
no_asswipe_found:
cmpi.l #"ASSH",(a0)
bne.s no_asshole_found
move.l 4(a0),d0
move.b #" ",d0
cmpi.l #"OLE ",d0
beq handle_asshole
no_asshole_found:
cmpi.l #"BUTT",(a0)
bne.s no_buttmunch_found
cmpi.l #"MUNC",4(a0)
bne.s no_buttmunch_found
cmpi.b #"H",8(a0)
bne.s no_buttmunch_found
 bra handle_buttmunch
no_buttmunch_found:
cmpi.l #"BITC",(a0)
bne.s no_bitch_found
 cmpi.b #"H",4(a0)
beq handle_bitch
no_bitch_found:
....
handle_asswipe:
...
handle_asshole:
...
handle_buttmunch:
...
handle_bitch:
...
As we can see, there are four cases. Since the string sizes differ, we can't
use the good old jumptable approach. And a linear search like the above does
take some time, especially when we consider a lot of possible strings in our
'dictionary'. How on earth are we going to optimise this little baby then?
The answer is found in any decent book on algorithms and is called hashing.
This alone however, is no guarantee for speed-optimised code. There are
many forms of hashing. There are static and dynamic flavours, true constant
time lookups, or 'almost' constant time lookups, etc. You can even hardcode
your hashing table if it's static.
For those of you unfamiliar with the concept, allow me to explain. Hashing
is a process of generating keys for datatypes and using these keys to index
(almost) in constant time to the information we want. In this case:
datatype: string of bytes,
key: integer,
information: handler routine.
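As a C sketch (hash function, table size and names are my picks, not from
the text): compute a small integer key from the bytes, index a table, and
confirm the hit with a single compare instead of a scan:

```c
#include <stddef.h>
#include <string.h>

#define BUCKETS 16

/* One slot per bucket; big enough that a handful of strings is unlikely
   to collide. A real table needs a collision strategy (chaining, ...). */
static const char *table[BUCKETS];

static size_t hash(const char *s)
{
    size_t h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++; /* classic multiplicative hash */
    return h % BUCKETS;
}

static void insert(const char *s) { table[hash(s)] = s; }

static int lookup(const char *s)  /* 1 = found, 0 = not found */
{
    const char *t = table[hash(s)];
    return t != NULL && strcmp(t, s) == 0; /* one compare, no scan */
}
```

Instead of the string pointer, each slot would of course hold (or sit next
to) the handler routine to call.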
We could discuss the entire topic of hashing, but that would take up a lot
of time and space. We'll limit the scope to static hashing of strings in a
hardcoded way, since this is easy to do in assembler. If we take only the
four strings of the above example we get a pretty bad impression of the
algorithm, so keep in mind that it also works for dozens of strings.
The way we handle this is by using a tree structure: we use jumptables for
parts of a string.
handle_string:
lea jumptable,a1 ; a1: root jumptable
loop:
; a1: jumptable of choice
clr.l d0
move.b (a0)+,d0 ; d0.b=character
cmpi.b #"A",d0
blt.s end
cmpi.b #"Z",d0
bhi.s end ; d0<"A" | d0>"Z" -> not found
subi.b #"A",d0 ; d0.l=index
jsr (a1,d0.l*4)
tst.l d0
beq loop ; d0=0 -> char is legal, loop
end:
 ; d0.l= <0:not found, >0:found
rts
jumptable:
bra.w test_ass ; 4 bytes
bra.w handle_b ; 4 bytes
moveq #-1,d0 ; /
rts ; \ 4 bytes
moveq #-1,d0 ; /
rts ; \ 4 bytes
...
..
test_ass:
cmpi.w #"SS",(a0)+
bne.s no_ass
lea ass_jumptable,a1
moveq #0,d0
rts
no_ass: moveq #-1,d0
rts
handle_b:
lea b_jumptable,a1
moveq #0,d0
rts
ass_jumptable:
moveq #-1,d0 ; a
rts
moveq #-1,d0 ; b
rts
...
..
bra.w test_asshole ; h
...
..
bra.w test_asswipe ; w
...
..
test_asshole:
 ; "ASSH" was already consumed; "OLE " remains (assuming the strings
 ; are padded with spaces, like in the earlier example).
 cmpi.l #"OLE ",(a0)+
 bne.s no_asshole
 ; asshole handling here..
 moveq #1,d0 ; Found.
 rts
no_asshole:
 moveq #-1,d0 ; Not found.
 rts
test_asswipe:
 ; "ASSW" was already consumed; "IPE " remains.
 cmpi.l #"IPE ",(a0)+
 bne.s no_asswipe
 ; asswipe handling here..
 moveq #1,d0 ; Found.
 rts
no_asswipe:
 moveq #-1,d0 ; Not found.
 rts
b_jumptable:
moveq #-1,d0 ; a
rts
moveq #-1,d0 ; b
rts
...
..
bra.w test_bitch ; i
...
..
bra.w test_butthole ; u
...
..
test_bitch:
 ; "BI" was already consumed; "TCH " remains.
 cmpi.l #"TCH ",(a0)+
 bne.s no_bitch
 ; bitch handling here..
 moveq #1,d0 ; Found.
 rts
no_bitch:
 moveq #-1,d0 ; Not found.
 rts
test_butthole:
 ; "BU" was already consumed; "TTHOLE " remains.
 cmpi.l #"TTHO",(a0)+
 bne.s no_butthole
 cmpi.w #"LE",(a0)+
 bne.s no_butthole
 ; butthole handling here..
 moveq #1,d0 ; Found.
 rts
no_butthole:
 moveq #-1,d0 ; Not found.
 rts
Well.. that's about it. Have fun optimising!
- earx -------------------------------------------------------------- 2002 -
|