                              - earx presents -

                      -+------------------------------+-
                     --| some more optimisation crap! |--
                      -+------------------------------+-


- introduction ---------------------------------------------------------------

  I hope you have some 68k and general coding knowledge before you start!
Also, a lot of the examples probably won't work on a plain 68000, because of
word access on odd addresses.

- unrolling and using chunks --------------------------------------------------

  Definitely a must on the older 68k's, and not entirely useless even on the
newer 68060: unrolling a loop is still something that every coder should
know. As you can see from the following example, even a trivial task such as
copying bytes repeatedly carries overhead. We use bytes here. Yes, it could
have been words as well, but it's just to give an impression.

        ; d0.w=number of bytes to move -1
loop:   move.b  (a0)+,(a1)+     ; Copy a byte.
        dbf     d0,loop

  Of course the 'dbf' instruction takes some cycles. This is not as much as
the 'move' costs, but it is still considerable. To make things worse there is
a second, more implicit form of overhead. You see, a byte 'move' takes up as
much time as an aligned word move. (*) So generally speaking we're twice as
slow as we'd want, or even slower, as we will see later on.

  (*): this hasn't been verified for 040 or 060!

  The obvious way to speed up this situation is using word moves. In this
case we have to take care of the one remaining byte move, like so:

        ; d0.w=number of bytes to move
        lsr.w   #1,d0           ; d0.w=number of words to move
        bcc.s   amount_even     ; Number was even -> skip byte!
        move.b  (a0)+,(a1)+     ; Copy a byte.
amount_even:
        ; d0.w=number of words to move
        subq.w  #1,d0           ; d0.w=number of words to move -1
        bmi.s   end             ; No words -> end!
loop:   move.w  (a0)+,(a1)+     ; Copy a word.
        dbf     d0,loop
end:

  This is faster. Moving two bytes in one go is good. Moving four in one go
is even better:

        ; d0.w=number of bytes to move
        lsr.w   #1,d0           ; d0.w=number of words to move
        bcc.s   amount_even     ; Number was even -> skip byte!
        move.b  (a0)+,(a1)+     ; Copy a byte.
amount_even:
        ; d0.w=number of words to move
        lsr.w   #1,d0           ; d0.w=number of longs to move
        bcc.s   amount_longeven ; Number was even -> skip word!
        move.w  (a0)+,(a1)+     ; Copy a word.
amount_longeven:
        ; d0.w=number of longs to move
        subq.w  #1,d0           ; d0.w=number of longs to move -1
        bmi.s   end             ; No longs -> end!
loop:   move.l  (a0)+,(a1)+     ; Copy a long.
        dbf     d0,loop
end:

  Better still. This method has a drawback however: if we look closer at the
68k databus we see a hazard. The databus is 16 or 32 bits wide. This means
the processor works on 16bit (even) or 32bit (longeven) aligned addresses.
So accessing a word/longword at an odd address actually implies two separate
accesses. This is all transparent to the coder. And it's also slower...
Let's see how we can avoid this bummer. There are four situations:

  a) source even, destination even
  b) source even, destination odd
  c) source odd,  destination even
  d) source odd,  destination odd

  Situation a is trivial. Situation d needs an extra byte move up front to
compensate for the oddness. For situations b and c there are remedies as
well. Let's assume situation b here:

        ; d7.w=number of words to copy -1
        moveq   #0,d1
        move.w  (a0)+,d1        ; d1.l=$0000aabb
        ror.w   #8,d1           ; d1.l=$0000bbaa
        move.b  d1,(a1)+        ; Store $aa (first byte).
        lsl.l   #8,d1           ; d1.l=$00bbaa00, $bb is pending
loop:   move.w  (a0)+,d1        ; d1.l=$00bbccdd
        ror.l   #8,d1           ; d1.l=$dd00bbcc
        move.w  d1,(a1)+        ; Store word $bbcc.
        lsr.l   #8,d1           ; d1.l=$00dd00bb, $dd is pending
        dbf     d7,loop

  As you can see this requires shifts, and therefore it's useless on a
68000, which has no barrel shifter. The 68020 and up have one, though. Also,
we need a lot of combining to account for all of these particularities. No
one said it would be easy ;)

  Let's neglect the technical details for a while and continue. Now we only
have one branch instruction on average per two bytes. It can be far, far
less. We could unroll the whole word loop, but besides taking up a possibly
very large instruction space, this doesn't help on machines with an
instruction cache...
  The best alternative is to unroll to a more humbly sized loop that will
fit in your cpu's cache. The jumptree loop method springs to mind. For
convenience we use a situation where we copy words only:

        ; d0.w=number of words to copy
        move.w  d0,d1
        andi.w  #32-1,d0        ; d0.w=number of words mod 32
        lsr.w   #5,d1           ; d1.w=/32=number of loops
        neg.w   d0
        jmp     endjumptree(pc,d0.w*2)  ; Jump into loop.
loop:   REPT    32
        move.w  (a0)+,(a1)+
        ENDR
endjumptree:
        dbf     d1,loop

  This will fit into a 030's cache at least. And that about covers the byte
copy issue. Unrolling techniques are of course useful for a lot of other
situations. And it seems that for all operations there is a good unrolling
solution. Let's take a linear interpolation example (used for gouraud
shading, for instance). This illustrates that even operations that have a
size different from {2,4,8,16,..} bytes are easy to handle:

        ; d0.l,d1.l=fixed point index and step, in swapped format:
        ;           fraction in the high word, integer in the low word
        ;           (make sure X is clear on entry)
        ; d7.w=number of words to process
        ; a0: destination array
        ; a1: source shading array
        lea     jumpCountTable,a2
        move.l  (a2,d7.w*4),d7  ; Get offset and loops.
        move.w  d7,d2           ; d2.w=number of loops
        swap    d7              ; d7.w=offset into loop
        jmp     endjumptree(pc,d7.w)    ; Jump into loop.
loop:   REPT    16
        move.w  (a1,d0.w*2),(a0)+
        addx.l  d1,d0           ; Step. X feeds the fraction carry
                                ; into the integer part next time.
        ENDR
endjumptree:
        dbf     d2,loop

  Of course we need some additional code to prepare the lookup table:

        lea     jumpCountTable,a0
        clr.l   d0
        move.w  #MAX_WORDS-1,d7
loop:   move.l  d0,d1
        divu.w  #16,d1          ; Divide by unroll amount.
                                ; d1.l=remainder|quotient
        move.l  d1,d2
        swap    d2              ; d2.w=remainder
        mulu.w  #6,d2           ; d2.w=remainder*opsize
        neg.w   d2              ; d2.w=offset (negative, back into loop)
        swap    d2
        move.w  d1,d2           ; d2.l=offset|quotient
        move.l  d2,(a0)+
        addq.w  #1,d0
        dbf     d7,loop

  This is the fastest way on a 020/030/040. On a 68000 it might be faster to
unroll completely, if that is possible. On a 68060 it might not be required
to unroll at all. One last note: of course the fastest way to copy longwords
is still a movem.l:

        movem.l (a5),d0-a4
        movem.l d0-a4,(a6)

  From the 020 on it's not a must, because of the fast instruction cache.
That about wraps it up for this part.
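  As a footnote to the movem.l remark: a minimal sketch of a complete copy
loop built around it. The chunk size of 44 bytes (11 registers) and the
register choice are assumptions for illustration only. Note that movem.l to
memory has no postincrement mode, hence the lea:

        ; Copy d7.w+1 chunks of 44 bytes from (a5) to (a6).
        ; d7.w=number of 44 byte chunks -1
loop:   movem.l (a5)+,d0-d6/a0-a3       ; Read 44 bytes.
        movem.l d0-d6/a0-a3,(a6)        ; Write 44 bytes.
        lea     44(a6),a6               ; No postincrement on movem writes.
        dbf     d7,loop

  Any remainder that is not a multiple of 44 bytes still has to be handled
with one of the smaller loops from above.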
- conditional branches --------------------------------------------------------

  Selection is a process that has its ugly sides. Think of pipeline stalls
in the 040/060 when a conditional branch happens. Or using a lot of cases,
which can reduce code to a bowl of spaghetti. Of course conditional branches
are inevitable, but are they required everywhere we use the process of
selection? The answer is: no! Often there is a solution that takes less
work, gives a better overview, and mostly is faster.

  A good start is a clipping problem. A straightforward example would be:

        tst.b   d0
        bpl.s   clipped         ; d0>=0 -> no clip
        clr.b   d0              ; Clip to zero.
clipped:

  The core of the code being the bcc instruction, of course. Now this might
seem like a detour, but a reasonable approach is the following:

        tst.b   d0
        spl     d1              ; d0>=0 -> d1.b=$FF else d1.b=0
        and.b   d1,d0           ; Mask out d0, if d0 was negative.

  Seems like a waste of registers, but think about being freed from pipeline
stalls. Also, you don't need labels. You can make bugs with those if you
just copy a piece of code and forget to update the branch label.

  Another example is using the carry bit instructions. Instead of working
clumsily like so...

        ; a := a + b
        ; d0.l=b.lo, d1.l=a.lo, d2.l=b.hi, d3.l=a.hi
        add.l   d0,d1
        bcc.s   carry_done
        addq.l  #1,d3           ; Perform carry-add.
carry_done:
        add.l   d2,d3

  ... we do it like this:

        add.l   d0,d1
        addx.l  d2,d3

  A well known and often used application, no doubt. Instructions like subx
and roxr/roxl also have their uses.

  Routines that handle cases mostly suffer from a huge amount of compare and
branch combinations. We know how this looks, but just for the purpose of
illustration:

        cmpi.w  #0,d0
        beq     case0
        cmpi.w  #1,d0
        beq     case1
        cmpi.w  #2,d0
        beq     case2
        cmpi.w  #3,d0
        beq     case3
        cmpi.w  #4,d0
        beq     case4
        ...

  As we know, there is a more sensible approach:

        jsr     jumptable(pc,d0.w*4)

jumptable:
        bra.w   case0
        bra.w   case1
        bra.w   case2
        bra.w   case3
        bra.w   case4
        ...

  So a constant time solution is possible.
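  Since subx was mentioned above: the 64bit subtract is the exact mirror of
the addx example, assuming the same register layout:

        ; a := a - b
        ; d0.l=b.lo, d1.l=a.lo, d2.l=b.hi, d3.l=a.hi
        sub.l   d0,d1           ; Low halves; X holds the borrow.
        subx.l  d2,d3           ; High halves minus borrow.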
  This jumptable approach works nicely on integers, but what about more
complex datatypes? Strings are a good example. They consist of a variable
amount of primitive datatypes (bytes). Let's say we just use such a
case-handler on a small number of strings:

        cmpi.l  #"ASSW",(a0)
        bne.s   no_asswipe_found
        move.l  4(a0),d0
        move.b  #" ",d0
        cmpi.l  #"IPE ",d0
        beq     handle_asswipe
no_asswipe_found:
        cmpi.l  #"ASSH",(a0)
        bne.s   no_asshole_found
        move.l  4(a0),d0
        move.b  #" ",d0
        cmpi.l  #"OLE ",d0
        beq     handle_asshole
no_asshole_found:
        cmpi.l  #"BUTT",(a0)
        bne.s   no_buttmunch_found
        cmpi.l  #"MUNC",4(a0)
        bne.s   no_buttmunch_found
        cmpi.b  #"H",8(a0)
        beq     handle_buttmunch
no_buttmunch_found:
        cmpi.l  #"BITC",(a0)
        bne.s   no_bitch_found
        cmpi.b  #"H",4(a0)
        beq     handle_bitch
no_bitch_found:
        ...

handle_asswipe:
        ...
handle_asshole:
        ...
handle_buttmunch:
        ...
handle_bitch:
        ...

  As we can see, there are four cases. Since the string sizes differ, we
can't use the good old jumptable approach. And doing the searching (a linear
search) like above does take some time. Especially when we consider a lot of
possible strings in our 'dictionary'. How on earth are we going to optimise
this little baby then?

  The answer is found in any decent book on algorithms and is called
hashing. This alone however, is no guarantee for speed-optimised code. There
are many forms of hashing. There are static and dynamic flavours, true
constant time lookups, or 'almost' constant time lookups, etc. You can even
hardcode your hashing table if it's static.

  For those of you unfamiliar with the concept, allow me to explain. Hashing
is a process of generating keys for datatypes and using these keys to index
(almost) in constant time to the information we want. In this case:
datatype: string of bytes, key: integer, information: handler routine.

  We could discuss the entire topic of hashing, but that would take up a lot
of time and space. We'll limit the scope to static hashing of strings in a
hardcoded way, since this is easy to do in assembler.
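  To make the key idea concrete before we continue, here is a minimal sketch
of one possible key function. The add-and-mask scheme, the space terminator
and the table size of 16 are assumptions for illustration only:

        ; A possible key function: key=sum of characters mod table size.
        ; a0: string, terminated by a space
        ; returns d0.w=key
hashkey:
        moveq   #0,d0           ; d0.w=key accumulator
        moveq   #0,d1
keyloop:
        move.b  (a0)+,d1        ; Get next character.
        cmpi.b  #" ",d1
        beq.s   keydone         ; Space -> end of string.
        add.w   d1,d0           ; Sum the characters.
        bra.s   keyloop
keydone:
        andi.w  #16-1,d0        ; d0.w=key=sum mod table size
        rts

  Strings that end up with the same key still have to be compared against
the dictionary entries, of course; the key only narrows the search down.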
  If we take only the four strings of the above example, we get a pretty bad
impression of the algorithm. So keep in mind that it also works for dozens
of strings. The way we handle this is by using a tree structure. We use
jumptables for parts of a string.

handle_string:
        lea     jumptable,a1    ; a1: root jumptable
loop:   ; a1: jumptable of choice
        clr.l   d0
        move.b  (a0)+,d0        ; d0.b=character
        cmpi.b  #"A",d0
        blt.s   end
        cmpi.b  #"Z",d0
        bhi.s   end             ; d0<"A" | d0>"Z" -> not found
        subi.b  #"A",d0         ; d0.l=index
        jsr     (a1,d0.l*4)
        tst.l   d0
        beq     loop            ; d0=0 -> char is legal, loop
end:    ; d0.l= <0:not found, >=1:found
        rts

jumptable:
        bra.w   test_ass        ; a: 4 bytes
        bra.w   handle_b        ; b: 4 bytes
        moveq   #-1,d0          ; c: /
        rts                     ;    \ 4 bytes
        moveq   #-1,d0          ; d: /
        rts                     ;    \ 4 bytes
        ...

test_ass:
        cmpi.w  #"SS",(a0)+
        bne.s   no_ass
        lea     ass_jumptable,a1
        moveq   #0,d0
        rts
no_ass: moveq   #-1,d0
        rts

handle_b:
        lea     b_jumptable,a1
        moveq   #0,d0
        rts

ass_jumptable:
        moveq   #-1,d0          ; a
        rts
        moveq   #-1,d0          ; b
        rts
        ...
        bra.w   test_asshole    ; h
        ...
        bra.w   test_asswipe    ; w
        ...

test_asshole:
        cmpi.l  #"HOLE",(a0)+
        bne.s   no_asshole
        ; asshole handling here..
        moveq   #1,d0           ; Found.
        rts
no_asshole:
        moveq   #-1,d0          ; Not found.
        rts

test_asswipe:
        cmpi.l  #"WIPE",(a0)+
        bne.s   no_asswipe
        ; asswipe handling here..
        moveq   #1,d0           ; Found.
        rts
no_asswipe:
        moveq   #-1,d0          ; Not found.
        rts

b_jumptable:
        moveq   #-1,d0          ; a
        rts
        moveq   #-1,d0          ; b
        rts
        ...
        bra.w   test_bitch      ; i
        ...
        bra.w   test_butthole   ; u
        ...

test_bitch:
        cmpi.l  #"ITCH",(a0)+
        bne.s   no_bitch
        ; bitch handling here..
        moveq   #1,d0           ; Found.
        rts
no_bitch:
        moveq   #-1,d0          ; Not found.
        rts

test_butthole:
        cmpi.l  #"UTTH",(a0)+
        bne.s   no_butthole
        move.l  (a0)+,d0
        move.b  #" ",d0
        cmpi.l  #"OLE ",d0
        bne.s   no_butthole
        ; butthole handling here..
        moveq   #1,d0           ; Found.
        rts
no_butthole:
        moveq   #-1,d0          ; Not found.
        rts

  Well.. that's about it. Have fun optimising!

- earx ---------------------------------------------------------------- 2002 -