                              - earx presents -

                      -+------------------------------+-
                     --| some more optimisation crap! |--
                      -+------------------------------+-


- introduction ---------------------------------------------------------------

  I hope you have some 68k and general coding knowledge before you start!
Also, a lot of the examples probably won't work on a plain 68000, because of
word access on odd addresses.

- unrolling and using chunks --------------------------------------------------

  Definitely a must on the older 68k's, and not entirely useless even on the
newer 68060: unrolling a loop is still something that every coder should
know. As you can see from the following example, even a trivial task such as
copying bytes repeatedly carries overhead. We use bytes here. Yes, it could
have been words as well, but it's just to give an impression.

        ; d0.w=number of bytes to move -1
loop:   move.b  (a0)+,(a1)+     ; Copy a byte.
        dbf     d0,loop

  Of course the 'dbf' instruction takes some cycles. This is not as much as
the 'move' costs, but it is still considerable. To make things worse there is
a second, more implicit form of overhead. You see, a byte 'move' takes up as
much time as an aligned word move. (*) So generally speaking we're twice as
slow as we'd want, or even slower, as we will see later on.

  (*): this hasn't been verified for 040 or 060!

  The obvious way to speed up this situation is using word moves. In this
case we have to take care of the one remaining byte move, like so:

        ; d0.w=number of bytes to move
        lsr.w   #1,d0           ; d0.w=number of words to move
        bcc.s   amount_even     ; Number was even -> skip byte!
        move.b  (a0)+,(a1)+     ; Copy a byte.
amount_even:
        ; d0.w=number of words to move
        subq.w  #1,d0           ; d0.w=number of words to move -1
        bmi.s   end             ; No words -> end!
loop:   move.w  (a0)+,(a1)+     ; Copy a word.
        dbf     d0,loop
end:

  This is faster. Moving two bytes in one go is good. Moving four in one go
is even better:

        ; d0.w=number of bytes to move
        lsr.w   #1,d0           ; d0.w=number of words to move
        bcc.s   amount_even     ; Number was even -> skip byte!
        move.b  (a0)+,(a1)+     ; Copy a byte.
amount_even:
        ; d0.w=number of words to move
        lsr.w   #1,d0           ; d0.w=number of longs to move
        bcc.s   amount_longeven ; Number was even -> skip word!
        move.w  (a0)+,(a1)+     ; Copy a word.
amount_longeven:
        ; d0.w=number of longs to move
        subq.w  #1,d0           ; d0.w=number of longs to move -1
        bmi.s   end             ; No longs -> end!
loop:   move.l  (a0)+,(a1)+     ; Copy a long.
        dbf     d0,loop
end:

  Better still. This method has a drawback however: if we look closer at the
68k databus we see a hazard. The databus is 16 or 32 bits wide. This means
the processor works on 16bit (even) or 32bit (longeven) aligned addresses.
So accessing a word/longword at an odd address actually implies two separate
accesses. This is all transparent to the coder. And it's also slower...
Let's see how we can avoid this bummer. There are four situations:

  a) source even, destination even
  b) source even, destination odd
  c) source odd,  destination even
  d) source odd,  destination odd

  Situation a is trivial. Situation d needs an extra byte move up front to
compensate for the oddness. For situations b and c there are remedies as
well. Let's assume situation b here:

        ; d7.w=number of words to copy -1
        moveq   #0,d1
        move.w  (a0)+,d1        ; d1.l=$0000aabb
        ror.w   #8,d1           ; d1.l=$0000bbaa
        move.b  d1,(a1)+        ; Store $aa (first byte).
        lsl.l   #8,d1           ; d1.l=$00bbaa00, $bb is pending
loop:   move.w  (a0)+,d1        ; d1.l=$00bbccdd
        ror.l   #8,d1           ; d1.l=$dd00bbcc
        move.w  d1,(a1)+        ; Store word $bbcc.
        lsr.l   #8,d1           ; d1.l=$00dd00bb, $dd is pending
        dbf     d7,loop

  As you can see this requires shifts, and therefore it's useless on a
68000, which has no barrel shifter. The 68020 and up have one, though. Also,
we need a lot of combining to account for all of these particularities. No
one said it would be easy ;)

  Let's neglect the technical details for a while and continue. Now we only
have one branch instruction on average per two bytes. It can be far, far
less. We could unroll the whole word loop, but besides taking up a possibly
very large instruction space, this doesn't help on machines with an
instruction cache...
  The best alternative is to unroll to a more humbly sized loop that will
fit in your cpu's cache. The jumptree loop method springs to mind. For
convenience we use a situation where we copy words only:

        ; d0.w=number of words to copy
        move.w  d0,d1
        andi.w  #32-1,d0        ; d0.w=number of words mod 32
        lsr.w   #5,d1           ; d1.w=/32=number of loops
        neg.w   d0
        jmp     endjumptree(pc,d0.w*2)  ; Jump into loop.
loop:   REPT    32
        move.w  (a0)+,(a1)+
        ENDR
endjumptree:
        dbf     d1,loop

  This will fit into a 030's cache at least. And that about covers the byte
copy issue. Unrolling techniques are of course useful for a lot of other
situations. And it seems that for all operations there is a good unrolling
solution. Let's take a linear interpolation example (used for gouraud
shading, for instance). This illustrates that even operations that have a
size different from {2,4,8,16,..} bytes are easy to handle:

        ; d0.l,d1.l=fixed point index and step, in swapped format:
        ;           fraction in the high word, integer in the low word
        ;           (make sure X is clear on entry)
        ; d7.w=number of words to process
        ; a0: destination array
        ; a1: source shading array
        lea     jumpCountTable,a2
        move.l  (a2,d7.w*4),d7  ; Get offset and loops.
        move.w  d7,d2           ; d2.w=number of loops
        swap    d7              ; d7.w=offset into loop
        jmp     endjumptree(pc,d7.w)    ; Jump into loop.
loop:   REPT    16
        move.w  (a1,d0.w*2),(a0)+
        addx.l  d1,d0           ; Step. X feeds the fraction carry
                                ; into the integer part next time.
        ENDR
endjumptree:
        dbf     d2,loop

  Of course we need some additional code to prepare the lookup table:

        lea     jumpCountTable,a0
        clr.l   d0
        move.w  #MAX_WORDS-1,d7
loop:   move.l  d0,d1
        divu.w  #16,d1          ; Divide by unroll amount.
                                ; d1.l=remainder|quotient
        move.l  d1,d2
        swap    d2              ; d2.w=remainder
        mulu.w  #6,d2           ; d2.w=remainder*opsize
        neg.w   d2              ; d2.w=offset (negative, back into loop)
        swap    d2
        move.w  d1,d2           ; d2.l=offset|quotient
        move.l  d2,(a0)+
        addq.w  #1,d0
        dbf     d7,loop

  This is the fastest way on a 020/030/040. On a 68000 it might be faster to
unroll completely, if that is possible. On a 68060 it might not be required
to unroll at all. One last note: of course the fastest way to copy longwords
is still a movem.l:

        movem.l (a5),d0-a4
        movem.l d0-a4,(a6)

  From the 020 on it's not a must, because of the fast instruction cache.
That about wraps it up for this part.
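  As a footnote to the movem.l remark: a minimal sketch of a complete copy
loop built around it. The chunk size of 44 bytes (11 registers) and the
register choice are assumptions for illustration only. Note that movem.l to
memory has no postincrement mode, hence the lea:

        ; Copy d7.w+1 chunks of 44 bytes from (a5) to (a6).
        ; d7.w=number of 44 byte chunks -1
loop:   movem.l (a5)+,d0-d6/a0-a3       ; Read 44 bytes.
        movem.l d0-d6/a0-a3,(a6)        ; Write 44 bytes.
        lea     44(a6),a6               ; No postincrement on movem writes.
        dbf     d7,loop

  Any remainder that is not a multiple of 44 bytes still has to be handled
with one of the smaller loops from above.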
- conditional branches --------------------------------------------------------

  Selection is a process that has its ugly sides. Think of pipeline stalls
in the 040/060 when a conditional branch happens. Or using a lot of cases,
which can reduce code to a bowl of spaghetti. Of course conditional branches
are inevitable, but are they required everywhere we use the process of
selection? The answer is: no! Often there is a solution that takes less
work, gives a better overview, and mostly is faster.

  A good start is a clipping problem. A straightforward example would be:

        tst.b   d0
        bpl.s   clipped         ; d0>=0 -> no clip
        clr.b   d0              ; Clip to zero.
clipped:

  The core of the code being the bcc instruction, of course. Now this might
seem like a detour, but a reasonable approach is the following:

        tst.b   d0
        spl     d1              ; d0>=0 -> d1.b=$FF else d1.b=0
        and.b   d1,d0           ; Mask out d0, if d0 was negative.

  Seems like a waste of registers, but think about being freed from pipeline
stalls. Also, you don't need labels. You can make bugs with those if you
just copy a piece of code and forget to update the branch label.

  Another example is using the carry bit instructions. Instead of working
clumsily like so...

        ; a := a + b
        ; d0.l=b.lo, d1.l=a.lo, d2.l=b.hi, d3.l=a.hi
        add.l   d0,d1
        bcc.s   carry_done
        addq.l  #1,d3           ; Perform carry-add.
carry_done:
        add.l   d2,d3

  ... we do it like this:

        add.l   d0,d1
        addx.l  d2,d3

  A well known and often used application, no doubt. Instructions like subx
and roxr/roxl also have their uses.

  Routines that handle cases mostly suffer from a huge amount of compare and
branch combinations. We know how this looks, but just for the purpose of
illustration:

        cmpi.w  #0,d0
        beq     case0
        cmpi.w  #1,d0
        beq     case1
        cmpi.w  #2,d0
        beq     case2
        cmpi.w  #3,d0
        beq     case3
        cmpi.w  #4,d0
        beq     case4
        ...

  As we know, there is a more sensible approach:

        jsr     jumptable(pc,d0.w*4)

jumptable:
        bra.w   case0
        bra.w   case1
        bra.w   case2
        bra.w   case3
        bra.w   case4
        ...

  So a constant time solution is possible.
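  Since subx was mentioned above: the 64bit subtract is the exact mirror of
the addx example, assuming the same register layout:

        ; a := a - b
        ; d0.l=b.lo, d1.l=a.lo, d2.l=b.hi, d3.l=a.hi
        sub.l   d0,d1           ; Low halves; X holds the borrow.
        subx.l  d2,d3           ; High halves minus borrow.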
  This jumptable approach works nicely on integers, but what about more
complex datatypes? Strings are a good example. They consist of a variable
amount of primitive datatypes (bytes). Let's say we just use such a
case-handler on a small number of strings:

        cmpi.l  #"ASSW",(a0)
        bne.s   no_asswipe_found
        move.l  4(a0),d0
        move.b  #" ",d0
        cmpi.l  #"IPE ",d0
        beq     handle_asswipe
no_asswipe_found:
        cmpi.l  #"ASSH",(a0)
        bne.s   no_asshole_found
        move.l  4(a0),d0
        move.b  #" ",d0
        cmpi.l  #"OLE ",d0
        beq     handle_asshole
no_asshole_found:
        cmpi.l  #"BUTT",(a0)
        bne.s   no_buttmunch_found
        cmpi.l  #"MUNC",4(a0)
        bne.s   no_buttmunch_found
        cmpi.b  #"H",8(a0)
        beq     handle_buttmunch
no_buttmunch_found:
        cmpi.l  #"BITC",(a0)
        bne.s   no_bitch_found
        cmpi.b  #"H",4(a0)
        beq     handle_bitch
no_bitch_found:
        ...

handle_asswipe:
        ...
handle_asshole:
        ...
handle_buttmunch:
        ...
handle_bitch:
        ...

  As we can see, there are four cases. Since the string sizes differ, we
can't use the good old jumptable approach. And doing the searching (a linear
search) like above does take some time. Especially when we consider a lot of
possible strings in our 'dictionary'. How on earth are we going to optimise
this little baby then?

  The answer is found in any decent book on algorithms and is called
hashing. This alone however, is no guarantee for speed-optimised code. There
are many forms of hashing. There are static and dynamic flavours, true
constant time lookups, or 'almost' constant time lookups, etc. You can even
hardcode your hashing table if it's static.

  For those of you unfamiliar with the concept, allow me to explain. Hashing
is a process of generating keys for datatypes and using these keys to index
(almost) in constant time to the information we want. In this case:
datatype: string of bytes, key: integer, information: handler routine.

  We could discuss the entire topic of hashing, but that would take up a lot
of time and space. We'll limit the scope to static hashing of strings in a
hardcoded way, since this is easy to do in assembler.
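  To make the key idea concrete before we continue, here is a minimal sketch
of one possible key function. The add-and-mask scheme, the space terminator
and the table size of 16 are assumptions for illustration only:

        ; A possible key function: key=sum of characters mod table size.
        ; a0: string, terminated by a space
        ; returns d0.w=key
hashkey:
        moveq   #0,d0           ; d0.w=key accumulator
        moveq   #0,d1
keyloop:
        move.b  (a0)+,d1        ; Get next character.
        cmpi.b  #" ",d1
        beq.s   keydone         ; Space -> end of string.
        add.w   d1,d0           ; Sum the characters.
        bra.s   keyloop
keydone:
        andi.w  #16-1,d0        ; d0.w=key=sum mod table size
        rts

  Strings that end up with the same key still have to be compared against
the dictionary entries, of course; the key only narrows the search down.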
  If we take only the four strings of the above example, we get a pretty bad
impression of the algorithm. So keep in mind that it also works for dozens
of strings. The way we handle this is by using a tree structure. We use
jumptables for parts of a string.

handle_string:
        lea     jumptable,a1    ; a1: root jumptable
loop:   ; a1: jumptable of choice
        clr.l   d0
        move.b  (a0)+,d0        ; d0.b=character
        cmpi.b  #"A",d0
        blt.s   end
        cmpi.b  #"Z",d0
        bhi.s   end             ; d0<"A" | d0>"Z" -> not found
        subi.b  #"A",d0         ; d0.l=index
        jsr     (a1,d0.l*4)
        tst.l   d0
        beq     loop            ; d0=0 -> char is legal, loop
end:    ; d0.l= <0:not found, >=1:found
        rts

jumptable:
        bra.w   test_ass        ; a: 4 bytes
        bra.w   handle_b        ; b: 4 bytes
        moveq   #-1,d0          ; c: /
        rts                     ;    \ 4 bytes
        moveq   #-1,d0          ; d: /
        rts                     ;    \ 4 bytes
        ...

test_ass:
        cmpi.w  #"SS",(a0)+
        bne.s   no_ass
        lea     ass_jumptable,a1
        moveq   #0,d0
        rts
no_ass: moveq   #-1,d0
        rts

handle_b:
        lea     b_jumptable,a1
        moveq   #0,d0
        rts

ass_jumptable:
        moveq   #-1,d0          ; a
        rts
        moveq   #-1,d0          ; b
        rts
        ...
        bra.w   test_asshole    ; h
        ...
        bra.w   test_asswipe    ; w
        ...

test_asshole:
        cmpi.l  #"HOLE",(a0)+
        bne.s   no_asshole
        ; asshole handling here..
        moveq   #1,d0           ; Found.
        rts
no_asshole:
        moveq   #-1,d0          ; Not found.
        rts

test_asswipe:
        cmpi.l  #"WIPE",(a0)+
        bne.s   no_asswipe
        ; asswipe handling here..
        moveq   #1,d0           ; Found.
        rts
no_asswipe:
        moveq   #-1,d0          ; Not found.
        rts

b_jumptable:
        moveq   #-1,d0          ; a
        rts
        moveq   #-1,d0          ; b
        rts
        ...
        bra.w   test_bitch      ; i
        ...
        bra.w   test_butthole   ; u
        ...

test_bitch:
        cmpi.l  #"ITCH",(a0)+
        bne.s   no_bitch
        ; bitch handling here..
        moveq   #1,d0           ; Found.
        rts
no_bitch:
        moveq   #-1,d0          ; Not found.
        rts

test_butthole:
        cmpi.l  #"UTTH",(a0)+
        bne.s   no_butthole
        move.l  (a0)+,d0
        move.b  #" ",d0
        cmpi.l  #"OLE ",d0
        bne.s   no_butthole
        ; butthole handling here..
        moveq   #1,d0           ; Found.
        rts
no_butthole:
        moveq   #-1,d0          ; Not found.
        rts

  Well.. that's about it. Have fun optimising!

- earx ---------------------------------------------------------------- 2002 -