Alive 12
Comparison between the M68030 and M68060 (Part 1)
        Everybody knows that a Falcon with a ct60 board is faster than a 
        standard Falcon030. We know even more, as Rodolphe Czuba (the 
        famous developer of the ct60) states on his webpage:
           CPU                                       MIPS       Factor
           68060 @ 100 MHz (CT60)                  147.00         25.5
           68060 @ 66 MHz (CT60)                    96.80         16.8
           68030 @ 16 MHz (Falcon / Motorola 2001)   5.76          1.0
        Some people may have made the following calculation:
           66 MHz / 16 MHz          = 4.125
        So a 030 at 66 MHz should reach:
           5.76 MIPS * 4.125        = 23.76 MIPS
        Finally, at the same frequency (66 MHz) a 060 has:
           96.80 MIPS / 23.76 MIPS  = 4.07 times more MIPS.
        Note: The above calculation is intended to be a motivation and 
              not an analysis.
        It seems like there is something that makes the 060 four times 
        faster than an equally clocked 030. This fact is not surprising. 
        The 68030 was released in 1987 and the 68060 in 1994. And it 
        should be no secret that technology doesn't stand still in seven 
        years :)
        Those who are curious how a speedup of four can be achieved are 
        welcome to learn more in this article. Those who are used to 
        coding the 030 and now want to get the most performance out of 
        the 060 may find some interesting information in this article 
        as well. My intention is to describe the architecture of the 
        68060 and compare it to the 68030. I totally understand that I 
        am not the first who "discovered" the 060. Just think about 
        those AMIGA coders who have had a 060 speeder card for a long 
        time and know every possible trick. But maybe this article 
        helps you to understand why some things you have heard about 
        the 060 are the way they are. Things like:
            - "Align your code and data to 16 byte boundaries."
            - "060 has 0-cycle branches."
            - "060 is superscalar"
        I assume there are at least two kinds of readers. The first 
        kind are familiar with CPU design, but don't know the exact 
        features of the 060. The second kind are those who have no idea 
        at all. For the latter I have included some hopefully good 
        explanations of important features which a CPU contains 
        nowadays. This makes the whole article rather lengthy, at least 
        for the technically aware. The techies may skip those 
        sections :)
        An important parameter for the CPU designer is the number of 
        Cycles Per Instruction (CPI). This figure tells you how many 
        clock cycles a CPU needs to execute an instruction. Some 
        examples for the 030 and 060 (in clock cycles):
          Instruction                         68030       68060
        1   move.l      d0, d1                  2           1
        2   add.l       d0, d1                  2           1
        3   lsl.l       #6, d0                  2           1
        4   mulu.w      d0, d1                 38           2
        5   divu.w      d0, d1                 78          36
        6   add.l       (a0)+,d0                5           1
        7   add.l       (a0,d7.w*2),d0          8           1
        Although the CPI varies from instruction to instruction, it is 
        fair to say that the 060 has CPI = 1 for most instructions. 
        Instructions which take more cycles are special instructions 
        like ReTurn from Subroutine (RTS) or the DIV and MUL 
        instructions, but compared to the 030 they are still fast. Now 
        let's have a look at lines 2, 6 and 7. On the 030 side you 
        should read the cycle counts in lines 6 and 7 as "2+3" and 
        "2+6" respectively or, a bit more formally, as "I+A".
        I: cycles to execute the operation
        A: cycles needed to calculate address of operand.
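        To make the table a bit more tangible, here is a minimal Python 
        sketch which simply adds up the cycle counts from the table for 
        a short instruction sequence. The names CYCLES and total_cycles 
        as well as the sample sequence are made up for this 
        illustration, and the model ignores wait-states and any overlap 
        between instructions.

```python
# Cycle counts copied from the table above: (68030, 68060).
CYCLES = {
    "move.l d0,d1":         (2, 1),
    "add.l d0,d1":          (2, 1),
    "lsl.l #6,d0":          (2, 1),
    "mulu.w d0,d1":         (38, 2),
    "divu.w d0,d1":         (78, 36),
    "add.l (a0)+,d0":       (5, 1),   # 030: I + A = 2 + 3
    "add.l (a0,d7.w*2),d0": (8, 1),   # 030: I + A = 2 + 6
}


def total_cycles(sequence):
    """Return (cycles on the 030, cycles on the 060) for a list of instructions."""
    c030 = sum(CYCLES[ins][0] for ins in sequence)
    c060 = sum(CYCLES[ins][1] for ins in sequence)
    return c030, c060


if __name__ == "__main__":
    # Hypothetical sequence, chosen only to exercise the table.
    sequence = ["add.l (a0)+,d0", "add.l (a0,d7.w*2),d0", "lsl.l #6,d0"]
    c030, c060 = total_cycles(sequence)
    print(f"030: {c030} cycles, 060: {c060} cycles")
```

        The comparison is per clock cycle only; the difference in clock 
        frequency comes on top of that.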
        Q: How come that the 060 needs 1 cycle regardless of the
           addressing mode?
        A: It uses pipelining.
        Pipelining is used in a lot of areas of computer science, so 
        we start with a general definition.
        Suppose a task A is divisible into a series of smaller tasks 
        A_1, A_2,..., A_n, so that executing the series A_1 to A_n has 
        the same effect as executing A. If we have the possibility to 
        execute the subtasks A_1 to A_n in parallel, each one working 
        on a different instance of A, then we talk about pipelining.
        Consider the task "Building a car". You can have several 
        workers in a garage who put the car body, tyres, window panes 
        and so on together, one after another, to build a car. We 
        assume that every worker needs an hour to do his job. With 5 
        workers, a car is finished every 5 hours.
        Now picture an assembly line. After the first worker has fixed 
        the car body (after one hour) and the car has moved on to the 
        2nd worker, the first worker can start on the body of the next 
        car, and so on. It still takes 5 hours to build the FIRST car, 
        but after that a car is finished every hour.
        With the garage you need 20 hours to build 4 cars. With the 
        assembly line you need just 5+3 = 8 hours.
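        The arithmetic of the car example can be written down as a 
        small Python sketch. The function names are made up for this 
        illustration, and it assumes that every step takes the same 
        time and that nobody ever has to wait for parts.

```python
def garage_hours(cars, workers, hours_per_step=1):
    """No overlap: a car passes all workers before the next car is started."""
    return cars * workers * hours_per_step


def assembly_line_hours(cars, workers, hours_per_step=1):
    """Overlap: the first car needs all steps, every further car one more step."""
    if cars == 0:
        return 0
    return (workers + cars - 1) * hours_per_step


if __name__ == "__main__":
    print(garage_hours(4, 5))          # 4 cars, 5 workers, garage
    print(assembly_line_hours(4, 5))   # 4 cars, 5 workers, assembly line
```

        The more cars are built, the closer the assembly line gets to 
        one car per hour, because the 5 hours for filling the line are 
        paid only once.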
        The 68060 Pipeline
        A first (not yet complete) look at the pipeline of the 060:
        +-----------------------+
        |        DECODE         |
        +-----------------------+
        |        EA CALC        |
        +-----------------------+
        |        EA FETCH       |
        +-----------------------+
        |        EXECUTE        |
        +-----------------------+
        Note: EA is a common abbreviation for Effective Address.
        In the following, I describe the subtasks which are executed in 
        the phases noted above. To illustrate this a bit, an example 
        will accompany us: the ADD instruction. An instruction is 
        presented to the CPU as a 16-bit instruction word. The ADD 
        instruction word looks like this:
           |1101| REGISTER | OPMODE | EA |
        Here "1101" is a four-bit code which is used by the DECODE 
        phase to recognize the instruction (type). "register" is the 
        three-bit number of the data register used in the instruction. 
        "opmode" describes the operand size and whether the register 
        is the source or the destination operand. "ea" is the code for 
        the address of the second operand. This could be as simple as 
        another register or a (complex) address in memory. We will 
        consider the following instruction:
           ADD         d3, 25(a0,d1*2)
        This instruction adds the value which is inside the register d3 
        to the value in memory at the specified address.
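        The field layout shown above can be tried out with a few lines 
        of Python. This is only a sketch: decode_add is a made-up name, 
        the field widths (4, 3, 3 and 6 bits, with "ea" split into a 
        3-bit mode and a 3-bit register) follow Motorola's instruction 
        format as far as I know it, and the word 0xD770 assumes that 
        the example is assembled as a word-sized ADD. The extension 
        word which carries d1, the scale factor and the displacement 25 
        is not decoded here.

```python
def decode_add(word):
    """Split a 16-bit ADD instruction word into the fields shown above."""
    if (word >> 12) & 0xF != 0b1101:
        raise ValueError("not an ADD instruction word")
    opmode = (word >> 6) & 0b111
    if opmode in (0b011, 0b111):
        # These opmodes belong to ADDA; ADDX is ignored as well in this sketch.
        raise ValueError("ADDA is not handled in this sketch")
    return {
        "register": (word >> 9) & 0b111,     # number of the data register
        "opmode": opmode,                    # size and direction
        "ea_mode": (word >> 3) & 0b111,      # addressing mode
        "ea_reg": word & 0b111,              # register used by that mode
        "dest_is_ea": bool(opmode & 0b100),  # True: register + <ea> -> <ea>
    }


if __name__ == "__main__":
    # Assumed encoding of the example as a word-sized ADD: 1101 011 101 110 000
    print(decode_add(0xD770))
```

        For the assumed word 0xD770 the register field is 3 (d3), the 
        EA register is 0 (a0) and the destination is the memory 
        operand, which matches the instruction above.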
        Now the description of the phases:
        DECODE: in this phase the CPU recognizes the instruction and 
        starts transferring data from registers to the ALU.
        EA CALC: the addresses of the operands are calculated in this 
        phase. The CPU has extra hardware for these rather simple 
        calculations. This is needed for pipelining; otherwise the ALU 
        of the CPU could be used. In the example, the value "25" and 
        the contents of d1 (multiplied by 2) are added to the contents 
        of the register a0.
        EA FETCH: data from the memory is fetched in this phase. 
        Considering the example, the value from memory is fetched at the 
        address calculated by the EA CALC phase and passed to the ALU.
        EXECUTE: after all data is present at the ALU, the instruction 
        can finally be executed.
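        What the four phases do for the example instruction can be 
        followed step by step in this Python sketch. It is a plain 
        software model, not a description of the real hardware: the 
        function name, the register dictionary and the memory modelled 
        as a dictionary of 16-bit values are all assumptions made for 
        this illustration, and the phases run one after another here 
        instead of overlapping.

```python
def run_add_example(regs, memory):
    """Walk ADD d3, 25(a0,d1*2) through the four phases, one after another."""
    # DECODE: the instruction is recognised and the register operand is read.
    source = regs["d3"]
    # EA CALC: displacement + base register + scaled index register.
    address = 25 + regs["a0"] + regs["d1"] * 2
    # EA FETCH: the memory operand is read from the calculated address.
    operand = memory[address]
    # EXECUTE: both values are added and the result goes back to memory.
    memory[address] = (operand + source) & 0xFFFF  # assuming a 16-bit ADD
    return address


if __name__ == "__main__":
    regs = {"d3": 7, "a0": 1000, "d1": 10}   # made-up register contents
    memory = {1045: 100}                     # made-up memory contents
    address = run_add_example(regs, memory)
    print(address, memory[address])
```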
        In short, an instruction's execution time is independent of the 
        addressing mode because the address calculation and the 
        fetching of the data are done while other instructions are in 
        the execution phase. It should be noted that the slowest phase 
        is crucial for the overall performance: if any phase of the 
        pipeline needs more than one cycle, the instruction needs more 
        than one cycle as well.
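        The effect of a slow phase can be seen in a very simplified 
        timing model. The sketch below assumes four phases, strictly 
        in-order execution and that an instruction may wait between two 
        phases without blocking the one behind it; completion_times is 
        a made-up name and the cycle counts in the example are not 
        measured values.

```python
def completion_times(stage_cycles_per_instruction):
    """Return the cycle in which each instruction leaves the last phase.

    stage_cycles_per_instruction holds one list per instruction with the
    cycles spent in DECODE, EA CALC, EA FETCH and EXECUTE.
    """
    done = []      # done[s]: cycle in which the previous instruction left phase s
    results = []
    for stages in stage_cycles_per_instruction:
        t = 0
        new_done = []
        for s, cycles in enumerate(stages):
            stage_free = done[s] if done else 0
            # A phase starts when the instruction has arrived AND the phase is free.
            t = max(t, stage_free) + cycles
            new_done.append(t)
        done = new_done
        results.append(t)
    return results


if __name__ == "__main__":
    print(completion_times([[1, 1, 1, 1]] * 4))   # every phase takes one cycle
    print(completion_times([[1, 1, 2, 1]] * 3))   # EA FETCH takes two cycles
```

        With one cycle per phase an instruction is finished every 
        cycle once the pipeline is filled. As soon as one phase needs 
        two cycles, the instructions leave the pipeline two cycles 
        apart.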
        The pipeline of the 68060 is described above, but what about 
        the 68030? The 030 has a simple pipelined architecture, but 
        only for the decoding. The decoding is split up into three 
        steps which are executed in parallel. The address calculation 
        and the actual execution of the instruction are not pipelined.
        Summary and outlook
        To sum up, the pipeline is one key feature which makes the 
        68060 faster than the 68030. As I stated above, the 060 has a 
        CPI equal to 1 for most of the standard instructions because of 
        the instruction pipeline. But this is only true if a few 
        assumptions are made!
        First, when data has to be fetched from memory, the CPU has to 
        wait for the data (wait-states). Second, what happens if we 
        change the flow of instructions? Keep in mind that the 
        advantage of the pipeline is that we execute parts of one 
        instruction in parallel with other parts of the instructions 
        preceding it. Of course there are solutions for those problems, 
        namely "Cache Memory" and "Branch prediction", which we will 
        discuss in the next part of this article. After that I will 
        tell you how it is possible to get the CPI below 1! The feature 
        making this possible is called "Superscalarity".
Creature XL for Alive, 2005-12-22