Comparison between the M68030 and M68060 (Part 1)
0. INTRODUCTION
================
Everybody knows that a Falcon with a ct60 board is faster than a
standard Falcon030. We even know by how much, as Rodolphe Czuba
(the famous developer of the ct60) states on his web page
( http://www.czuba-tech.com/CT60/english/overview.htm ):
CPU                                      MIPS    Factor
68060 @ 100 MHz (CT60)                   147.00  25.5
68060 @ 66 MHz (CT60)                     96.80  16.8
68030 @ 16 MHz (Falcon / Motorola 2001)    5.76   1.0
Some people may have made the following calculation:
66 MHz / 16 MHz = 4.125
So a 030 clocked at 66 MHz should deliver:
5.76 MIPS * 4.125 = 23.76 MIPS
Finally, at the same frequency (66 MHz) a 060 delivers:
96.80 MIPS / 23.76 MIPS = 4.07 times more MIPS.
Note: The above calculation is intended to be a motivation and
not an analysis.
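The back-of-the-envelope calculation above can be replayed in a
few lines of Python. This is only a sketch: the MIPS figures are
the ones quoted from the CT60 page, and the linear scaling of the
030 with the clock frequency is the same naive assumption as
above, not a measurement. The function name is made up for this
illustration.

```python
# MIPS figures as quoted above from the CT60 overview page.
MIPS_030_AT_16 = 5.76
MIPS_060_AT_66 = 96.80

def scale_mips(mips, from_mhz, to_mhz):
    """Naively assume that MIPS grows linearly with the clock frequency."""
    return mips * (to_mhz / from_mhz)

clock_ratio = 66 / 16
mips_030_at_66 = scale_mips(MIPS_030_AT_16, 16, 66)   # hypothetical 66 MHz 030
speedup = MIPS_060_AT_66 / mips_030_at_66

print(f"clock ratio : {clock_ratio:.3f}")
print(f"030 @ 66 MHz: {mips_030_at_66:.2f} MIPS (hypothetical)")
print(f"060 vs 030  : {speedup:.2f}x at the same clock")
```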
It seems like there is something that makes the 060 four times
faster than an equally clocked 030. This fact is not surprising.
The 68030 was released in 1987 and the 68060 in 1994. And it
should be no secret that technology doesn't stand still in seven
years :)
Those who are curious how a speedup of four can be achieved are
welcome to learn more in this article. Those who are used to
coding for the 030 and now want to get the most performance out
of the 060 may find some interesting information here as well.
My intention for this article is to describe the architecture of
the 68060 and compare it to the 68030. I am well aware that I am
not the first to "discover" the 060. Just think of those AMIGA
coders who have had 060 accelerator cards for a long time and
know every possible trick. But maybe this article helps you to
understand why some of the things you have heard about the 060
are the way they are. Things like:
- "Align your code and data to 16 byte boundaries."
- "060 has 0-cycle branches."
- "060 is superscalar"
I assume there are at least two kinds of readers. The first kind
is familiar with CPU design but doesn't know the exact features
of the 060. The second kind has no idea at all. For the latter I
have included some hopefully good explanations of important
features found in today's CPUs. This makes the whole article
rather lengthy, at least the parts written for the technically
unaware. The techies may skip those sections :)
1. INSTRUCTIONS IN THE PIPELINE
===============================
An important parameter for the CPU designer is the number of
Cycles Per Instruction (CPI). This figure tells you how many
clock cycles a CPU needs to execute an instruction. Some
examples for the 030 and the 060:
#  Instruction              68030  68060
1  move.l d0, d1                2      1
2  add.l d0, d1                 2      1
3  lsl.l #6, d0                 2      1
4  mulu.w d0, d1               38      2
5  divu.w d0, d1               78     36
6  add.l (a0)+,d0               5      1
7  add.l (a0,d7.w*2),d0         8      1
Although the CPI varies from instruction to instruction, it can
be said that the 060 has a CPI of 1 for most instructions.
Instructions which take more cycles are special ones like ReTurn
from Subroutine (RTS) or the DIV and MUL instructions, but
compared to the 030 they are still fast. Now let's have a look
at lines 2, 6 and 7. On the 030 side you should read the cycle
counts in lines 6 and 7 as "2+3" and "2+6" respectively or, a
bit more formally, as "I+A".
I: cycles to execute the operation
A: cycles needed to calculate address of operand.
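The "I+A" reading can be written down as a tiny Python model.
The split of the 030 cycle counts into I and A is taken from the
table and the text above; the function name and the dictionary
layout are made up for this sketch.

```python
# 68030 cycle counts from the table above, split into
# I (execute the operation) and A (calculate the operand address).
CYCLES_030 = {
    "add.l d0,d1":          (2, 0),   # line 2: registers only, no A part
    "add.l (a0)+,d0":       (2, 3),   # line 6: read as "2+3"
    "add.l (a0,d7.w*2),d0": (2, 6),   # line 7: read as "2+6"
}

def total_cycles_030(instruction):
    """Return I + A for one of the instructions listed above."""
    i_cycles, a_cycles = CYCLES_030[instruction]
    return i_cycles + a_cycles

for name in CYCLES_030:
    print(f"{name:22s} 030: {total_cycles_030(name)} cycles, 060: 1 cycle")
```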
Q: How come the 060 needs 1 cycle regardless of the addressing
mode?
A: It uses pipelining.
Pipelining is used in many areas of computer science, so we
start with a
DEFINITION:
-----------
Suppose a task A is divisible into a series of smaller tasks
A_1, A_2,..., A_n, so that executing the series A_1 to A_n has
the same effect as executing A. If the subtasks A_1 to A_n can
be executed in parallel, each one working on a different
instance of A, then we talk about PIPELINING.
EXAMPLE:
--------
Consider the task "Building a car". You can have a lot of
workers in a garage who put car body, tyres, windows and so on
together to build a car. We assume that every worker needs one
hour to do his job and that the workers do their jobs one after
another. With 5 workers a car is produced every 5 hours.
Now picture an assembly line. After the first worker has fixed
the car body (after one hour) and the car has moved on to the
2nd worker, the first worker can start with the body of the next
car, and so on. It still takes 5 hours to build the FIRST car,
but after that a car is produced every hour.
With the garage you need 20 hours to build 4 cars. With the
assembly line you need just 5+3 = 8 hours.
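The numbers of the car example follow from two small formulas,
sketched here in Python. The function names are invented for
this illustration, and the model assumes, as the example does,
that every step takes exactly one hour.

```python
def garage_hours(cars, steps, hours_per_step=1):
    """One car at a time: a car passes all steps before the next one starts."""
    return cars * steps * hours_per_step

def assembly_line_hours(cars, steps, hours_per_step=1):
    """Pipelined: the first car needs all steps, each further car adds one step time."""
    if cars == 0:
        return 0
    return (steps + (cars - 1)) * hours_per_step

for cars in (1, 4, 100):
    print(cars, "cars:",
          garage_hours(cars, 5), "h in the garage,",
          assembly_line_hours(cars, 5), "h on the assembly line")
```

The more cars are built, the closer the gain gets to the number
of steps, which is 5 in this example.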
The 68060 Pipeline
------------------
A first (not yet complete) look at the pipeline of the 060:
|-----------------------|
| |
| DECODE |
| |
| |
| EA CALC |
| |
| |
| EA FETCH |
| |
| |
| EXECUTE |
| |
|-----------------------|
Note: EA is a common abbreviation for Effective Address.
In the following I describe the subtasks which are executed in
the phases shown above. To illustrate them, one example will
accompany us: the ADD instruction. An instruction is presented
to the CPU as a 16-bit instruction word (possibly followed by
extension words). The ADD instruction word looks like this:
|1101| REGISTER | OPMODE | EA |
Where "1101" is a four-bit code which is used by the DECODE
phase to recognize the instruction (type). REGISTER is a
three-bit field which selects the data register used in the
instruction. OPMODE describes whether the register is the source
or the destination operand and which operand size is used. EA is
the code for the second operand. This could be as simple as
another register or a (complex) address in memory. We will
consider the following instruction:
ADD d3, 25(a0,d1*2)
This instruction adds the value which is inside the register d3
to the value in memory at the specified address.
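To make the bit fields concrete, here is a small Python sketch
which splits the first 16-bit word of an ADD instruction into
its fields. The field widths (4, 3, 3 and 6 bits) follow the 68k
instruction format. The function name and the sample word 0xD770
are assumptions of this sketch: 0xD770 is what the word-sized
form of the example should assemble to. The extension word which
carries the displacement 25 and the index register is not
decoded here.

```python
def decode_add_word(word):
    """Split the first 16-bit word of an ADD instruction into its fields."""
    opcode   = (word >> 12) & 0xF   # bits 15-12: 1101 for ADD
    register = (word >> 9) & 0x7    # bits 11-9 : data register number
    opmode   = (word >> 6) & 0x7    # bits 8-6  : direction and operand size
    ea_mode  = (word >> 3) & 0x7    # bits 5-3  : addressing mode
    ea_reg   = word & 0x7           # bits 2-0  : register used by the EA
    if opcode != 0b1101:
        raise ValueError("not an ADD instruction word")
    return {
        "register": register,
        "opmode": opmode,
        # opmodes 4, 5 and 6 mean: the data register is the source
        # and the result is written to the effective address.
        "register_is_source": opmode in (4, 5, 6),
        "ea_mode": ea_mode,
        "ea_reg": ea_reg,
    }

# Assumed encoding of ADD.W d3,25(a0,d1*2), first word only.
print(decode_add_word(0xD770))
```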
Now the description of the phases:
DECODE: in this phase the CPU recognizes the instruction and
starts transferring data from registers to the ALU.
EA CALC: the addresses of the operands are calculated in this
phase. The CPU has extra hardware for these rather simple
calculations. This is needed for the pipelining; otherwise the
ALU of the CPU could be used, but in the pipeline the ALU is
busy with another instruction at that time. Considering the
example, the value "25" and the contents of d1 (multiplied by 2)
are added to the contents of the register a0.
EA FETCH: data from the memory is fetched in this phase.
Considering the example, the value from memory is fetched at the
address calculated by the EA CALC phase and passed to the ALU.
EXECUTE: after all data is present at the ALU, the instruction
can finally be executed.
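The four phases can be retraced for the example instruction with
a toy model in Python. The register contents, the memory content
and the helper name ea_calc are made up for this sketch; it only
shows which data each phase deals with, not how the hardware
does it.

```python
# Hypothetical machine state, chosen only for this walk-through.
regs = {"d1": 4, "d3": 10, "a0": 0x1000}
memory = {0x1000 + 4 * 2 + 25: 32}   # one memory cell at the operand address

def ea_calc(base, index, scale, displacement):
    """EA CALC: base register + scaled index register + displacement."""
    return base + index * scale + displacement

# DECODE: recognise the ADD and read the register operand.
source = regs["d3"]
# EA CALC: work out the address for 25(a0,d1*2).
address = ea_calc(regs["a0"], regs["d1"], 2, 25)
# EA FETCH: read the memory operand from that address.
operand = memory[address]
# EXECUTE: add both values and write the sum back to memory.
memory[address] = operand + source

print(hex(address), memory[address])
```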
In short, an instruction's execution time is independent of the
addressing mode because the address calculation and the fetching
of the data are done while other instructions are in the
execution phase. It should be noted that the slowest phase is
crucial for the overall performance. If any phase of the
pipeline needs more than one cycle, the instruction needs more
than one cycle as well.
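How the slowest phase limits the throughput can be seen in a
small Python simulation. It is a strongly simplified model and
not a cycle-exact 68060 emulation: instructions pass the four
phases strictly in order, a phase holds only one instruction at
a time, and the per-phase cycle counts are invented for the
demonstration. The function name is made up as well.

```python
STAGES = ("DECODE", "EA CALC", "EA FETCH", "EXECUTE")

def run_pipeline(program):
    """Return the cycle in which each instruction leaves the last phase.

    `program` is a list of tuples; each tuple holds the cycles one
    instruction spends in each of the four phases.
    """
    n = len(STAGES)
    prev_leave = [0] * n      # when the previous instruction left each phase
    finished = []
    for latencies in program:
        leave = [0] * n
        ready = 0             # when this instruction wants to enter the next phase
        for s in range(n):
            enter = max(ready, prev_leave[s])   # wait until the phase is empty
            if s > 0:
                leave[s - 1] = enter            # the phase before is freed only now
            ready = enter + latencies[s]
        leave[n - 1] = ready
        finished.append(ready)
        prev_leave = leave
    return finished

print(run_pipeline([(1, 1, 1, 1)] * 4))   # every phase takes one cycle
print(run_pipeline([(1, 2, 1, 1)] * 4))   # EA CALC takes two cycles
```

With one cycle per phase the four instructions finish in cycles
4, 5, 6 and 7, just like the cars on the assembly line. With two
cycles in EA CALC only every second cycle an instruction is
finished, although the other phases are as fast as before.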
The pipeline of the 68060 is described above, but what about the
68030? The 030 has a simple pipelined architecture, but only for
the decoding. The decoding is split up into three steps which
are executed in parallel. The address calculation and the actual
execution of the instruction are not pipelined.
Summary and outlook
===================
To sum up, the pipeline is one key feature which makes the 68060
faster than the 68030. As I stated above, the 060 has a CPI
equal to 1 for most of the standard instructions because of the
instruction pipeline. But this is only true if a few assumptions
are made!
First, when data is to be fetched from memory, the CPU has to
wait for the data (wait states). Second, what if we change the
flow of instructions? Keep in mind that the advantage of the
pipeline is that we execute parts of one instruction in parallel
with other parts of the instructions preceding it. Of course
there are solutions for those problems, namely "Cache Memory"
and "Branch Prediction", which we will discuss in the next part
of this article. After that I will tell you how it is possible
to get the CPI below 1! The feature making this possible is
called "Superscalarity".
Creature XL for Alive, 2005-12-22