CT60 BENCHMARKS
Introduction
------------
After years of waiting it has finally arrived (well, six months ago): *THE*
accelerator for our beloved Falcon! Elsewhere in this mag you can read CiH's
excellent article about the CT60. This is an additional text that aims to give
you an idea of how fast it is, plus a few ideas for coders.
Benchmarks
----------
Here is a NemBench dump from my CT60 at 66MHz with a 20MHz bus. The resolution
was 800x592 in mono, so the STRAM tests are not 100% accurate, but that's not
the important bit.
NemBench v2.1 - precision CPU/FPU profiler.
Integer multiply (16bit) -> 34.133 Mips (~5568%)
Integer divide (16bit) -> 3.029 Mips (~836%)
Linear (stalled) integer -> 66.601 Mips (~836%)
Interleaved (piped) integer -> 132.129 Mips (~1659%)
Float multiply (64bit) -> 22.755 MegaFlops (~8586%)
Float divide (64bit) -> 1.828 MegaFlops (~1056%)
Linear (stalled) float -> 32.768 MegaFlops (~6147%)
Interleaved (piped) float -> 32.768 MegaFlops (~6159%)
16bit read (100% hit) -> 131.578 MByte/sec (~1675%)
16bit write (100% hit) -> 131.578 MByte/sec (~2187%)
32bit read (100% hit) -> 263.157 MByte/sec (~1676%)
32bit write (100% hit) -> 263.157 MByte/sec (~3947%)
Linear 32bit read (ST-Ram) -> 6.313 MByte/sec (~118%)
Linear 32bit write (ST-Ram) -> 10.192 MByte/sec (~158%)
Linear 32bit copy (ST-Ram) -> 3.957 MByte/sec (~122%)
Linear 32bit read (FastRAM) -> 69.167 MByte/sec (~1301%)
Linear 32bit write (FastRAM) -> 68.089 MByte/sec (~1055%)
Linear 32bit copy (FastRAM) -> 27.193 MByte/sec (~842%)
Linear burst copy (ST-Ram) -> 3.868 MByte/sec (~119%)
Linear burst copy (FastRAM) -> 33.781 MByte/sec (~1046%)
Linear burst copy (ST->Fast) -> 6.012 MByte/sec (~186%)
Linear burst copy (Fast->ST) -> 8.668 MByte/sec (~268%)
The floating point figures are the most impressive: a floating point multiply
is about 85 (!) times faster on a 060 than on a 030 with a 68882. There is no
reason to stay away from the FPU anymore!
Memory-wise it is also very impressive: reading from TTRAM (listed as FastRAM
in the dump) at 69MB/s is twice as fast as the old CT2 and 13 times faster
than the slow STRAM of a standard Falcon!
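The "85 times" and "13 times" figures come straight from the percentages in
the dump. A small Python sketch of that arithmetic, assuming the percentage
NemBench prints is a score relative to its baseline machine with 100% meaning
equal speed (percent_to_factor is a made-up helper name):

```python
# Relative scores copied from the NemBench dump (percent of baseline).
scores_percent = {
    "Float multiply (64bit)": 8586,
    "Linear 32bit read (FastRAM)": 1301,
}


def percent_to_factor(percent):
    """Turn a 'percent of baseline' score into a plain speed-up factor."""
    return percent / 100.0


for name, percent in scores_percent.items():
    print(f"{name}: about {percent_to_factor(percent):.0f}x the baseline")
```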
Now, what does this *really* tell us about the speed of a CT60? Not very much
to be honest. NemBench consists of a series of very specific tests and doesn't
give us an overall picture of how well it performs.
This was the problem I had about a year ago when I was writing a CT60-specific
3D engine using ARAnyM. At the time, Didier Mequignon was the only person
(apart from hardware wizard Rodolphe himself) with a CT60.
There was simply no way of telling how fast it would be when it was finished. I
spent hours reading the available benchmarks and discussing the matter on IRC.
I decided to send Didier a small test program and asked him nicely if he could
run it on his machine and email me the output to use as a reference for my
engine.
This program was bench.prg and it displayed a few hundred texture-mapped
polygons for a few seconds while keeping track of the number of frames it
managed to render. When finished it would output this number to a file to
give me an idea of the speed. (Being a demo programmer I had to go for some
sort of demo effect!)
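The idea of such a frame-counting benchmark can be sketched in a few lines.
This is a minimal Python sketch and not the source of bench.prg: count_frames,
format_result and dummy_render are hypothetical names, the run length is an
assumption, and the real program rendered texture-mapped polygons and wrote
its result to a file instead of printing it.

```python
import time


def count_frames(render_frame, duration_s):
    """Call render_frame() repeatedly for duration_s seconds; return the count."""
    frames = 0
    deadline = time.perf_counter() + duration_s
    while time.perf_counter() < deadline:
        render_frame()
        frames += 1
    return frames


def format_result(frames, duration_s):
    """Build a one-line report of the kind that could be written to a file."""
    return f"{frames} frames in {duration_s:.1f} s ({frames / duration_s:.1f} fps)"


def dummy_render():
    """Hypothetical stand-in for the real polygon renderer."""
    sum(range(1000))


DURATION_S = 0.2  # assumed run length for this demo only
print(format_result(count_frames(dummy_render, DURATION_S), DURATION_S))
```

Counting whole frames over a fixed time, rather than timing a single frame,
averages out the cheap and expensive frames and gives one number that is easy
to compare between machines.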
This bench.prg gives you a better overall picture of speed, of how the machine
'feels' and how responsive it is. It also turned out to be a way of telling
whether the machine was working as expected or not. If you get distorted
graphics and/or flickery polygons then something is not right; the cause could
be the STRAM speed or a dodgy FPU.
When I first tested bench.prg on my machine I got 1136 frames (about 24 frames
per second), but the optimised SDRAM manager caused memory corruption and most
programs crashed straight away.
When I finally got it working the way I wanted, the result was 1211 with a
20MHz bus. For comparison, my CT2 managed 271 frames and my Afterburner040 +
Nemesis only 189 frames. About 4.5 times faster than a CT2 is not bad at all!
The poor performance of the 040 is mostly down to slow STRAM. I could have
limited the drawing area to get more accurate results from bench.prg but it
was only meant as a quick test anyway.
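For reference, the ratios between these bench.prg results work out as
follows. A small Python sketch; the frame counts are the ones quoted above,
the dictionary keys and the speedup helper are just illustrative names.

```python
# Frame counts reported by bench.prg, as quoted in the text.
frames = {
    "CT60 (66MHz, 20MHz bus)": 1211,
    "CT2": 271,
    "Afterburner040 + Nemesis": 189,
}


def speedup(fast, slow):
    """How many times more frames the faster machine rendered."""
    return fast / slow


ct60 = frames["CT60 (66MHz, 20MHz bus)"]
for name, count in frames.items():
    print(f"CT60 vs {name}: {speedup(ct60, count):.2f}x")
```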
Bottlenecks
-----------
As mentioned in the previous section, slow STRAM caused my old Afterburner040
to choke on bench.prg, and it really shows one of the main bottlenecks on the
Falcon. Even though we have 100MHz 060s, bench.prg only reports 1524* frames
on one, a mere 25% or so more than my own 060 at 66MHz. Why is that?
The answer is easy: STRAM can only take so much data per second over a slow
bus.
It doesn't matter how many instructions the main CPU or DSP can process per
second if the bus is the bottleneck. This means filling the entire screen in
320x200x16bit is only slightly faster on a CT60 than on a standard Falcon!
(This is why moving windows around in realtime is still a bit sluggish, even
on a 060.)
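To put rough numbers on this, here is a back-of-the-envelope Python sketch
using the ST-RAM write figure from the NemBench dump. It rests on several
assumptions: that 1 MByte means 10^6 bytes, that the ~158% score is relative
to a baseline machine (so the baseline rate is derived, not measured), and
that the mono-mode figure carries over to 16-bit mode, which it will not do
exactly. Treat the results as upper bounds only.

```python
WIDTH, HEIGHT, BYTES_PER_PIXEL = 320, 200, 2  # 16-bit true colour

# "Linear 32bit write (ST-Ram)" from the dump: 10.192 MByte/sec (~158%).
CT60_STRAM_WRITE_MB_S = 10.192
CT60_RELATIVE = 1.58

bytes_per_frame = WIDTH * HEIGHT * BYTES_PER_PIXEL


def fills_per_second(mb_per_s):
    """Upper bound on full-screen fills per second at a given write rate."""
    return mb_per_s * 1_000_000 / bytes_per_frame  # assumes 1 MByte = 10^6 bytes


# Derived, not measured: what the baseline machine would manage.
baseline_mb_s = CT60_STRAM_WRITE_MB_S / CT60_RELATIVE

print(f"bytes per frame: {bytes_per_frame}")
print(f"CT60:     at most {fills_per_second(CT60_STRAM_WRITE_MB_S):.0f} fills/s")
print(f"baseline: at most {fills_per_second(baseline_mb_s):.0f} fills/s")
```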
So what is the solution to this problem?
There are only two ways of dealing with this: we can either send less data
over the bus or replace the whole interface itself. The latter is what the
SuperVidel would do, hosting the video buffers on a separate card that the 060
can access a lot faster through its own interface.
Since the SuperVidel doesn't even exist yet, the easiest way is to send less
data. On the desktop that's easy: lower resolution and/or fewer colours,
that's it. But how can we do this in a demo or game? 16 bit true colour has
always been the obvious colour depth for demo coders, but this might be about
to change.
If we still want a chunky mode then a c2p (chunky-to-planar) routine is a good
idea. It has been used on the Amiga since coders started doing texture-mapping
back in the early to mid 90s.
A c2p routine has one mission in life: to convert a buffer made up of chunky
data, 4 or 8 bits per pixel, to bitplane data. I won't go into detail about
how it does it; that's beyond the scope of this article.
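For readers who have never seen one, the job itself (not the optimised
technique) can be shown with a deliberately naive reference version. This is
a Python sketch under stated assumptions: one byte per chunky pixel, a width
that is a multiple of 16, and the word-interleaved planar layout of the ST
and Falcon, where each group of 16 pixels becomes one 16-bit word per plane
with the leftmost pixel in the top bit. The name c2p_naive is made up, and a
real routine would use merge/shift tricks in assembler instead of looping
over single bits.

```python
def c2p_naive(chunky, width, height, planes=8):
    """Convert one-byte-per-pixel chunky data to word-interleaved bitplanes."""
    if width % 16 != 0:
        raise ValueError("width must be a multiple of 16")
    if len(chunky) != width * height:
        raise ValueError("buffer size does not match width * height")

    out = bytearray()
    # Rows are a whole number of 16-pixel groups, so the buffer can be
    # walked linearly, 16 pixels at a time.
    for start in range(0, width * height, 16):
        group = chunky[start:start + 16]
        for plane in range(planes):
            word = 0
            for pixel in group:
                # Collect bit number 'plane' of each pixel, leftmost first.
                word = (word << 1) | ((pixel >> plane) & 1)
            out += word.to_bytes(2, "big")
    return bytes(out)


# 16 pixels, only the leftmost one set to colour 1: plane 0 gets its top bit.
planar = c2p_naive(bytes([1] + [0] * 15), 16, 1)
print(planar.hex())
```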
If we use 8 bitplanes instead of true colour we can halve the amount of data
sent over the bus! We can also easily fade the colours since we are using
palettised graphics. The drawbacks are the lack of colours and the time it
takes to convert the screen to bitplane data.
Thanks to the Amiga community, c2p is one of the most optimised techniques
ever developed, and we should take advantage of that knowledge!
It takes ~55% of a VGA 60Hz frame to convert and write one 320x200 pixel
chunky buffer to a bitplane buffer (phew). This may sound like a lot of CPU
time, but consider that all pixel writes are done as bytes to a chunky buffer
in TTRAM, which also means any polygon overdraw happens in TTRAM, speeding
things up even more.
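In absolute terms the ~55% figure works out roughly as follows. This is a
Python sketch, assuming a refresh rate of exactly 60Hz, 8 bitplanes (one byte
of screen data per pixel) and 1 MByte = 10^6 bytes; the implied write rate is
simply derived from those numbers and is not a measurement.

```python
FRAME_HZ = 60.0    # assumed exact; the real VGA refresh may differ slightly
C2P_SHARE = 0.55   # "~55% of a VGA 60Hz frame" from the text
WIDTH, HEIGHT = 320, 200

frame_ms = 1000.0 / FRAME_HZ
c2p_ms = C2P_SHARE * frame_ms   # time spent converting and writing
left_ms = frame_ms - c2p_ms     # time left for rendering at 60 fps

planar_bytes = WIDTH * HEIGHT   # 8 bitplanes = 1 byte per pixel
implied_mb_s = planar_bytes / (c2p_ms / 1000.0) / 1_000_000

print(f"c2p: {c2p_ms:.1f} ms, left for rendering: {left_ms:.1f} ms")
print(f"implied bitplane write rate: {implied_mb_s:.2f} MByte/s")
```

The implied rate sits in the same range as the ST-RAM figures in the NemBench
dump, which would be consistent with the conversion being limited by the
write to screen memory more than by the CPU.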
This makes c2p and 8-bitplane mode an excellent alternative to true colour.
The lack of colours is not as bad as you may think: most textures can be
converted to 256 colours without any visible loss of detail, and the speed
increase makes it well worth it.
*100MHz 060 + 25MHz bus.
DSP vs CPU
----------
This matter has been covered before in an article of mine in Alive, so I will
only mention it briefly here.
The main theory behind DSP coding has always been to calculate as much as
possible using the DSP and only send small amounts of data over the slow
hostport. The difference now is that by the time the host has read the
hostport, an 060 would have finished the job by itself!
This is obviously not the case for everything, but for simple vertex
transformations and scanline rasterisation it is definitely not worth using
the DSP.
I haven't yet tried a Mandelbrot fractal on both DSP and CPU, but it might be
a close call depending on what part of the fractal you're rendering, so in the
end the DSP might be faster if zoomed in enough.
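Why the part of the fractal matters can be seen from the standard escape-time
iteration: the work per pixel varies enormously between regions, while the
amount of data that has to cross the hostport per pixel stays the same. A
minimal Python sketch; the sample points and the iteration limit are
arbitrary choices for illustration, not taken from any actual demo code.

```python
def escape_iterations(cx, cy, max_iter=256):
    """Iterate z = z*z + c from z = 0; return the steps taken before |z| > 2."""
    x = y = 0.0
    for i in range(max_iter):
        if x * x + y * y > 4.0:
            return i
        x, y = x * x - y * y + cx, 2.0 * x * y + cy
    return max_iter


# Arbitrary sample points: far outside, close to the edge, and inside the set.
for cx, cy in [(2.0, 2.0), (-0.75, 0.05), (0.0, 0.0)]:
    print(f"c = ({cx}, {cy}): {escape_iterations(cx, cy)} iterations")
```

The more iterations a pixel needs, the more calculation there is per word
transferred, which is the situation where the DSP could pull ahead.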
It seems the DSP has finally become the music mixer and sound generator it
was meant to be.
-----------------------------------------------------------------------------
Fredrik Egeberg (deez of evolution)
deez@algonet.se
-----------------------------------------------------------------------------