Alive 8


After years of waiting it has finally arrived! (Well, 6 months ago.) * THE *
accelerator for our beloved Falcon! Elsewhere in this mag you can read CiH's
excellent article about the CT60. This is an additional text that aims to give
you an idea of how fast it is, plus a few ideas for coders.


Here is a NemBench dump from my CT60 at 66MHz with a 20MHz bus. The resolution
is 800x592 in mono, so the STRAM tests are not 100% accurate, but that's not
the important part.

NemBench v2.1 - precision CPU/FPU profiler.

Integer multiply (16bit)     -> 34.133 Mips (~5568%)
Integer divide (16bit)       -> 3.029 Mips (~836%)
Linear (stalled) integer     -> 66.601 Mips (~836%)
Interleaved (piped) integer  -> 132.129 Mips (~1659%)

Float multiply (64bit)       -> 22.755 MegaFlops (~8586%)
Float divide (64bit)         -> 1.828 MegaFlops (~1056%)
Linear (stalled) float       -> 32.768 MegaFlops (~6147%)
Interleaved (piped) float    -> 32.768 MegaFlops (~6159%)

16bit read (100% hit)        -> 131.578 MByte/sec (~1675%)
16bit write (100% hit)       -> 131.578 MByte/sec (~2187%)
32bit read (100% hit)        -> 263.157 MByte/sec (~1676%)
32bit write (100% hit)       -> 263.157 MByte/sec (~3947%)

Linear 32bit read (ST-Ram)   -> 6.313 MByte/sec (~118%)
Linear 32bit write (ST-Ram)  -> 10.192 MByte/sec (~158%)
Linear 32bit copy (ST-Ram)   -> 3.957 MByte/sec (~122%)

Linear 32bit read (FastRAM)  -> 69.167 MByte/sec (~1301%)
Linear 32bit write (FastRAM) -> 68.089 MByte/sec (~1055%)
Linear 32bit copy (FastRAM)  -> 27.193 MByte/sec (~842%)
Linear burst copy (ST-Ram)   -> 3.868 MByte/sec (~119%)
Linear burst copy (FastRAM)  -> 33.781 MByte/sec (~1046%)
Linear burst copy (ST->Fast) -> 6.012 MByte/sec (~186%)
Linear burst copy (Fast->ST) -> 8.668 MByte/sec (~268%)

Floating point figures are the most impressive: a floating point multiply is
about 85 (!) times faster on an 060 than on an 030 with a 68882. There is no
reason to stay away from the FPU anymore!

Memory-wise it is also very impressive: 69MB/s reading from TTRAM is twice
as fast as the old CT2 and 13 times faster than the slow STRAM of a standard
Falcon.

Now, what does this *really* tell us about the speed of a CT60? Not very much,
to be honest. NemBench consists of a series of very specific tests and doesn't
give us an overall picture of how well the machine performs.

This was the problem I had about a year ago when I was writing a CT60 specific
3D engine  using  aranym. At the  time, Didier  Mequignon was  the only person
(except for hardware wizard Rodolphe himself) with a CT60.

There was simply no way of telling how fast it would be when it was finished. I
spent hours reading the available benchmarks and  discussing the matter on IRC.
I decided to send Didier a small test  program and asked him nicely if he could
run it on his  machine and  email me  the output  to use as  a reference for my

This program was bench.prg  and it displayed a few  hundred  texture-mapped
polygons for a few seconds  while keeping  track of the number of frames it
managed to render. When  finished it would output this  number to a file to
give me an idea of the speed. (Being a demo programmer I had to go for some
sort of demo effect!)
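
The idea of counting frames over a fixed period can be sketched in a few
lines of C. This is only a minimal illustration and not the actual bench.prg
source: count_frames, save_result and dummy_render are made-up names, and
the real program renders texture-mapped polygons where the placeholder below
does nothing useful.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical renderer type: one call draws one frame. */
typedef void (*render_fn)(void);

static volatile unsigned long sink;

/* Placeholder renderer (an assumption); a real test would draw polygons. */
void dummy_render(void)
{
    sink++;
}

/* Call render() repeatedly until `seconds` of clock() time have passed and
   return the number of completed frames. */
unsigned long count_frames(render_fn render, double seconds)
{
    unsigned long frames = 0;
    clock_t start = clock();
    clock_t limit = (clock_t)(seconds * CLOCKS_PER_SEC);

    while (clock() - start < limit) {
        render();
        frames++;
    }
    return frames;
}

/* Write the frame count to a text file; returns 0 on success, -1 on error. */
int save_result(const char *path, unsigned long frames)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    fprintf(f, "%lu frames\n", frames);
    return fclose(f) == 0 ? 0 : -1;
}
```

Dividing the returned count by the run time gives an average frame rate,
which is easier to compare between machines than a raw MIPS figure.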

This bench.prg gives you a better overall picture of speed: how the machine
'feels' and how responsive it is. It also turned out to be a way of telling
whether the machine was working as expected or not. If you get distorted
graphics and/or flickery polygons then something is not right, and the cause
could be STRAM speed or a dodgy FPU.

When I first tested bench.prg on my machine I got 1136 frames (about 24 frames
per second) but the optimised SDRAM manager  caused memory corruption and most
programs crashed straight away.

When I finally got it working the way I wanted, the result was 1211 frames
with a 20MHz bus. For comparison, my CT2 managed 271 frames and my
Afterburner040 + Nemesis only 189 frames. Roughly 4.5 times faster than a CT2
is not bad at all!

The poor performance of the 040 is mostly down to  slow STRAM. I could have
limited the drawing area to get more accurate results from bench.prg but it
was only meant as a quick test anyway.


As mentioned in the previous paragraph, slow STRAM caused my old Afterburner040
to choke on bench.prg, and it really shows one of the main bottlenecks on the
Falcon. Even though we have 100MHz 060s, bench.prg only reports 1524* frames, a
mere 26% or so speed increase compared to my own 060/66MHz. Why is that?

The answer is easy: STRAM can only take so much data per second over a slow
bus.

It doesn't matter how many instructions the main CPU or DSP can process per
second if the bus is the bottleneck. This means filling the entire screen in
320x200x16bit is only slightly faster on a CT60, even compared to a standard
Falcon! (This is why moving windows around in realtime is still a bit sluggish
even on an 060.)
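
To put a rough number on this, the small C sketch below works out how many
bytes one full screen takes and how many full-screen writes per second a
given write bandwidth allows. The function names are made up for this
example, and it assumes that 1 MByte means 1,000,000 bytes and that nothing
else is using the bus, so treat the result as an optimistic ceiling only.

```c
/* Bytes needed for one full screen at the given size and colour depth. */
unsigned long frame_bytes(unsigned width, unsigned height,
                          unsigned bits_per_pixel)
{
    return (unsigned long)width * height * bits_per_pixel / 8;
}

/* Upper bound on full-screen writes per second for a given bandwidth.
   ASSUMPTIONS: 1 MByte = 1,000,000 bytes, no other bus traffic. */
double max_fills_per_second(double mbyte_per_sec,
                            unsigned long bytes_per_frame)
{
    return mbyte_per_sec * 1000000.0 / (double)bytes_per_frame;
}
```

A 320x200x16bit screen is 128000 bytes. With the 10.192 MByte/sec STRAM
write figure from the NemBench dump above, that is a ceiling of roughly 80
full-screen writes per second before a single polygon has been calculated,
and any overdraw or reading back from the screen eats into it further.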

So what is the solution to this problem ?

There are only two ways of dealing with this: we can either send less data
over the bus or replace the whole interface itself. The latter is what the
SuperVidel would do, hosting video buffers on a separate card that the 060
can access a lot faster using its own interface.

Since the SuperVidel doesn't even exist yet, the easiest way is to send less
data. On the desktop that's easy: lower resolution and/or fewer colours,
that's it. But how can we do this in a demo or game? 16 bit true colour has
always been the obvious colour depth for demo coders, but this might be about
to change.

If we still want a chunky mode then a c2p (chunky-to-planar) is a good idea, it
has been used on Amiga since  coders started doing  texture-mapping back in the
early-mid 90s.

A c2p routine has one mission in life: to convert a buffer made up of chunky
data (4 or 8 bits per pixel) to bitplane data. I won't go into detail about
how it does it, that's beyond the scope of this article.
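
For readers who have never seen one, here is a deliberately naive sketch in
C of what the conversion has to achieve for 8 bits per pixel. It is not an
optimised routine and not the one used for the timings in this article;
optimised versions are typically built from a handful of bit-merge passes on
whole registers instead of a per-bit loop. The function name c2p_8bpl is
made up, and the layout is an assumption: every group of 16 pixels is stored
as 8 consecutive 16-bit words (plane 0 first), with bit 15 of each word
holding the leftmost pixel.

```c
#include <stddef.h>
#include <stdint.h>

/* Naive chunky-to-planar conversion, 8 bits per pixel.
   chunky: num_pixels bytes, one palette index per pixel.
   planar: num_pixels / 16 * 8 words in the ASSUMED interleaved layout
           (8 plane words per 16 pixels, bit 15 = leftmost pixel).
   num_pixels must be a multiple of 16. */
void c2p_8bpl(const uint8_t *chunky, uint16_t *planar, size_t num_pixels)
{
    for (size_t group = 0; group < num_pixels / 16; group++) {
        const uint8_t *src = chunky + group * 16;
        uint16_t *dst = planar + group * 8;

        for (int plane = 0; plane < 8; plane++) {
            uint16_t word = 0;
            for (int x = 0; x < 16; x++) {
                /* Shift bit `plane` of pixel x into the plane word. */
                word = (uint16_t)((word << 1) | ((src[x] >> plane) & 1u));
            }
            dst[plane] = word;
        }
    }
}
```

Every output word collects one bit from each of 16 different pixels, which
is why the conversion costs real CPU time and why so much effort has gone
into optimising it.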

If we use 8 bitplanes instead of true colour we can halve the amount of data
sent over the bus! We can also easily fade the colours since we are using
palettised graphics. The drawbacks are the lack of colours and the time it
takes to convert the screen to bitplane data.

Thanks to the Amiga community, the c2p is one of the most optimised techniques
ever developed and we should take advantage of that knowledge!

It takes ~55% of a VGA 60Hz frame to convert and write one 320 x 200 pixel
chunky buffer to a bitplane buffer (phew). This may sound like a lot of CPU
time, but consider that all pixel writes are done as bytes to a chunky buffer
in TTRAM, which also means any polygon overdraw is done in TTRAM, speeding
things up even more.

This makes c2p and 8-bitplane mode an excellent alternative to true colour.
The lack of colours is not as bad as you may think: most textures can be
converted to 256 colours without losing any detail, and the speed increase
makes it well worth it.

*100MHz 060 + 25MHz bus.


This matter has been covered before in an article of mine in Alive, so I
will only mention it briefly.

The main theory behind DSP coding has always been to calculate as much as
possible using the DSP and only send small amounts of  data over the slow
hostport. The only difference here is  that by the time the host has read
the hostport, an 060 would have finished the job by itself !

This is  obviously  not  the  case  for  everything  but  for simple vertex
transformations and scanline rasterisation it is definitely not worth using
the DSP.

I haven't yet tried a Mandelbrot fractal on both DSP and CPU, but it might be
a close call depending on what part of the fractal you're rendering, so in the
end the DSP might be faster if zoomed in enough.
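
The reason it depends on the part of the fractal is visible in the standard
escape-time loop, sketched below in C. This is a generic textbook version
and not code from any CT60 or DSP program; mandel_iter is a made-up name,
and it uses doubles for clarity where a DSP version would use fixed point.

```c
/* Escape-time iteration count for the point c = (cr, ci).
   Returns max_iter if the point never escapes within the limit. */
int mandel_iter(double cr, double ci, int max_iter)
{
    double zr = 0.0, zi = 0.0;
    int i;

    for (i = 0; i < max_iter; i++) {
        double zr2 = zr * zr;
        double zi2 = zi * zi;

        if (zr2 + zi2 > 4.0)        /* |z| > 2: the point has escaped */
            break;
        zi = 2.0 * zr * zi + ci;    /* z = z*z + c, imaginary part */
        zr = zr2 - zi2 + cr;        /* z = z*z + c, real part */
    }
    return i;
}
```

Only one small result per pixel has to cross the hostport, while the work
per pixel grows with the iteration count. Points far outside the set escape
after a step or two, points inside it run all the way to max_iter, so the
deeper and busier the area, the more the calculation outweighs the transfer.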

It seems the DSP has finally become the music mixer and sound generator it
was meant to be..

Fredrik Egeberg (deez of evolution)
