Alive 5

This article is meant to give you some guidelines for understanding how the
Falcon should be programmed if you want your production to be compatible with
accelerated machines. Hopefully I can clear up some misunderstandings regarding
FPU and memory usage.

Since its release in late '92 we have seen numerous great demos for our beloved
bird. Who can forget System by eKo or Sono by Avena, to name only two? These
and many other demos were great and still are, but there is one problem: they
are not very compatible with accelerated Falcons. In fact most of them refuse
to run at all on anything but a 16MHz system. This wasn't a big problem back in
the mid 90s, since accelerators were not that common and if you had one you
could usually turn it off, either with a piece of software or a switch on the
back of your machine. Today, 10 years after its release, we still see a lot of
demos released for standard Falcons only, which can be a bit frustrating if you
have a really fast machine but no decent software to show what it can do. For
some reason accelerators never really caught on in the Atari world, at least
nothing like among Amiga users, where a 060 board is pretty much standard these
days.

This is a bit unfortunate for us, since demos tend to be either hardcoded for
standard Falcons or designed with any Falcon in mind. The latter usually look
a bit "weaker" when running on a slow machine, since they cannot take advantage
of a fixed framerate and things like that. It is a lot easier to program a demo
when you know that it will look the same on all machines, and you can
pregenerate a lot more data if you don't have to consider variable framerate.

Fixed framerate is when you have, say, a sprite being drawn and updated at the
same time: you update the position and then you draw the sprite in your
mainloop. This worked fine on STs and to some extent on Falcons. The problem
comes when somebody with a machine twice as fast as yours tries to run your
demo: the mainloop with update and draw will be executed twice as many times
per second as it would on yours, so the sprite reaches the desired position in
half the time, and it usually looks bad.

The solution to this is to have all the "updates" run on a timer. If you set up
a timer to trigger 100 times per second you will get an interrupt at the same
rate on every machine, no matter how fast it is, and that is where you update
your sprite position. You keep the drawing part in your mainloop, which will
execute as many times per second as your Falcon can manage. Using this method
your demos will run better on accelerated Falcons and you can be sure the
sprite will move as expected across the screen.
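
The timer idea can be sketched in C as a small host-side simulation. This is
not Falcon code: timer_tick() merely stands in for the real timer interrupt,
and all names, the 100Hz rate and the frames_per_tick parameter are
assumptions made for illustration. It also assumes the machine manages at
least one frame per tick.

```c
#include <assert.h>

#define TICK_HZ 100          /* assumed timer rate: 100 interrupts per second */

typedef struct {
    int  sprite_x;           /* position, advanced only by the timer */
    long frames_drawn;       /* frames rendered by the mainloop */
} demo_state;

/* Stand-in for the timer interrupt handler: one fixed step per tick. */
static void timer_tick(demo_state *s)
{
    s->sprite_x += 1;
}

/* Stand-in for the draw routine; a real one would plot at s->sprite_x. */
static void draw_frame(demo_state *s)
{
    s->frames_drawn++;
}

/*
 * Simulate `seconds` of run time on a machine that manages
 * `frames_per_tick` mainloop iterations between two timer ticks.
 * A faster machine simply has a larger frames_per_tick.
 */
static demo_state run_demo(int seconds, int frames_per_tick)
{
    demo_state s = { 0, 0 };

    assert(seconds >= 0 && frames_per_tick >= 1);
    for (int tick = 0; tick < seconds * TICK_HZ; tick++) {
        timer_tick(&s);                    /* same count on every machine */
        for (int f = 0; f < frames_per_tick; f++)
            draw_frame(&s);                /* scales with CPU speed */
    }
    return s;
}
```

Calling run_demo(2, 1) and run_demo(2, 4) leaves the sprite at the same
position in both cases; only the number of frames drawn differs, which is
exactly the behaviour you want from an accelerated machine.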

Things like the above example have to be considered when programming, and
hopefully this document can help you understand why some programmers prefer
hardcoding their demos while others do things in a more compatible way.

1) FPU
The 68882 is usually clocked at 16MHz and is connected to the CPU via a 16bit
wide interface, which means that it is anything but fast. It should never be
used for innerloop calculations but rather for table generators or small
operations in the mainloop. The good thing is that it can handle very big
numbers with up to 80bit extended precision, though 32bit is usually enough.

Apart from a few control registers the FPU has 8 data registers, fp0-fp7,
each 80 bits wide (an extended value takes up 96 bits when stored in memory).
They are used for all internal operations.

Since the 68882 has built-in trig functions such as sine/cosine, pregenerated
fixedpoint sinetables are no longer needed. They can be replaced either with
runtime instructions or with tables generated with great accuracy at the
beginning of the program.

Deriving sine of a number has never been easier, simply do:

   fsin fp0,fp1

where fp0 is your number and fp1 will become sine(fp0).

On the Falcon the FPU is optional, and even though not many demos make use of
it this will more than likely change when the CT60 becomes available with its
fast built-in FPU.

The reasons a democoder would use the FPU are accuracy and laziness, I'd say.
Fixedpoint is usually enough for most demos, but when you start doing heavier
calculations such as normalizing vectors the FPU becomes invaluable with the
great accuracy it provides. As for the laziness, it is simply so much easier to
use fsin and fsqrt than to use tables.

As mentioned above, the 040 and 060 both come in models with an integrated
FPU. This however is a cut-down version of the 68882, and even though it
features most of the instructions, some are left out, which means emulation is
used to maintain compatibility.

On the 060 the 64bit versions of mul/div are also left out, which means either
emulation of those missing instructions or the use of FPU multiplications when
doing high precision calculations. As a result most 060 applications will choke
a standard Falcon, not only because of the faster CPU but because FPU
instructions are used to replace the CPU equivalents that are slower on the
060. On the 060 a normal muls.l d0,d1 takes 2 clockcycles and an FPU
multiplication takes 3 clockcycles; the difference is usually negligible
considering the extra accuracy you get with the FPU version.
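
To see where the 64bit multiply comes in, consider 16.16 fixedpoint: the
product of two such numbers needs the full 64bit result before it is scaled
back down. The C sketch below shows that next to a floating point version of
the same multiplication. The 16.16 format and the function names are
assumptions picked for illustration, and a C double only has a 53bit mantissa,
so very large products lose their lowest bits in the second version.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fix16;            /* assumed format: 16.16 fixed point */
#define FIX16_ONE 65536

/* CPU style: 32x32 multiply with a 64bit intermediate result. */
static fix16 fix16_mul_int(fix16 a, fix16 b)
{
    int64_t wide = (int64_t)a * (int64_t)b;   /* the full 64bit product */
    int64_t result = wide / FIX16_ONE;        /* scale back to 16.16 */

    assert(result >= INT32_MIN && result <= INT32_MAX);
    return (fix16)result;
}

/* FPU style: do the multiplication in floating point instead. */
static fix16 fix16_mul_fpu(fix16 a, fix16 b)
{
    double product = (double)a * (double)b / FIX16_ONE;

    assert(product >= INT32_MIN && product <= INT32_MAX);
    return (fix16)product;
}
```

Both functions give the same answer for ordinary values, for example 3.5 times
-2.25, which is why swapping one for the other is an easy choice on a machine
where the FPU version costs next to nothing extra.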

2) DSP
The Falcon DSP is a beast, no doubt. It makes it possible to replay mp3s on a
plain 16MHz Falcon as well as calculating fractals and everything else that
requires a lot of multiplications, because that is really what it's good at:
multiplying numbers. This is especially true when concatenating matrices or
applying them to vectors (vertices), since you have the MAC instruction,
Multiply and Accumulate, which basically multiplies two numbers and adds the
result to a third number (the accumulator). Throw in a "round" and "negate"
together with parallel moves and you end up with code quick enough to make even
the most overclocked 030 jealous. This is why demos which employ DSP code for
both music and graphics can do so many more polygons and rotations than most
"normal" demos can. However, this comes at a price: compatibility. There is a
reason why demos like Sono and Hmmm don't work on accelerated Falcons. It's
difficult to keep demos compatible with faster Falcons when trying to squeeze
every possible ounce of performance from the machine. Keeping DSP-host
transfers in sync can be very difficult when you have to consider faster
machines; it's easier to hardcode things.
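
To show what the MAC operation buys you, here is a matrix applied to a vertex
written in C as nothing but multiply-accumulate steps. This is a simplified
model with assumed names: it uses plain integers, whereas the real DSP works
on 24bit fractional words and does each step in a single instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One multiply-accumulate step: returns acc + a * b. */
static int64_t mac(int64_t acc, int32_t a, int32_t b)
{
    return acc + (int64_t)a * b;
}

/*
 * Apply a 3x3 matrix (row-major, 9 entries) to a vertex:
 * three MAC steps per output component, nine in total.
 */
static void mat3_apply(const int32_t *m, const int32_t *v, int64_t *out)
{
    assert(m != NULL && v != NULL && out != NULL);
    for (int row = 0; row < 3; row++) {
        int64_t acc = 0;                   /* clear the accumulator */
        for (int col = 0; col < 3; col++)
            acc = mac(acc, m[row * 3 + col], v[col]);
        out[row] = acc;
    }
}
```

With a matrix for a 90 degree rotation around the z axis, the vertex (5, 7, 9)
comes out as (-7, 5, 9). Concatenating two matrices is the same pattern, just
with 27 MAC steps instead of 9.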

When programming graphical effects you are most likely to use the host port for
communication between the DSP and the host CPU. This interface is an 8bit bus
acting like a 24bit one, split into three 8bit parts: high, middle and low
byte. You can either read/write the last two bytes (middle and low), giving you
16 bits, or access all three bytes plus an unused top byte as a full 32bit long
word. The top byte is always ignored, reads as zero and won't cause a bus
error. This is all nice and easy to use but there is one problem: speed. Even
though the DSP itself is very quick, the interface to the host is not. Its
speed is about half that of STram, or roughly 2.5MB/second. This means that you
will only benefit from using DSP code when getting the same result with the CPU
would cost more than the roughly two STram reads/writes that each DSP transfer
costs. This is of course a very rough figure but it gives you an idea of how
slow the interface really is.
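
The byte layout described above has one practical consequence: a signed 24bit
DSP word read as a long arrives with a zero top byte, so the host has to sign
extend it itself. A small C sketch of that bookkeeping follows. The function
names are made up for this example and nothing here touches the real host
port registers.

```c
#include <assert.h>
#include <stdint.h>

/* A signed 24bit value as it appears in a long access: top byte zero. */
static uint32_t to_host_long(int32_t value)
{
    assert(value >= -0x800000 && value <= 0x7FFFFF);  /* must fit in 24 bits */
    return (uint32_t)value & 0x00FFFFFFu;
}

/* Recover the signed 24bit DSP word from a long read off the host port. */
static int32_t from_host_long(uint32_t raw)
{
    uint32_t w = raw & 0x00FFFFFFu;        /* the top byte carries no data */

    if (w & 0x00800000u)                   /* bit 23 set: negative number */
        return (int32_t)w - 0x1000000;
    return (int32_t)w;
}

/* A word access only sees the middle and low bytes. */
static uint16_t host_word(uint32_t raw)
{
    return (uint16_t)(raw & 0xFFFFu);
}
```

So -1 travels as $00ffffff and has to be turned back into -1 on the host side,
while a word access to the same data would only give you $ffff.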

An obvious advantage of using the DSP is parallel processing. Since the DSP
is a completely separate processor it should be used accordingly, i.e.
calculating data while the host is busy drawing or clearing the screen, and
when the host is finished you transfer all data, this to get the most out of

3) Memory
A standard Falcon comes with 1, 4 or 14MB of RAM (STram), but I doubt there
are any Falcon users out there with only 1MB in their machines.

Upgrades to 14MB are available from a few places and are well worth the money.
Together with an FPU I consider 14MB to be the best upgrade you can do, since
a lot of applications and demos require more than 4MB. For one reason or
another, some 14MB boards tend to be very sensitive to overclocking; problems
range from slight pixel flicker to very unstable machines. CT2 machines
especially have proven unstable with the wrong memory, and most homemade 14MB
boards fail to work properly.

A standard Falcon can only use STram, but there are quite a few boards out
there with TTram support. The difference between ST- and TTram is that
audio/screen buffers can _only_ be placed in STram, whereas code can be placed
in both STram and TTram. TTram is also usually a lot faster and the board can
hold more memory. The speed difference is significant, especially on the
Afterburner040 and the upcoming CT60. On the AB040, TTram read speed is about
35MB/second while STram manages only half that of a standard Falcon, about
3MB/second (due to technical issues with 040+ processors; this problem has been
solved on the CT60 according to Rodolphe Czuba).

Use of TTram is controlled with 2 bits in the header of an executable:
TTram-mem which, if set, defaults all memory allocated with Malloc() to TTram,
and TTram-load which, if set, forces the program to load into TTram. If no
TTram is available, STram will be used. Of course you can allocate both STram
and TTram in your program using Mxalloc(), where you can specify what memory
type you want.

When programming with compatibility in mind, and to make your application run
as fast as possible, you should always try to load the program into TTram by
setting the appropriate bit in the program header (this can be done using for
example the fileflag CPX or the Thing desktop). For screen/audio buffers you
have to use Mxalloc(), as mentioned above, to allocate STram.
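
What a tool like the fileflag CPX does to the header can be sketched in C as
below. The sketch works on an in-memory copy of the header and runs on any
machine. The layout it assumes (28 byte header, magic $601a, a 32bit big-endian
flags field at offset 22, bit 1 for TTram-load and bit 2 for TTram-mem) is the
usual GEMDOS one as far as I know, but check it against your own documentation
before patching real files. All names are my own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PRG_HEADER_SIZE  28      /* assumed size of the program header */
#define PRG_FLAGS_OFFSET 22      /* assumed offset of the flags long */
#define PF_TTRAM_LOAD    0x02u   /* assumed bit: load program into TTram */
#define PF_TTRAM_MEM     0x04u   /* assumed bit: Malloc() prefers TTram */

static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

static void write_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/*
 * Set both TTram bits in a copy of the header, keeping all other flags.
 * Returns 0 on success, -1 if the buffer does not look like a program.
 */
static int prg_enable_ttram(uint8_t *header, size_t len)
{
    uint32_t flags;

    assert(header != NULL);
    if (len < PRG_HEADER_SIZE || header[0] != 0x60 || header[1] != 0x1A)
        return -1;
    flags = read_be32(header + PRG_FLAGS_OFFSET);
    write_be32(header + PRG_FLAGS_OFFSET,
               flags | PF_TTRAM_LOAD | PF_TTRAM_MEM);
    return 0;
}
```

Inside the program itself, the screen and audio buffers would then be
requested with Mxalloc() using the "STram only" mode (mode 0, to my knowledge),
so they end up in the right memory no matter how the flags are set.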

Another, less obvious pitfall is the hardware registers. Not only does the
Falcon lack the shadow registers available on the ST, but since some add-on
boards are fully 32bit, accessing $ffxxxx will cause problems. Use the full
32bit address, $ffffxxxx, instead to avoid problems with expansion boards.

For demos there used to be a magic 4MB memory limit, but recently we have seen
more and more productions requiring 14MB. This is mostly due to music: the
Falcon is able to replay mp2 music at high quality using the DSP in demos, but
there is a flipside to this coin. A 4 minute mp2 at 96kBit/32kHz takes up
almost 3MB, so unless you stream the music from the harddrive you have to load
it all into STram, and this along with 500KB for screen buffers leaves you
little or no memory for code and graphics on a 4MB machine. This means that
most demos with mp2 music are 14MB only. Please keep in mind what I mentioned
above and use TTram wherever possible, otherwise we end up with 14MB STram-only
demos.
4) Blitter
On the STE the blitter is really nice and has a big advantage over the CPU
when it comes to copying and shifting data around. On the Falcon however it's a
different matter. Even though the blitter runs at 16MHz compared to 8MHz on the
STE, it is simply too slow to be useful and is probably only present for
compatibility reasons. Since it can only access STram it is even more useless,
so my advice is to leave the blitter alone completely on the Falcon.

5) Videl
Videl is the graphics processor in the Falcon, so unless you are using an
external graphics card of some kind, this is where all the graphical output
comes from. As mentioned in the memory section, Videl can only use STram for
framebuffers. Its resolution can be set either by using XBIOS or by writing to
the hardware registers themselves. The latter gives you more control over the
resolution and is the preferred method of many coders, since there is an
excellent tool for this job, Screens Pain. It allows you to set any resolution
the Videl is capable of, as well as the number of colours and the frequencies.

On the topic of frequencies, speeders usually replace the 32MHz Videl clock
with a 50MHz one, which means that 32MHz fed resolutions no longer work on
accelerated machines and should be avoided. Use 25MHz instead for both RGB and
VGA. Unfortunately this means that a 320 pixels wide screen on RGB becomes
stretched and the left- and rightmost sections are no longer fully visible
unless you can adjust your monitor/TV, but this is usually not a problem.

When resetting the resolution you might end up with a corrupt resolution or a
slightly offset screen. This is due to a bug in XBIOS. There are ways of
restoring the Videl, and although you might still see an offset screen this is
rare, and setting and resetting the resolution again usually fixes the
problem.


The most important thing for retaining compatibility with faster Falcons is
not to use hardcoded DSP transfers, since they will not work on anything but
the machine you made them for. The TTram vs. STram problem can be solved by
clearing the flags in the program header, but this has to be considered bad
programming since it's very easy to allocate the right kind of memory for your
purposes.

- keep updates on timers but render the frame in the mainloop.
- allocate STram for screen and audio buffers using Mxalloc().
- don't use the blitter (at all!).
- avoid hardcoded DSP transfers (never assume that the CPU and DSP frequency
  ratio is 1:2 etc).

If you follow these simple rules you can be fairly certain that your demo will
work properly on any Falcon. If you have any questions regarding this matter
(or anything else) don't hesitate to contact me, either using email or on IRC,
For more info, download  Alive #1 from  and read Evil's
coding tutorial. You can also download his demosys, which includes init/restore
routines as well as a timer-based effect system, get it from:

This archive also includes a small test demo as an introduction.

Fredrik Egeberg (deez of mind-design)
