Some Words on
programming the
ATARI Jaguar
Some things ahead: it's great to see another issue of ALIVE and
it's great to see our scene seems to be rather active currently.
It was really great to meet some of you guys again at Outline
earlier this year, just as it was great to see the many stunning
releases. I'd like to spell out extra special thanks to Deez and
DHS for their efforts in creating the first stunning CT60
productions, to gwEm for his work on maxYMiser and misc.
interesting projects, to Defjam for his unexpected return to the
scene and his awesome smooth rotozoomer brought to us in this
issue's intro, and of course to everyone else I forgot.
Concerning my own coding, I didn't have much motivation or spare
time to do a lot of stuff recently as I'm very busy preparing
for my exams at this very moment. I hope you have a bit of
understanding and that it doesn't make me seem like a lazy
bastard again - be sure .tSCc. will have their release comeback
in the future - maybe on a fairly 'unorthodox' platform, though
:).
So what is this article going to be all about? Well, I'd like
to take this opportunity to advertise the ATARI Jaguar as a
demo-coding platform a bit since, as some of you might already
know, I've badly fallen in love with the machine's hardware,
especially from the point of view of a demo-coder. I had been
involved in coding all types of ATARI 16/32 bit machines such as
the ST, the Falcon and even the TT. And even if I didn't always
get as deep into those platforms' hardware as many others did
(for instance, I never fully learned how to code the Falcon's
DSP, to be honest merely out of laziness), as ugly as it might
sound, in my humble opinion none of those computers is capable
of providing a similar amount of horsepower when exploited
properly.
I've heard people say the Jaguar could have been so much better
if it didn't have that huge series of hardware bugs which make
coding this beast a mess, besides the fact that there aren't
many useable development kits to choose from. Let me try to
straighten things up a bit: first of all, people are right in
claiming that the major issue is getting a proper development
kit running. If you are interested in the different types of
available hardware, I would suggest you take a look at the
following site, which provides a comprehensive list:
http://www.jaysmith2000.com/Development.html .
I've come to find Bastian Schick's BJL-kit to be quite useable
and rather easy to install, as well. It uses a hacked boot EPROM
and sort of a null-modem cable that plugs into your ATARI/PC
parallel port or Jaguar joypad interface, respectively.
Modifying your Jaguar for BJL requires a tiny bit of knowledge
about electronics/soldering, otherwise you run the risk of
damaging it. Feel free to drop me a line if you want me to
modify your jaggy at a party or some other occasion.
To me BJL seems a good choice since a bunch of French sceners
are currently busy developing a brand new cable which plugs
into the PC's USB interface and new uploaders which are supposed
to be more reliable and also compatible with windoze 2000/xp and
Linux as opposed to the outdated uploading utilities provided by
Bastian Schick himself. Even a CompactFlash card which plugs
into your Jaguar's ROM connector is in the making by another
crazy Frenchman called SCPCD (hi there... :) ):
http://scpcd.free.fr/
The next question would be: which development software shall I
use? There's a very stable and powerful macro assembler called
"MadMac" which was used by ATARI Corp. itself and which makes up
a fully useable development environment along with "aln", the
accompanying linker. Both of these tools exist for different
platforms such as ATARI TOS, Linux and MS-DOS, but beware:
windoze (NT and above) users are forced to use an MS-DOS
emulator such as DOSBox, since the OS' built-in command line
doesn't get along with the DPMI used by madmac/aln at all and
will leave them in an unusable state. You'll be able to find
DOSBox here:
http://dosbox.sourceforge.net/
A remote debugger for all these platforms is available along
with Alpine only, whereas another debugger which works with BJL
is available for ATARI TOS systems (unfortunately it does not
work under MiNT though), while a brand new one is planned by the
authors of the new BJL cable (called "catnip", by the way), too.
It seems they might team up with swapD0 from Spain, who has at
least already built his own remote debugger for BeOS. I hope
they'll do so, as swapD0 also made his own assembler called
"jas", which is now available for BeOS and windows and which I
assume is supposed to become more madmac compatible in the
future, but it already looks promising:
http://www.freewebs.com/swapd0/tools.html .
There's also an outdated version (v2.6) of a cross GCC for
MS-DOS which is able to produce aln-linkable objects both for
the m68k and for the Jaguar's RISC GPU/DSP, but I'll say some
more about that further below. swapD0 claims to be working on
his own Jaguar C compiler, so feel free to drop him a line of
support via his email (I believe he's stuck on problems with
ANSI C's type conversion rules while the basic framework is
done).
Let's lose a few words about the rumoured lots of hardware bugs
which are said to restrict or even disable some of the machine's
features completely.
But before we continue I need to introduce a few more things.
What does the Jaguar offer that might make it so interesting to
a demo-coder in a nutshell?
(1) A Motorola 68000 CPU clocked at 13.3 MHz which is used as
    the main processor. I guess most of you are familiar with
    coding that one.
(2) A custom chip called "Tom" clocked roughly at 26 MHz
unifying common graphic processing capabilities which
contains:
- A pipelined RISC GPU containing two banks of thirty-two 32bit
  registers, providing a fast MAC unit which is also able to
  perform systolic matrix multiplications of up to 8x8
matrices (ideally suited for a fast DCT/iDCT), a separate
division unit, besides some special graphic processing related
instructions (for colour mixing, saturation instructions etc.)
plus 4KBytes of fast internal "cache". Rudimentary support for
floating point operations is implemented, too (namely a MTOI
and NORMI instruction, i.e. "mantissa to integer" and
"normalize integer").
- A special BLITTER to perform tasks such as scaling / texture
mapping blocks / scanlines of 1/4/8/16/24 bpp pixels at
16.16 bit precision, colour mixing of layers, hardware Gouraud
shading of single scanlines at 8.16 bit precision, hardware
Z-Buffering of single scanlines, and general block-transfer
tasks along with the usual logical operations in a phrase
(64 bit fetch) mode. All those operations can be limited to a
defined address window in order to provide a simple hardware
clipping.
- An object processor to process a linked list of bitmap,
  scaled bitmap, transparent (read: colour keyed) and misc.
  objects at 1/4/8/16/24 bpp. It supports various colour spaces
  / types of colour encoding, which are 1/4/8 bpp colour mapped
  using a 16 bpp CLUT (RGB/CrY), 15/16 or 24 bpp (555, 565, 888)
  RGB colour spaces and a special 15/16 bpp (448)
  chroma-luminance model which has been chosen to provide
  support for very fast shading, which just isn't as easy in the
  RGB model (see the little sketch below). The CrY colour scheme
  is also better suited to implement lossy image en/decoding
  algorithms such as the existing BPEG/JagPEG (JPEG-like)
  codecs; on a side note, I'm already working on my own
  implementation.
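Just to illustrate why shading is so cheap in CrY: the intensity
is simply the low byte of a 16 bpp pixel, so darkening or
brightening boils down to scaling that single byte. A rough
sketch in GPU assembler (register usage arbitrary, r0 holds the
pixel, r1 a brightness factor from 0..255):

  move    r0,r2
  shrq    #8,r2         ; isolate the colour (CR) byte...
  shlq    #8,r2
  sub     r2,r0         ; ...so r0 only keeps the intensity (Y)
  mult    r1,r0         ; scale the intensity (unsigned multiply)
  shrq    #8,r0         ; back down to 8 bits
  or      r2,r0         ; recombine: same colour, new brightness

Doing the same in RGB565 would mean splitting, rescaling and
repacking three channels instead of one.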
All elements of Tom are interconnected via a full 64 bit wide
data path while they can communicate with the main CPU via a 16
bit bus to do simple 16/32 bit transfers between CPU <-> GPU /
BLITTER or a special high speed 64 bit coprocessor bus to main
RAM when the GPU / BLITTER are driven in bus master transfers.
Bus sharing is also possible, which opens a whole new world of
pipelining possibilities (imagine the GPU computing the next
scanline, while the BLITTER transfers the previous scanline from
internal cache into local screen RAM and such). With pipeline
stalls removed, all GPU instructions are able to execute in a
single cycle, except for external loads/stores and the division,
which takes 18 cycles (1+16+1, so 16 actually, excluding head
and tail). That latency can be filled with instructions further
below which do not have to wait for the operation's result, so
with a careful implementation it effectively executes in one
clock cycle as well (see the small example below) - mind that
the division unit works in parallel with the ALU.
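To make that last point a bit more concrete, here is a rough
sketch of hiding the divide latency - register usage is
arbitrary, the point is merely that independent work goes in
between starting the division and using its result:

  div     r1,r0         ; r0 = r0/r1, the quotient arrives
                        ; roughly 16 cycles later
  load    (r5),r2       ; meanwhile do work that doesn't need r0
  add     r3,r2
  store   r2,(r6)
  add     r0,r4         ; first use of the quotient - the
                        ; pipeline only stalls here for whatever
                        ; is left of the divide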
(3) Another custom chip called "Jerry" clocked roughly at 27 MHz
unifying sound and communication capabilities which contains:
- Another pipelined RISC processor, the DSP, which is very
similar to the GPU regarding the programming model but which
has some instructions replaced by ones matching the
requirements of sound synthesis (modulo addressing for ring-
  buffers etc.). Furthermore, its MAC unit works at an extended
  precision of 40 bits (as opposed to the GPU's MAC unit, which
  works at 32 bits), the internal cache buffer has been extended
  to 8 KBytes, and a set of ROM wave-tables for sound synthesis
  has been added. Another difference to the GPU are the
  saturation operations, which have been replaced by signed
  versions matching the 40 bit precision here.
- Additionally, Jerry includes a clock controlling unit which
  incorporates a UART, programmable timers, a joystick interface
controller etc.
- A 16bit (14bit used) stereo audio DAC that should but doesn't
have to be fed by a DSP interrupt (you can use the other
processors, too)
Most of what's been said about the GPU goes for the DSP, too.
DSP <-> CPU and DSP <-> main RAM transfers work via the 16 bit
data bus.
(4) A programmable video controller to drive PAL/NTSC RGB or TV
set displays up to maximum resolutions of 720x220 pixels (non-
interlaced) or 720x440 pixels (interlaced).
(5) 2 MBytes of DRAM.
Quite neat, don't you think? If you'd like to find out a bit
more about taming and programming the hardware described above,
or if you are going to learn the GPU/DSP assembler syntax used
for the two RISC processors, I'd recommend you take a look at
the following document, which gives a fairly detailed insight:
http://www.hillsoftware.com/files/atari/jaguar/jag_v8.pdf .
If you are familiar with 680x0 assembler (I bet you are) it
should be rather easy for you to pick up the new syntax, which
has been designed to stay rather close to the m68k instruction
set imho.
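Just to give you a first taste of how familiar it feels, here's
a tiny made-up fragment (nothing meaningful happening, it merely
shows the flavour of the syntax):

  movei   #$F03F00,r2   ; some scratch longword in GPU RAM
  moveq   #10,r0        ; small immediates work like on the 68k
  movei   #$12345678,r1 ; 32 bit constants use movei instead
  add     r1,r0         ; r0 = r0 + r1 - destination comes last
  store   r0,(r2)       ; register indirect store, 68k style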
Let me now tell you something about pitfalls I had to overcome
in order to get started. First off, it isn't as bad as I
thought it would be after having heard so many fairytales. I'll
try to give you a little bug list that might have some impact on
your coding, though:
(1) Neither the GPU nor the DSP can execute any code from main
RAM:
This isn't generally true; they'll only work unreliably to
unpredictably when it comes to executing (conditional) jumps
from main RAM. However, this problem can be solved by using a
little cache management system which loads GPU/DSP program
sections from main RAM into GPU/DSP RAM on demand, prior to
executing the code there. Those sections won't need any
relocation as long as you stick to PC-relative code or put them
into absolute cache locations using the assembler's .org
directive, as in the little sketch below.
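A minimal sketch of what such a cache management boils down to.
Assume a code section of "size" longwords has been assembled
with .org to run at "cache" somewhere in GPU RAM (which starts
at $F03000) and currently sits at "section" in DRAM - all three
names are made up for the example:

  movei   #section,r0   ; source in main RAM
  movei   #cache,r1     ; destination in GPU RAM
  movei   #size,r2      ; length in longwords
copy:
  load    (r0),r3       ; external load from DRAM
  addq    #4,r0
  store   r3,(r1)       ; internal store into GPU RAM
  addq    #4,r1
  subq    #1,r2         ; sets the flags for the branch below
  jr      ne,copy
  nop                   ; branch delay slot
  movei   #cache,r0
  jump    T,(r0)        ; safe to jump now - we end up executing
  nop                   ; from internal RAM (delay slot again)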
(2) Buggy GPU/DSP pipelines which don't insert stalls or which
ignore previous ALU results in some situations:
This is something to be taken into account not only when you are
going to optimize your programs by rescheduling your code, but
also when you encounter one of those uncommon combinations where
the above is true. Here are some examples applying to both GPU &
DSP unless noted differently:
(2.1) Consecutive divides:
  div   r0,r1      change to   div   r0,r1
  div   r1,r2                  or    r1,r1    ; 'touch' register
                               div   r1,r2
(2.2) Consecutive writes to the same register (very uncommon):
  load  (r3),r2    change to   load  (r3),r2
  moveq #3,r2                  or    r2,r2    ; 'touch' register
                               moveq #3,r2
  ; ^ r2=(r3) here             ; ^ r2=3 here
  ; (who would do this anyway?)
(2.3) The pipeline will not update register values after a long
latency instruction (divide, external load) before an indexed
store:
  div   r0,r1      change to   div   r0,r1
  store r1,(r14+4)             or    r1,r1    ; force the div
                                              ; to finish
                               store r1,(r14+4)
(3) Hardware Z buffering is unusable due to a hardware bug:
This is not entirely true either. It's true that the Z buffer
data has to be phrase (i.e. 64 bit) aligned, which isn't always
the case for, say, a polygon's scanlines, which might start at
an arbitrary pixel position.
There are basically two solutions to this problem. Either hold
a scanline-based, phrase-aligned list of Z values so the Z
buffering works at scanline granularity, which may still produce
a better result than the painter's algorithm, or prepare your
scanline at a fixed and phrase-aligned cache position in GPU RAM
before writing it to screen RAM. The latter might come in handy
when you let the BLITTER perform multiple passes per scanline,
which should be done in GPU RAM for speed reasons anyway (e.g.
Gouraud shaded texture mapping).
(4) The object list needs to be refreshed each screen redraw:
There's a faulty design in the object processor which causes it
to destroy some fields in the object list's records, meaning the
object list has to be rebuilt each VBL. I'd recommend building
the object list once into a primary buffer which is then copied
to a second buffer actually used by the OLP (object list
pointer) during the VBL handler. At least this is the way it's
being done in Doom and the way which seems to work reliably -
mind that you have to swap the pointer before transferring it to
the OLP register since the OP expects the words in reversed
order (see the little snippet below)! Rebuilding the list each
VBL takes unnecessary time, so copying from a pregenerated
buffer appears decent to me unless you have to handle a dynamic
list. Using the BLITTER for copying seems ridiculous too, unless
you have to maintain very large sprite lists for instance.
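On the 68000 side that word swap is just a single swap
instruction. A tiny sketch of the VBL-side handover, assuming
"oblist" is your (phrase aligned) second buffer and OLP is the
object list pointer register as defined in the usual jaguar.inc:

  move.l  #oblist,d0    ; address of the freshly copied list
  swap    d0            ; the OP wants the two words reversed
  move.l  d0,OLP        ; hand the list to the object processor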
The list could go on here, but these are what I've found to be
the most sneaky situations causing problems at all. Please also
take care of the fact that the pipeline is designed to execute
one more instruction after every jump instruction, so either
insert a nop there to be sure or simply don't mind the next
instruction being executed when it is uncritical - this can be
nice when optimizing your code as well (see the short example
below). There are some more rare situations where a nop must be
inserted, but these will usually be recognized by madmac and it
will also warn about them (a jump or movei right after a jump,
for instance). I'd suggest referring to the above jag_v8.pdf
document for more information on hardware and software bugs in
the Jaguar.
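Here is a small example of putting that delay slot to use
instead of wasting it on a nop - a simple clear loop where the
pointer advance lives in the slot (register usage arbitrary,
"buffer" made up for the example):

  movei   #buffer,r1    ; some longword aligned scratch area
  moveq   #0,r0
  moveq   #8,r2         ; clear 8 longwords
wipe:
  store   r0,(r1)
  subq    #1,r2         ; sets the flags tested by the jr below
  jr      ne,wipe       ; loop while r2 != 0 ...
  addq    #4,r1         ; ...the delay slot executes on every
                        ; pass, taken or not, so advancing the
                        ; pointer here is safe and costs nothing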
Another general rule, not only when it comes to optimizing your
code, should be to refrain from external (DRAM) loads/stores
whenever possible, as they tend to be a lot slower than internal
reads/writes in GPU/DSP RAM. This is the main reason for texture
mapping being noticeably slower than Gouraud shading, which
doesn't require an additional source read in DRAM by the
BLITTER - unfortunately the GPU RAM is far too small to hold any
reasonably sized textures. But behold, it's still much faster
than one might expect according to some people's annoying
rumours about the Jaguar's creepingly slow texture mapping, when
implemented in a smart, pipelined manner - just take a look at
games like Iron Soldier or Hover Strike on that matter.
Before I move on to the starcat dev CD review, I will tell you
some more about the cross GCC mentioned above. To me it was one
hell of a deal to set up that chain of tools, but once I found
out how to do it, GCC ran fine and even recompiled the source
code of id Software's DOOM. As it is capable of outputting BSD
linkable objects that will be eaten up by ALN, the compiler can
be used along with separate MADMAC assembler modules quite well,
which makes it a nice development environment, especially for
larger projects such as games (I've come to find that writing a
full game completely in assembler isn't my cup of tea after sort
of finishing up wolf3d, btw.). On the other hand it fucks up
under DOSBox quite often and there isn't any set of readily
adapted/compiled libraries, which is why I'm putting my hope
into swapd0 finishing his new Jaguar C compiler. Here's what
you'll need to get started (you can find the required files on
the starcat dev CD or on the web - read below):
Depack the full set of GCC tools into a directory of your
harddrive called "bin" or something. It should now contain two
subfolders called "AGPU/2.6/" and "M68K/2.6" where the cross
compilers will reside. The next step is to set up some
environment variables (for DOSBox etc.):
set PATH=directory_where_you_put_your_bin_utils
set LIBDIR=directory_where_you_put_gccs_libraries
set INCDIR=directory_where_you_put_include_files
set GCC_EXEC_PREFIX=/
# you should leave this blank
set TMPDIR=directory_where_you_want_temporary_files_to_be_put
# GCC
set TEMP=directory_where_you_want_temporary_files_to_be_put
# same for madmac
set RDBRC=path_to_search_for_RDB.RC
# alpine debugger users only
set DBPATH=path_to_search_for_the_debugger # ditto
set ALNPATH=directory_where_you_put_ALN
set MACPATH=include_directory_for_madmac
Then you'll need to set up the build specs for GCC and the two
cross compilers (to be found in a file called "SPECS" in
"/bin/", "/bin/AGPU/2.6/" and "/bin/M68K/2.6"). make sure at
least the following directives are included and adapted to your
config:
(1) "/bin/SPECS" and "/bin/M68K/2.6/SPECS":
*asm:
-cgcc -fb
*cpp:
-nostdinc -I/put_your_include_directory_here/
*predefines:
-DJAGUAR -Amachine(JAGUAR) -DM68K -Acpu(M68K)
*cross_compile:
1
(2) "/bin/AGPU/2.6/SPECS":
*asm:
-cgcc -fb
*cpp:
-nostdinc -I/put_your_include_directory_here/
*predefines:
-DJAGUAR -Amachine(JAGUAR) -DAGPU -Acpu(AGPU)
*cross_compile:
1
At least for me this is what got things going after a lot of
messing around. Feel free to drop me a line via email if
you encounter problems setting up the Jaguar crossGCC.
STAY COOL - STAY ATARI
ray / tSCc for Alive, 2006-08-10
Please note that due to the sheer size of this article I
decided to split it after ray submitted it. You'll find the
review of the Starcat Development CD which is mentioned above
within the (P)review category of this issue, enjoy...
cxt