Some Words on
programming the
ATARI Jaguar
Some things ahead: it's great to see another issue of ALIVE and
it's great to see our scene seems to be rather active currently.
It was really great to meet some of you guys again at Outline
earlier this year, just as it was great to see the many stunning
releases. I'd like to spell out extra special thanks to Deez and
DHS for their efforts in creating the first stunning CT60
productions, to gwEm for his work on maxYMiser and misc.
interesting projects, to Defjam for his unexpected return to the
scene and his awesome smooth rotozoomer brought to us in this
issue's intro, and of course to everyone else I forgot.
Concerning my own coding, I didn't have much motivation or spare
time to do a lot of stuff recently as I'm very busy preparing
for my exams at this very moment. I hope you have a bit of
understanding and that it doesn't make me seem like a lazy
bastard again - be sure .tSCc. will have their release comeback
in the future - maybe on a fairly 'unorthodox' platform, though
:).
So what is this article going to be all about? Well, I'd like
to take this opportunity to advertise the ATARI Jaguar as a
demo-coding platform a bit since, as some of you might already
know, I've badly fallen in love with the machine's hardware,
especially from the point of view of a demo-coder. I had been
involved in coding all types of ATARI 16/32 bit machines such as
the ST, the Falcon and even the TT. And even if I didn't always
get as deep into those platforms' hardware as many others did
(for instance, I never fully learned how to code the Falcon's
DSP, to be honest merely out of laziness), as ugly as it might
sound, in my humble opinion none of those computers is capable
of providing a similar amount of horsepower when exploited
properly.
I've heard people say the Jaguar could have been so much better
if it didn't have that huge series of hardware bugs which make
coding this beast a mess, besides the fact that there aren't
many useable development kits to choose from. Let me try to
straighten things up a bit: first of all, people are right in
claiming that the major issue is getting a proper development
kit running. If you are interested in the different types of
available hardware, I would suggest you take a look at the
following site, which provides a comprehensive list:
http://www.jaysmith2000.com/Development.html .
I've come to find Bastian Schick's BJL-kit to be quite useable
and rather easy to install, as well. It uses a hacked boot EPROM
and sort of a null-modem cable that plugs into your ATARI/PC
parallel port or Jaguar joypad interface, respectively.
Modifying your Jaguar for BJL requires a tiny bit of knowledge
about electronics/soldering, otherwise you run the risk of
damaging it. Feel free to drop me a line if you want me to
modify your jaggy at a party or some other occasion.
To me BJL seems a good choice since a bunch of French sceners
are currently busy developing a brand new cable which plugs
into the PC's USB interface and new uploaders which are supposed
to be more reliable and also compatible with windoze 2000/xp and
Linux as opposed to the outdated uploading utilities provided by
Bastian Schick himself. Even a CompactFlash card which plugs
into your Jaguar's ROM connector is in the making by another
crazy Frenchman called SCPCD (hi there... :) ):
http://scpcd.free.fr/
The next question would be: which development software shall I
use? There's a very stable and powerful macro assembler called
"MadMac" which was used by ATARI Corp. itself and which makes up
a fully useable development environment along with "aln", the
accompanying linker. Both of these tools exist for different
platforms such as ATARI TOS, Linux and MS-DOS, but beware:
windoze (NT and above) users are forced to use an MS-DOS
emulator such as DOSBox, since the OS' built-in command line
doesn't get along with the DPMI used by madmac/aln at all and
will leave them in an unusable state. You'll be able to find
DOSBox here:
http://dosbox.sourceforge.net/
A remote debugger for all these platforms is available along
with Alpine only, whereas another debugger which works with BJL
is available for ATARI TOS systems (unfortunately it does not
work under MiNT though), while a brand new one is planned by the
authors of the new BJL cable (called "catnip", by the way), too.
It seems they might team up with swapD0 from Spain, who has at
least already built his own remote debugger for BeOS. I hope
they'll do so, as swapD0 also made his own assembler called
"jas", which is now available for BeOS and windows and which I
assume is supposed to become more madmac compatible in the
future, but it already looks promising:
http://www.freewebs.com/swapd0/tools.html .
There's also an outdated version (v2.6) of a cross GCC for
MS-DOS which is able to produce aln-linkable objects both for
the m68k and for the Jaguar's RISC GPU/DSP, but I'll say some
more about that further below. swapD0 claims to be working on
his own Jaguar C compiler, so feel free to drop him a line of
support via his email (I believe he's stuck on problems with
ANSI C's type conversion rules while the basic framework is
done).
Let's lose a few words about the rumoured lots of hardware bugs
which are said to restrict or even disable some of the machine's
features completely.
But before we continue I need to introduce a few more things.
What does the Jaguar offer that might make it so interesting to
a demo-coder in a nutshell?
(1) A Motorola 68000 CPU clocked at 13.3 MHz which is used as
    the main processor. I guess most of you are familiar with
    coding that one.
(2) A custom chip called "Tom" clocked roughly at 26 MHz
unifying common graphic processing capabilities which
contains:
- A pipelined RISC GPU containing two banks of thirty-two 32bit
  registers, providing a fast MAC unit which is also able to
  perform systolic matrix multiplications of up to 8x8
matrices (ideally suited for a fast DCT/iDCT), a separate
division unit, besides some special graphic processing related
instructions (for colour mixing, saturation instructions etc.)
plus 4KBytes of fast internal "cache". Rudimentary support for
floating point operations is implemented, too (namely a MTOI
and NORMI instruction, i.e. "mantissa to integer" and
"normalize integer").
- A special BLITTER to perform tasks such as scaling / texture
mapping blocks / scanlines of 1/4/8/16/24 bpp pixels at
16.16 bit precision, colour mixing of layers, hardware Gouraud
shading of single scanlines at 8.16 bit precision, hardware
Z-Buffering of single scanlines, and general block-transfer
tasks along with the usual logical operations in a phrase
(64 bit fetch) mode. All those operations can be limited to a
defined address window in order to provide a simple hardware
clipping.
- An object processor to process a linked list of bitmap,
  scaled bitmap, transparent (read: colour keyed) and misc.
  objects at 1/4/8/16/24 bpp. It supports various colour spaces
  / types of colour encoding, which are 1/4/8 bpp colour mapped
  using a 16 bpp CLUT (RGB/CrY), 15/16 or 24 bpp (555, 565, 888)
  RGB colour spaces and a special 15/16 bpp (448)
  chroma-luminance model which has been chosen to provide
  support for very fast shading, which just isn't as easy in the
  RGB model (see the little sketch below). The CrY colour scheme
  is also better suited to implement lossy image en/decoding
  algorithms such as the existing BPEG/JagPEG (JPEG-like)
  codecs; on a side note, I'm already working on my own
  implementation.
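Just to illustrate why shading is so cheap in CrY: the intensity
is simply the low byte of a 16 bpp pixel, so darkening or
brightening boils down to scaling that single byte. A rough
sketch in GPU assembler (register usage arbitrary, r0 holds the
pixel, r1 a brightness factor from 0..255):

  move    r0,r2
  shrq    #8,r2         ; isolate the colour (CR) byte...
  shlq    #8,r2
  sub     r2,r0         ; ...so r0 only keeps the intensity (Y)
  mult    r1,r0         ; scale the intensity (unsigned multiply)
  shrq    #8,r0         ; back down to 8 bits
  or      r2,r0         ; recombine: same colour, new brightness

Doing the same in RGB565 would mean splitting, rescaling and
repacking three channels instead of one.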
All elements of Tom are interconnected via a full 64 bit wide
data path while they can communicate with the main CPU via a 16
bit bus to do simple 16/32 bit transfers between CPU <-> GPU /
BLITTER or a special high speed 64 bit coprocessor bus to main
RAM when the GPU / BLITTER are driven in bus master transfers.
Bus sharing is also possible, which opens a whole new world of
pipelining possibilities (imagine the GPU computing the next
scanline, while the BLITTER transfers the previous scanline from
internal cache into local screen RAM and such). With pipeline
stalls removed, all GPU instructions are able to execute in a
single cycle, except for external loads/stores and the division,
which takes 18 cycles (1+16+1, so 16 actually, excluding head
and tail). That latency can be filled with instructions further
below which do not have to wait for the operation's result, so
with a careful implementation it effectively executes in one
clock cycle as well (see the small example below) - mind that
the division unit works in parallel with the ALU.
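To make that last point a bit more concrete, here is a rough
sketch of hiding the divide latency - register usage is
arbitrary, the point is merely that independent work goes in
between starting the division and using its result:

  div     r1,r0         ; r0 = r0/r1, the quotient arrives
                        ; roughly 16 cycles later
  load    (r5),r2       ; meanwhile do work that doesn't need r0
  add     r3,r2
  store   r2,(r6)
  add     r0,r4         ; first use of the quotient - the
                        ; pipeline only stalls here for whatever
                        ; is left of the divide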
(3) Another custom chip called "Jerry" clocked roughly at 27 MHz
unifying sound and communication capabilities which contains:
- Another pipelined RISC processor, the DSP, which is very
similar to the GPU regarding the programming model but which
has some instructions replaced by ones matching the
requirements of sound synthesis (modulo addressing for ring-
  buffers etc.). Furthermore, its MAC unit works at an extended
  precision of 40 bits (as opposed to the GPU's MAC unit, which
  works at 32 bits), the internal cache buffer has been extended
  to 8 KBytes, and a set of ROM wave-tables for sound synthesis
  has been added. Another difference to the GPU are the
  saturation operations, which have been replaced by signed
  versions matching the 40 bit precision here.
- Additionally, Jerry includes a clock controlling unit which
  incorporates a UART, programmable timers, a joystick interface
controller etc.
- A 16bit (14bit used) stereo audio DAC that should but doesn't
have to be fed by a DSP interrupt (you can use the other
processors, too)
Most of what's been said about the GPU goes for the DSP, too.
DSP <-> CPU and DSP <-> main RAM transfers work via the 16 bit
data bus.
(4) A programmable video controller to drive PAL/NTSC RGB or TV
set displays up to maximum resolutions of 720x220 pixels (non-
interlaced) or 720x440 pixels (interlaced).
(5) 2 MBytes of DRAM.
Quite neat, don't you think? If you'd like to find out a bit
more about taming and programming the hardware described above,
or if you are going to learn the GPU/DSP assembler syntax used
for the two RISC processors, I'd recommend you take a look at
the following document, which gives a fairly detailed insight:
http://www.hillsoftware.com/files/atari/jaguar/jag_v8.pdf .
If you are familiar with 680x0 assembler (I bet you are) it
should be rather easy for you to pick up the new syntax, which
has been designed to stay rather close to the m68k instruction
set imho.
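Just to give you a first taste of how familiar it feels, here's
a tiny made-up fragment (nothing meaningful happening, it merely
shows the flavour of the syntax):

  movei   #$F03F00,r2   ; some scratch longword in GPU RAM
  moveq   #10,r0        ; small immediates work like on the 68k
  movei   #$12345678,r1 ; 32 bit constants use movei instead
  add     r1,r0         ; r0 = r0 + r1 - destination comes last
  store   r0,(r2)       ; register indirect store, 68k style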
Let me now tell you something about pitfalls I had to overcome
in order to get started. First off, it isn't as bad as I
thought it would be after having heard so many fairytales. I'll
try to give you a little bug list that might have some impact on
your coding, though:
(1) Neither the GPU nor the DSP can execute any code from main
RAM:
This isn't generally true; they'll only work unreliably to
unpredictably when it comes to executing (conditional) jumps
from main RAM. However, this problem can be solved by using a
little cache management system which loads GPU/DSP program
sections from main RAM into GPU/DSP RAM on demand, prior to
executing the code there. Those sections won't need any
relocation as long as you stick to PC-relative code or put them
into absolute cache locations using the assembler's .org
directive, as in the little sketch below.
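A minimal sketch of what such a cache management boils down to.
Assume a code section of "size" longwords has been assembled
with .org to run at "cache" somewhere in GPU RAM (which starts
at $F03000) and currently sits at "section" in DRAM - all three
names are made up for the example:

  movei   #section,r0   ; source in main RAM
  movei   #cache,r1     ; destination in GPU RAM
  movei   #size,r2      ; length in longwords
copy:
  load    (r0),r3       ; external load from DRAM
  addq    #4,r0
  store   r3,(r1)       ; internal store into GPU RAM
  addq    #4,r1
  subq    #1,r2         ; sets the flags for the branch below
  jr      ne,copy
  nop                   ; branch delay slot
  movei   #cache,r0
  jump    T,(r0)        ; safe to jump now - we end up executing
  nop                   ; from internal RAM (delay slot again)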
(2) Buggy GPU/DSP pipelines which don't insert stalls or which
ignore previous ALU results in some situations:
This is something to be taken into account not only when you are
going to optimize your programs by rescheduling your code, but
also when you encounter one of those uncommon combinations where
the above is true. Here are some examples applying to both GPU &
DSP unless noted differently:
(2.1) Consecutive divides:
  div   r0,r1      change to   div   r0,r1
  div   r1,r2                  or    r1,r1    ; 'touch' register
                               div   r1,r2
(2.2) Consecutive writes to the same register (very uncommon):
  load  (r3),r2    change to   load  (r3),r2
  moveq #3,r2                  or    r2,r2    ; 'touch' register
                               moveq #3,r2
  ; ^ r2=(r3) here             ; ^ r2=3 here
  ; (who would do this anyway?)
(2.3) The pipeline will not update register values after a long
latency instruction (divide, external load) before an indexed
store:
  div   r0,r1      change to   div   r0,r1
  store r1,(r14+4)             or    r1,r1    ; force the div
                                              ; to finish
                               store r1,(r14+4)
(3) Hardware Z buffering is unusable due to a hardware bug:
This is not entirely true either. It's true that the Z buffer
data has to be phrase (i.e. 64 bit) aligned, which isn't always
the case for, say, a polygon's scanlines, which might start at
an arbitrary pixel position.
There are basically two solutions to this problem. Either hold
a scanline-based, phrase-aligned list of Z values so the Z
buffering works at scanline granularity, which may still produce
a better result than the painter's algorithm, or prepare your
scanline at a fixed and phrase-aligned cache position in GPU RAM
before writing it to screen RAM. The latter might come in handy
when you let the BLITTER perform multiple passes per scanline,
which should be done in GPU RAM for speed reasons anyway (e.g.
Gouraud shaded texture mapping).
(4) The object list needs to be refreshed each screen redraw:
There's a faulty design in the object processor which causes it
to destroy some fields in the object list's records, meaning the
object list has to be rebuilt each VBL. I'd recommend building
the object list once into a primary buffer which is then copied
to a second buffer actually used by the OLP (object list
pointer) during the VBL handler. At least this is the way it's
being done in Doom and the way which seems to work reliably -
mind that you have to swap the pointer before transferring it to
the OLP register since the OP expects the words in reversed
order (see the little snippet below)! Rebuilding the list each
VBL takes unnecessary time, so copying from a pregenerated
buffer appears decent to me unless you have to handle a dynamic
list. Using the BLITTER for copying seems ridiculous too, unless
you have to maintain very large sprite lists for instance.
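On the 68000 side that word swap is just a single swap
instruction. A tiny sketch of the VBL-side handover, assuming
"oblist" is your (phrase aligned) second buffer and OLP is the
object list pointer register as defined in the usual jaguar.inc:

  move.l  #oblist,d0    ; address of the freshly copied list
  swap    d0            ; the OP wants the two words reversed
  move.l  d0,OLP        ; hand the list to the object processor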
The list could go on here, but these are what I've found to be
the most sneaky situations causing problems at all. Please also
take care of the fact that the pipeline is designed to execute
one more instruction after every jump instruction, so either
insert a nop there to be sure or simply don't mind the next
instruction being executed when it is uncritical - this can be
nice when optimizing your code as well (see the short example
below). There are some more rare situations where a nop must be
inserted, but these will usually be recognized by madmac and it
will also warn about them (a jump or movei right after a jump,
for instance). I'd suggest referring to the above jag_v8.pdf
document for more information on hardware and software bugs in
the Jaguar.
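Here is a small example of putting that delay slot to use
instead of wasting it on a nop - a simple clear loop where the
pointer advance lives in the slot (register usage arbitrary,
"buffer" made up for the example):

  movei   #buffer,r1    ; some longword aligned scratch area
  moveq   #0,r0
  moveq   #8,r2         ; clear 8 longwords
wipe:
  store   r0,(r1)
  subq    #1,r2         ; sets the flags tested by the jr below
  jr      ne,wipe       ; loop while r2 != 0 ...
  addq    #4,r1         ; ...the delay slot executes on every
                        ; pass, taken or not, so advancing the
                        ; pointer here is safe and costs nothing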
Another general rule, not only when it comes to optimizing your
code, should be to refrain from external (DRAM) loads/stores
whenever possible, as they tend to be a lot slower than internal
reads/writes in GPU/DSP RAM. This is the main reason for texture
mapping being noticeably slower than Gouraud shading, which
doesn't require an additional source read in DRAM by the
BLITTER - unfortunately the GPU RAM is far too small to hold any
reasonably sized textures. But behold, it's still much faster
than one might expect according to some people's annoying
rumours about the Jaguar's creepingly slow texture mapping, when
implemented in a smart, pipelined manner - just take a look at
games like Iron Soldier or Hover Strike on that matter.
Before I move on to the starcat dev CD review, I will tell you
some more about the cross GCC mentioned above. To me it was one
hell of a deal to set up that chain of tools, but once I found
out how to do it, GCC ran fine and even recompiled the source
code of id Software's DOOM. As it is capable of outputting BSD
linkable objects that will be eaten up by ALN, the compiler can
be used along with separate MADMAC assembler modules quite well,
which makes it a nice development environment, especially for
larger projects such as games (I've come to find that writing a
full game completely in assembler isn't my cup of tea after sort
of finishing up wolf3d, btw.). On the other hand it fucks up
under DOSBox quite often and there isn't any set of readily
adapted/compiled libraries, which is why I'm putting my hope
into swapd0 finishing his new Jaguar C compiler. Here's what
you'll need to get started (you can find the required files on
the starcat dev CD or on the web - read below):
Depack the full set of GCC tools into a directory of your
harddrive called "bin" or something. It should now contain two
subfolders called "AGPU/2.6/" and "M68K/2.6" where the cross
compilers will reside. The next step is to set up some
environment variables (for DOSBox etc.):
set PATH=directory_where_you_put_your_bin_utils
set LIBDIR=directory_where_you_put_gccs_libraries
set INCDIR=directory_where_you_put_include_files
set GCC_EXEC_PREFIX=/
# you should leave this blank
set TMPDIR=directory_where_you_want_temporary_files_to_be_put
# GCC
set TEMP=directory_where_you_want_temporary_files_to_be_put
# same for madmac
set RDBRC=path_to_search_for_RDB.RC
# alpine debugger users only
set DBPATH=path_to_search_for_the_debugger # ditto
set ALNPATH=directory_where_you_put_ALN
set MACPATH=include_directory_for_madmac
Then you'll need to set up the build specs for GCC and the two
cross compilers (to be found in a file called "SPECS" in
"/bin/", "/bin/AGPU/2.6/" and "/bin/M68K/2.6"). make sure at
least the following directives are included and adapted to your
config:
(1) "/bin/SPECS" and "/bin/M68K/2.6/SPECS":
*asm:
-cgcc -fb
*cpp:
-nostdinc -I/put_your_include_directory_here/
*predefines:
-DJAGUAR -Amachine(JAGUAR) -DM68K -Acpu(M68K)
*cross_compile:
1
(2) "/bin/AGPU/2.6/SPECS":
*asm:
-cgcc -fb
*cpp:
-nostdinc -I/put_your_include_directory_here/
*predefines:
-DJAGUAR -Amachine(JAGUAR) -DAGPU -Acpu(AGPU)
*cross_compile:
1
At least for me this is what got things going after a lot of
messing around. Feel free to drop me a line via email if
you encounter problems setting up the Jaguar crossGCC.
STAY COOL - STAY ATARI
ray / tSCc for Alive, 2006-08-10
Please note that due to the sheer size of this article I
decided to split it after ray submitted it. You'll find the
review of the Starcat Development CD which is mentioned above
within the (P)review category of this issue, enjoy...
cxt