Beyond Brown

When brown just isn't enough

Blitter FAQ

     ____    ___    ___________      ___________
    /    \  /  /\  /__/__   __/_____/  ____/    \
   /     /\/  / / /  /\//  /__   __/  __/ /     /\
  /      \/  /_/_/  / //  /\_/  / /  /___/  /\  \/
 /_______/______/__/ //__/ //  / /______/__/ /\__\
 \_______\______\__\/ \__\//__/ /\______\__\/  \__\
                           \__\/

The Atari ST(E) BLiTTER in brief

A comprehensive overview

by The Paranoid of Paradox 2012

Table of Contents

a.) Introduction

b.) Register Overview

c.) What to do with it

d.) Common mistakes

e.) Compatibility

f.) Summary

Appendices

a.) Introduction

This little documentation covers the Atari STE BLiTTER, its registers and how to operate them. For the basics, it relies in part on the STE-FAQ by the same author from a couple of years ago; please refer to that document for a more detailed look at the registers and their internals.

It should also be noted that the Atari STE BLiTTER is identical to the Atari ST BLiTTER regarding its registers and interface. Hardware-wise, it is different as it has been integrated into the COMBEL, a chip that hosts various functions.

What actually is the BLiTTER for ? Having a look at Atari’s own official documentation, the BLiTTER is just the Bit-Blt (BLiT) algorithm as specified by Newman and Sproull (“Principles of Interactive Computer Graphics”, McGraw-Hill, 1979, Chapter 18). It is meant to copy graphical data, organized in bit maps, and to manipulate it by applying masks and halftones and by logically combining source and destination. The Atari STE BLiTTER is a slightly more advanced version of this set of algorithms since it has direct memory access and does not rely on an external bus master to feed it data.

This little documentation is not a hardware reference manual but instead describes some algorithms of what can be done by the BLiTTER if programmed accurately.

Thanks to various people like Ray of .tSCc., ultra of Cream^Orb, Kalms of TLB, No of Escape and Tobe of MJJ Prod for their help and support. Maximum thanks of course to my crewmates 505, Dan, RA and Zweckform of Paradox.

Also, this documentation relies on the Atari ST/STE/TT Profi-Book by Jankowski, Rabich, Reschke (Sybex, 12th edition 1992) and The TOS 1.04 Update Book by Pauly (Data Becker).

b.) Register overview

The Atari STE BLiTTER is memory mapped and capable of direct memory access. The registers of the Atari STE BLiTTER are:

$FFFF8A00: Halftone Pattern 0  (16 Bits)
$FFFF8A02: Halftone Pattern 1  (16 Bits)
$FFFF8A04: Halftone Pattern 2  (16 Bits)
...        ...
$FFFF8A1E: Halftone Pattern 15 (16 Bits)

These registers make up a 16 x 16 pixels dither pattern to be used for masking source data on special request (see below).

$FFFF8A20: Source X Increment (15 Bit - Bit 0 is unused)
$FFFF8A22: Source Y Increment (15 Bit - Bit 0 is unused)
$FFFF8A24: Source Address     (23 Bit - Bit 31..24, Bit 0 unused)

This sets up everything related to source addressing. It sets the 24-Bit base address where to read data from, how many bytes to add after each word copied (X Increment) and how many bytes to add after each line copied (Y Increment) to the source address.

$FFFF8A28: ENDMASK 1 (16 Bits)
$FFFF8A2A: ENDMASK 2 (16 Bits)
$FFFF8A2C: ENDMASK 3 (16 Bits)

These masks are overlaid with the destination words and only bits that feature a “1” are actually manipulated by the BLiTTER. All bits containing a “0” will be left untouched. ENDMASK 1 refers to the first word per line being copied, ENDMASK 3 to the last word in each line being copied while ENDMASK 2 affects all copies in between.

$FFFF8A2E: Destination X Increment (15 Bit - Bit 0 unused)
$FFFF8A30: Destination Y Increment (15 Bit - Bit 0 unused)
$FFFF8A32: Destination address     (23 Bit - Bit 31..24/0 unused)

Much like the source address generation, this covers the target or destination addressing, containing increment per word being copied (X), per line (Y) and the target address to start with.

$FFFF8A36: X Count (16 Bits)
$FFFF8A38: Y Count (16 Bits)

These two registers configure how many words (!) to copy per line (X) and how many lines to copy in total (Y). Please note that these values are 16 Bits wide and unsigned.

$FFFF8A3A: HOP (8 Bits)
$FFFF8A3B: OP  (8 Bits)

HOP stands for “Halftone OPeration” mode and configures in what way halftone and data read from memory using source addressing are being overlaid. It contains values 0..3:

  • 0 = All bits are generated as “1”
  • 1 = All bits taken from halftone patterns
  • 2 = All bits taken from source
  • 3 = Source and halftone are AND combined

The OP Register stands for “OPeration” mode and configures how destination and source data are being overlaid. It contains values 0..15:

  • 0 = Target Bits are all “0”
  • 1 = Target Bits are Source AND Target
  • 2 = Target Bits are Source AND NOT Target
  • 3 = Target Bits are Source
  • 4 = Target Bits are NOT Source AND Target
  • 5 = Target Bits are Target
  • 6 = Target Bits are Source XOR Target
  • 7 = Target Bits are Source OR Target
  • 8 = Target Bits are NOT Source AND NOT Target
  • 9 = Target Bits are NOT Source XOR Target
  • 10 = Target Bits are NOT Target
  • 11 = Target Bits are Source OR NOT Target
  • 12 = Target Bits are NOT Source
  • 13 = Target Bits are NOT Source OR Target
  • 14 = Target Bits are NOT Source OR NOT Target
  • 15 = Target Bits are all “1”
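The two lists above can be modelled on a host machine. The following is a minimal sketch, not a hardware reference: the function names `combine_hop` and `apply_op` are made up for this illustration, and the sketch assumes that the four OP bits act as a truth table over source and target, which reproduces every entry of the OP list.

```python
def combine_hop(hop, source, halftone):
    """Return the 16-bit word produced by the HOP stage (hop in 0..3)."""
    if hop == 0:
        return 0xFFFF                       # all bits generated as "1"
    if hop == 1:
        return halftone & 0xFFFF            # halftone pattern only
    if hop == 2:
        return source & 0xFFFF              # source only
    return source & halftone & 0xFFFF       # source AND halftone


def apply_op(op, source, target):
    """Return the new target word for OP mode 0..15.

    Assumption of this sketch: each OP bit enables one minterm.
      bit 0 -> Source AND Target
      bit 1 -> Source AND NOT Target
      bit 2 -> NOT Source AND Target
      bit 3 -> NOT Source AND NOT Target
    """
    s, t = source & 0xFFFF, target & 0xFFFF
    ns, nt = s ^ 0xFFFF, t ^ 0xFFFF
    result = 0
    if op & 1:
        result |= s & t
    if op & 2:
        result |= s & nt
    if op & 4:
        result |= ns & t
    if op & 8:
        result |= ns & nt
    return result
```

For example, `apply_op(3, s, t)` returns the source unchanged and `apply_op(6, s, t)` returns Source XOR Target.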

$FFFF8A3C: Misc. Register (8 Bits)

This register is a bit structure of the following content:

  • Bit 7 = BUSY Bit (Write: Start/Stop, Read: Status Busy/Idle)
  • Bit 6 = HOG Mode (Write: HOG/BLiT mode, Read: Status)
  • Bit 5 = Smudge Mode (Write: Smudge/Clean mode, Read: Status)
  • Bit 3..0 = Line number of Halftone Pattern to start with

$FFFF8A3D: Misc. Register (8 Bits)

This register is a bit structure of the following content:

  • Bit 7 = Force eXtra Source Read (FXSR)
  • Bit 6 = No Final Source Read (NFSR)
  • Bit 3..0 = Skew (Number of right shifts per copy)

A lot of registers of which some are totally confusing - What the heck is a smudge mode ? And where’s the difference between HOG and BLiT-mode ? For more details, please see the STE-FAQ in which most of the registers are explained in much more detail.

However, to give a comprehensive summary, all registers of the BLiTTER should be initialized by software before usage. There is no way of telling in what state the BLiTTER is when your software takes over, so every register should be cleaned even if it is not being used by your software. Once the BLiTTER has been initialized, the minimum setup to get it going is to write the source increments and address, the destination increments and address, the OP-mode and the count registers, and then to set it to “busy” - for simple copies in OP-mode “3” aligned on word boundaries, that’s sufficient.

The BLiTTER modifies some of the registers internally, namely the source and destination address registers and the Y count. The X count is latched internally and restored after each line. For consecutive copies with incremental addresses, it’s therefore sufficient to simply set a new number of lines in Y count and to set it to “busy” again.

The BLiTTER is (unfortunately) not restricted to fixed execution times. Its effective execution times “per (16-Bit) copy” heavily depend on the chosen HOP- and OP-modes, in clock cycles:

      OP
HOP    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
   0   4  8  8  4  8  8  8  8  8  8  8  8  4  8  8  4
   1   4  8  8  4  8  8  8  8  8  8  8  8  4  8  8  4
   2   4 12 12  8 12  8 12 12 12 12  8 12  8 12 12  4
   3   4 12 12  8 12  8 12 12 12 12  8 12  8 12 12  4

(Actually, ENDMASKs differing from $FFFF are rumoured to also delay the BLiTTER, i have not yet verified this.)
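To get a feeling for the numbers, the timing table can be turned into a small estimator. This is only a rough sketch: `blit_cycles` is a name made up for this example, and it assumes that every word costs exactly the table value, ignoring bus arbitration, extra source reads and any ENDMASK penalty.

```python
# Cycles per 16-bit copy, indexed as CYCLES[hop][op], taken from the table above.
ROW_NO_SOURCE = [4, 8, 8, 4, 8, 8, 8, 8, 8, 8, 8, 8, 4, 8, 8, 4]
ROW_SOURCE = [4, 12, 12, 8, 12, 8, 12, 12, 12, 12, 8, 12, 8, 12, 12, 4]
CYCLES = [ROW_NO_SOURCE, ROW_NO_SOURCE, ROW_SOURCE, ROW_SOURCE]


def blit_cycles(hop, op, x_count, y_count):
    """Estimate the clock cycles for x_count words on each of y_count lines."""
    return CYCLES[hop][op] * x_count * y_count
```

A plain copy (HOP 2, OP 3) of 20 words by 200 lines comes out at 8 x 20 x 200 = 32000 cycles in this model.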

To optimize code that writes to the BLiTTER often and regularly, it might be interesting to know which registers must be set by software, which are updated by the BLiTTER and which are sort of constant until software requires different settings:

$FFFF8A00: Halftone pattern 0..15:    Constant
$FFFF8A20: Source increment X and Y:  Constant
$FFFF8A24: Source address:            Updated by BLiTTER
$FFFF8A28: ENDMASKs 1..3:             Constant
$FFFF8A2E: Dest. increment X and Y:   Constant
$FFFF8A32: Dest. address:             Updated by BLiTTER
$FFFF8A36: X-Count:                   Latched by BLiTTER
$FFFF8A38: Y-Count:                   Must be updated by CPU
$FFFF8A3A: Setting-Registers:         Constant (except for BUSY)

This implies that, between consecutive BLiT-operations with unchanged settings, the minimum is to write the number of lines to Y-Count and to set the BUSY-Bit. The address registers are automatically updated by the BLiTTER internally (using the configured increments) and the X-Count is internally latched so that the BLiTTER automatically restores the value that was in the register after completing one line.

While this settles the descriptive part about the interfaces and some internals of the BLiTTER, what can we actually do with it ?

c.) What to do with it ?

So what can the BLiTTER do for you ? What does it do quickly and how can you gain maximum benefit in your software from it ? Here are a few techniques i have found useful and i try to arrange them in order of difficulty …

1.) Saturated increments and decrements on word-based chunky buffers

One of the simplest tricks that the BLiTTER can do for you is that it can do saturated increments and decrements quite quickly on chunky graphics buffer in which each pixel consists of a word instead of a byte. Commonly, chunky graphics on the Atari STE are pre-multiplied by 4 to have quicker conversions to the planar format that the Shifter requires so that pixel colour “1” is represented by the value “4”, colour “2” by “8” and so on, up to colour “15”, which is represented by the value “60”. All this effect actually requires is a decently organized halftone pattern, the smudge mode and some logic …

The halftone pattern is just a 16 word table. When doing saturated decrements, each element holds its line number minus “1”, for saturated increments, each element holds its line number plus “1” and, of course, both need to be overflow/underflow protected:

              Decrement         Increment

Halftone  0:   0 - 1: $0000      0 + 1: $0004
Halftone  1:   1 - 1: $0000      1 + 1: $0008
Halftone  2:   2 - 1: $0004      2 + 1: $000C
Halftone  3:   3 - 1: $0008      3 + 1: $0010
…              …                 …
Halftone 14:  14 - 1: $0034     14 + 1: $003C
Halftone 15:  15 - 1: $0038     15 + 1: $003C

Now the BLiTTER can be made to use the lowest 4 bits of the word it just read to look up a line of the halftone pattern and to use this one for the configured HOP-mode. While this has been intended to allow some sort of random-ish halftone pattern usage, it works fine for what is required: If the BLiTTER reads, for example, the colour value “7”, it will look up line “7” in the Halftone pattern in smudge-Mode, which contains the value “6” in the decrement and the value “8” in the increment setup. While this sounds pretty easy at first, there’s one obstacle to overcome still: The words of the chunky graphic the BLiTTER reads are pre-multiplied by 4 (and so are the entries in the halftone pattern). Of the lowest 4 bits that the BLiTTER reads, only bits 2 and 3 are used in any way and bits 0 and 1 are always 0. But that’s not a problem at all since the BLiTTER can do right shifts BEFORE looking up the line of the halftone pattern by simply setting the “skew”-value to “2”. Now the BLiTTER reads a word which contains the numbers 0 to 60 in multiples of 4, shifts these down by 2 bits, making it range from 0 to 15 effectively, then looks up the line in the halftone pattern, reads the entry and copies it back. All that is required otherwise is to set the OP-register to “3” (Target = Source) and the HOP-register to “1” (Halftone only).

Unlike stated in various kinds of documentation, the BLiTTER does not only apply smudge once per line but on every read it performs, making this a quite quick way of doing simple saturated increments and decrements.
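The whole trick fits into a few lines when modelled on a host machine. The sketch below is an illustration under assumptions, not a description of the silicon: `make_halftone` and `smudge_pass` are hypothetical names, and the model assumes smudge picks the halftone line from the low 4 bits of the skewed source word on every read, with HOP = 1, OP = 3 and all ENDMASKs set to $FFFF.

```python
def make_halftone(step):
    """Build the 16-line pattern: line n holds 4 * clamp(n + step, 0, 15)."""
    return [4 * min(15, max(0, n + step)) for n in range(16)]


def smudge_pass(buffer, halftone, skew=2):
    """Model one BLiTTER pass over a word-per-pixel chunky buffer.

    Bits carried over from the previous word only enter at the top of the
    skewed word, so they never reach the low 4 bits used for the lookup.
    """
    result = []
    for word in buffer:
        line = (word >> skew) & 15      # smudge: low 4 bits after skewing
        result.append(halftone[line])   # HOP 1 / OP 3: write the halftone word
    return result
```

Calling `smudge_pass(pixels, make_halftone(+1))` brightens every pixel by one colour step, `make_halftone(-1)` darkens it, and both ends saturate.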

Can this also be used for byte-based chunky buffers ? Yes, it actually can, it only needs 2 passes then, and it works the following way: Set up the halftone pattern registers as described above and set skew to “2” as above. However, set all ENDMASKs to “0x00FF” which implies that only the low byte of each word will be affected. Now activate smudge-mode and let the BLiTTER run over your chunky buffer once. Re-configure the halftone pattern by shifting each entry by 8 bits to the left, set skew to “10” and set all ENDMASKs to “0xFF00”, meaning that only high bytes will be affected. Now make the BLiTTER run over your chunky buffer again. What did it do this time ? It reads a word and shifts it down by 10 bits, effectively shifting out the low byte completely and scaling the high byte down to range from 0 to 15. Then it uses the result to look up the line of the halftone pattern, which contains a value preshifted by 8 bits. Now the BLiTTER writes back the whole word but the ENDMASKs prevent it from overwriting the low byte that was generated in the previous pass.

Sneaky, isn’t it ?

(Actually, you don’t even need to reconfigure the halftone pattern when switching from odd to even bytes or back. All you need to do is set the halftone pattern to bytes, too, like $0000, $0404, $0808, $0C0C, $1010, $1414, $1818… and the ENDMASKs will take care of protecting the byte not to be overwritten)
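The two passes can be sketched as follows, using the byte-duplicated halftone pattern so that nothing has to be reconfigured in between. The function names are invented for this example, and the model again assumes that smudge selects the halftone line from the low 4 bits of the skewed source word.

```python
def make_byte_halftone(step):
    """Pattern with the same saturated value in both bytes ($0000, $0404, ...)."""
    pattern = []
    for n in range(16):
        value = 4 * min(15, max(0, n + step))
        pattern.append((value << 8) | value)
    return pattern


def smudge_byte_pass(buffer, halftone, skew, endmask):
    """One pass: look up by the skewed source, write only the bits set in endmask."""
    result = []
    for word in buffer:
        line = (word >> skew) & 15
        keep = word & ~endmask & 0xFFFF          # bits protected by the ENDMASK
        result.append((halftone[line] & endmask) | keep)
    return result


def saturated_step_bytes(buffer, step):
    """Pass 1 handles the low bytes (skew 2), pass 2 the high bytes (skew 10)."""
    halftone = make_byte_halftone(step)
    buffer = smudge_byte_pass(buffer, halftone, skew=2, endmask=0x00FF)
    return smudge_byte_pass(buffer, halftone, skew=10, endmask=0xFF00)
```

Each word here holds two pixels, both pre-multiplied by 4, for example $0428 for the colours 1 and 10.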

2.) Making the BLiTTER “add” colour values in c2p based effects

It was Kalms of The Black Lotus who wondered why no one actually thought of using the BLiTTER for simple maths on the STE - which seems to be quite common on the Amiga. The problem is that the BLiTTER in the STE does not have some sort of overflow treatment and therefore cannot be used for any kind of mathematics … Or can it ?

The trick is to separate the addition from the potential overflow that is involved and c2p is a very thankful concept for this. How does it work then ? Let’s consider a word per pixel again with only 8 colours being used effectively and let’s consider that the BLiTTER is meant to handle 4 graphic objects that should be overlaid cleanly, for example 4 blobs. Now reorganize the graphic of your blob to use bits 13 to 11 instead of bits 4 to 2. Then make the BLiTTER copy the first blob using “Target = Source” and ENDMASKs set to all “0xFFFF”. This will both clear the chunky buffer under the blob as well as produce the first blob. Copy the second blob with a skew value of “3” using the same graphic and the OP-register set to “Target = Source OR Target”. Copy the third blob using a skew value of “6” and the fourth one using a skew value of “9”, all using “Target = Source OR Target”. The resulting bits of each word your code has generated looks like this:

Bit:  15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
      |___| |______| |______| |______| |______| |___|
     unused  Blob 0   Blob 1   Blob 2   Blob 3  unused

That didn’t really “add” the colour values of each blob, did it ? No, not yet. It requires a customized c2p table to turn the 4 pixel colours stored in this word into 1 final colour. It will cost some memory since each pixel consists of 12 bits used instead of 4, but it’s still small enough: For double-pixel mode, it requires 4 x 4096 x 4 bytes = 65536 Bytes and is fairly easy to generate:

  • Loop over all potential colours of all 4 pixels
  • Perform saturated adds of all 4 pixel colours
  • Generate the c2p-table entry

Naturally, this routine is somewhat limited but 4 pixel colours can already be quite a lot, for example when displaying zooming blobs.
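A host-side sketch of the idea, with made-up function names: `overlay_blobs` models what the four BLiTTER passes leave in one chunky word, and `build_add_table` performs the saturated adds for all 4096 combinations. The real table would store a c2p entry per combination; this sketch only stores the final colour 0..15, since the c2p layout itself is not covered here.

```python
def overlay_blobs(colours):
    """Combine four blob colours (each 0..7) into one chunky word.

    The blob graphic keeps its colour in bits 13..11; blob i is copied with
    SKEW = 3 * i, the first with Target = Source, the others with
    Source OR Target.
    """
    word = 0
    for index, colour in enumerate(colours):
        source = (colour & 7) << 11
        word |= source >> (3 * index)
    return word


def build_add_table():
    """One entry per 12-bit combination: the saturated sum of the four colours."""
    table = []
    for index in range(4096):
        colours = [(index >> shift) & 7 for shift in (9, 6, 3, 0)]
        table.append(min(15, sum(colours)))
    return table


def lookup(table, word):
    """Bits 13..2 of the chunky word form the table index."""
    return table[(word >> 2) & 0xFFF]
```

With blobs of colour 1, 2, 3 and 4 overlapping at one pixel, the word becomes $0A70 and the table returns colour 10.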

3.) BLiTTER sprites

3.a.) Where they make sense

BLiTTER based sprites are slightly more complex than they first seem, depending on the required amount of flexibility. Hence, the more flexibility you need, the more complicated your BLiTTER routine will be and at the same time, the more speed gain you will observe. If you want to push around bulky sprites with many animation phases so that preshifting doesn’t really work out, they make sense. If you want to generate sprites in realtime and therefore don’t have the option to preshift, they make sense. If you need additional funky operations such as applying halftone patterns or logical operations, they make sense.

3.b.) Where they do not make sense

On small-scaled, non-masking 2 bitplane sprites commonly used for sprite record screens - The CPU can do that a lot faster. They also make no sense when the sprites your routine handles can be fully preshifted and do not really require to be fully masked. Also, on the Falcon030, the BLiTTER can barely keep up with the CPU in some operations; in many operations the CPU is faster and only in a small amount of operations the BLiTTER turns out to be faster. For true-colour sprites however, the BLiTTER hardly makes sense.

3.c.) Positioning and sizes

You know that the Atari STE shifter uses interleaved bitplanes of words. So to position your sprite horizontally, you need to take the x-coordinate, divide it by 16 to find out which word - i.e. which 16 bits in screen memory in this line - it is and then multiply by 8 in low-res to go to the correct set of words since 4 words = 8 Bytes make up 16 pixels in 4 bitplanes (in mid-res, you multiply by 4 Bytes = 2 words and in high-res, you do not multiply at all). To get the correct line, you simply multiply the y coordinate of your sprite with the number of bytes per line and add that to the target address.

Now you calculated the target address for the correct X and Y coordinate, but your sprite will always start from the leftmost pixel, so you need an additional offset to find out which bit your sprite needs to start - Naturally, this is the X coordinate modulo 16 (or AND 15) and to this coordinate, you need to shift your sprite. To sum it all up, you need to generate 3 values from your X and Y coordinate:

  • Y coord * (Number of bytes per line) -> Add to target address
  • (X coord / 16) * (width of bitplane set) -> Add to target address
  • (X coord & 15) -> Amount of right shifts for your sprite

And since the BLiTTER is terribly good at shifting - unlike the plain 68000, this should speed things up. Or shouldn’t it ?
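The three values can be computed like this. It is a minimal sketch: `sprite_position` is a hypothetical helper, and the default of 160 bytes per line with 4 bitplanes assumes the standard low-res screen layout.

```python
def sprite_position(x, y, bytes_per_line=160, bitplanes=4):
    """Return (byte offset to add to the target address, number of right shifts)."""
    line_offset = y * bytes_per_line            # Y coord * bytes per line
    word_offset = (x // 16) * (bitplanes * 2)   # (X coord / 16) * width of bitplane set
    skew = x & 15                               # X coord modulo 16
    return line_offset + word_offset, skew
```

A sprite at x = 37, y = 10 in low-res ends up 1616 bytes into the screen with 5 right shifts; in mid-res (2 bitplanes) the offset is 1608.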

3.d.) Some nonsense about the BLiTTER’s internals

The FXSR and NFSR bits in the configuration register $FFFF8A3D are terribly confusing when first confronted with the BLiTTER and are still hard to explain. So let’s figure out some of the BLiTTER’s internals.

The BLiTTER actually has a 32-Bit internal buffer of which usually only the lower 16 bits are used when using SKEW = 0:

Bits  31 29 27 25 23 21 19 17  15 13 11  9  7  5  3  1
      |______________________| |______________________|
        Overlap/Skew buffer          Read/Write

So what are the upper 16-Bits actually good for ? They actually make up the skew. Instead of shifting bit by bit, the BLiTTER simply reads out 16 Bits starting from bit position (15 + SKEW), i.e. bits (15 + SKEW)..SKEW, taking SKEW bits from the upper Overlap/Skew buffer. After each write, the BLiTTER automatically copies the lower 16 bits into the upper and proceeds.

This way, the BLiTTER appears to be shifting to the right (which it is not) and at the same time allows to “carry” over all bits being “shifted” out of the initial word being read into the next copy. For example, if we consider a SKEW value of 3, the BLiTTER reads the source word into the lower word using bits 15..0, then reads out bits 18..3 (filling the 3 upper bits with the garbage that the upper word of the BLiTTER internal buffer still features so make sure your ENDMASKs are set up decently!). Afterwards, it copies Bit 15..0 to Bit 31..16, then reads new data into the Bits 15..0, but again reads out Bit 18..3, now containing the right-most 3 bits of your previously copied word in the 3 left-most bits! This way, the BLiTTER can “carry” up to 15 Bits without consuming any additional time at all.
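The buffer mechanics can be tried out in a few lines. This is a sketch of the description above, not of the actual hardware: `skew_words` is an invented name, and the upper half of the buffer starts out as zero here, whereas on the real chip it contains whatever was left over.

```python
def skew_words(source_words, skew):
    """Model the 32-bit buffer: return the words read out for each source word."""
    buffer = 0
    result = []
    for word in source_words:
        # Move the previous word into the upper half, read the new one below it.
        buffer = ((buffer << 16) | (word & 0xFFFF)) & 0xFFFFFFFF
        # Read out bits (15 + skew)..skew.
        result.append((buffer >> skew) & 0xFFFF)
    return result
```

With a SKEW of 3, the words $FFFF, $0000 come out as $1FFF, $E000: the three bits shifted out of the first word reappear at the left of the second one.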

Back to the sprites. Why is it so important to know about the BLiTTER’s internals for sprites when it’s so obvious that the number of right-shifts calculated in chapter 3.c.) is to be written into the SKEW register ? Because it might not be sufficient. And here’s where one of the two most complicated registers of the BLiTTER comes in handy.

3.e.) The No-Final-Source-Read Bit

Now let’s consider a sprite you want to copy to any possible x-coordinate so you write the x-coordinate modulo 16 to the SKEW register and sometimes, this doesn’t really work out. Why ? Because the BLiTTER shifts out bits - as soon as SKEW is unequal to zero - and needs one additional write access to put out the final bits having been “carried” from the last word having been read. It does not need to COPY another word, it only needs to WRITE the final bits.

And that’s what NFSR is for. Here’s what it does: On the last copy of each line, the BLiTTER does not read the source but instead only reads out the 16 bits from the current Overlap/Skew and Read/Write buffer as given by the SKEW value and writes them to memory, according to the target address, implying that the final set of “carry” bits are now being written. Since the lower 16 Bits still contain some “outdated” data - the BLiTTER has not performed a read access to refresh this buffer when NFSR is set - you will observe junk on the screen unless your ENDMASK3 is set up properly (hint!).

Also, this final copy counts as one with regard to the X-Count register, even when NFSR is set, meaning that you need to set the X-Count register to one more word than you would if SKEW was zero. However, the source address is incremented one time less than the destination address is because the last word being written has not been read, so you also need to adapt the Y increment for source addressing accordingly.

3.f.) Not really needed for BLiTTER sprites but while we’re at it

So now that we have explained the NFSR bit, what actually does the Force-eXtra-Source-Read (FXSR) bit do ? Easy. Wouldn’t it be all a bit limited if the BLiTTER couldn’t do left-shifts ? Exactly. And that’s what FXSR is for. When using FXSR, the BLiTTER will read one word into the top 16 bits prior to anything else it does. And by doing so, it effectively shifts to the left by 16 pixels so that the number of effective left shifts is given by (16 - SKEW). However, FXSR also implies that the source address will be incremented once more than the destination address will be, so set up your Y increments accordingly. Unlike NFSR, it does not count against the X Count register as it is treated as an extra read and not as a copy.

Does it make any sense to have NFSR and FXSR set at the same time ? While it first sounds like it would not, it turns out to be useful quite often: Whenever a left-shift is needed, even if only one bit, the FXSR needs to be set, but that also makes source address being incremented once so that, when the first real copy is being performed, it’s considered the first “write” but the second “read”. To make up for this, the NFSR does correct this on the final copy when it prevents one read but performs one write - balancing reads and writes. However, this depends strongly on the size of the bitfield you want to copy.

To make it more visible, let’s consider the example from the official Atari BLiTTER documentation:

NFSR + FXSR: Source  |...aaaaaaaaaaaaa|bbbbbbbbbbbbbbb.|
SKEW = 15:   Dest    |..aaaaaaaaaaaaab|bbbbbbbbbbbbbb..|

Let’s consider each copy and its result. Before the first action is taken by the BLiTTER, source and target look like this:

No action:   Source  |...aaaaaaaaaaaaa|bbbbbbbbbbbbbbb.|
yet          Dest    |................|................|
Blitbuffer:          |................|................|

Now, starting the BLiTTER makes it “force extra source read”:

Extra src.   Source  |...aaaaaaaaaaaaa|bbbbbbbbbbbbbbb.|
read         Dest    |................|................|
Blitbuffer:          |...aaaaaaaaaaaaa|................|

Having done that, the BLiTTER will now perform the first copy by reading the next word from the source:

First read   Source  |...aaaaaaaaaaaaa|bbbbbbbbbbbbbbb.|
             Dest    |................|................|
Blitbuffer:          |...aaaaaaaaaaaaa|bbbbbbbbbbbbbbb.|

Now, using a SKEW value of 15, it will write back data from the upper 16 bits mainly

First write  Source  |...aaaaaaaaaaaaa|bbbbbbbbbbbbbbb.|
             Dest    |..aaaaaaaaaaaaab|................|
Blitbuffer:          |...aaaaaaaaaaaaa|bbbbbbbbbbbbbbb.|

Internally, the BLiTTER updates the upper 16 bits

Refreshing  Source   |...aaaaaaaaaaaaa|bbbbbbbbbbbbbbb.|
            Dest     |..aaaaaaaaaaaaab|................|
Blitbuffer:          |bbbbbbbbbbbbbbb.|bbbbbbbbbbbbbbb.|

but will not READ any data due to the NFSR set. Instead, it will only read out the upper 16-Bits mainly (mind the SKEW) and write back the following value …

Writing    Source    |...aaaaaaaaaaaaa|bbbbbbbbbbbbbbb.|
bad ENDMASK Dest     |..aaaaaaaaaaaaab|bbbbbbbbbbbbbb.b|
Blitbuffer:          |bbbbbbbbbbbbbbb.|bbbbbbbbbbbbbbb.|

but using a properly configured ENDMASK3, the lowest bit(s) should be masked out, therefore correctly generating

Writing   Source     |...aaaaaaaaaaaaa|bbbbbbbbbbbbbbb.|
good ENDM. Dest      |..aaaaaaaaaaaaab|bbbbbbbbbbbbbb..|
Blitbuffer:          |bbbbbbbbbbbbbbb.|bbbbbbbbbbbbbbb.|
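The walkthrough above can be replayed with a small model of a single line. This is a sketch under the assumptions of this chapter (HOP = 2, OP = 3, a zeroed buffer at the start); `blit_line` is a name invented for the example, and the rule that a one-word line uses ENDMASK 1 is an assumption of the sketch.

```python
def blit_line(source, dest, x_count, skew, fxsr, nfsr, endmasks):
    """Model one line of a BLiT and return the resulting destination words."""
    src = list(source)
    out = list(dest)
    upper = lower = 0
    if fxsr:
        lower = src.pop(0) & 0xFFFF           # extra read before the first copy
    for i in range(x_count):
        upper = lower                         # refresh the Overlap/Skew half
        if not (nfsr and i == x_count - 1):
            lower = src.pop(0) & 0xFFFF       # normal source read
        data = (((upper << 16) | lower) >> skew) & 0xFFFF
        if i == 0:
            mask = endmasks[0]                # ENDMASK 1: first word of the line
        elif i == x_count - 1:
            mask = endmasks[2]                # ENDMASK 3: last word of the line
        else:
            mask = endmasks[1]                # ENDMASK 2: everything in between
        out[i] = (data & mask) | (out[i] & ~mask & 0xFFFF)
    return out
```

Using $1555 for the “a” word and $FFFE for the “b” word, `blit_line([0x1555, 0xFFFE], [0, 0], 2, 15, True, True, (0x3FFF, 0xFFFF, 0xFFFC))` yields $2AAB, $FFFC, i.e. the source shifted left by one bit. With ENDMASK 3 left at $FFFF, the last word becomes $FFFD, which is the stray bit shown in the “bad ENDMASK” step.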

3.g.) Back to BLiTTER sprites - How to mask them ?

“Real” Software sprites require to be masked, meaning that all bits “used” by the sprite (unequal to a dedicated transparent colour, commonly zero) are being cleared before the sprite is being copied onto the target buffer while all bits not used by the sprite (equal to a dedicated transparent colour, commonly zero) have no influence at all on the background the sprite is being copied over.

This is usually done by using a so-called mask which is AND-combined with the background. The mask is an inverse of the sprite’s silhouette, meaning that all bits of the mask are “0” where the sprite features a colour differing from the dedicated transparent colour and that all bits of the mask are “1” where the sprite features the dedicated transparent colour.

The mask can be generated prior to the sprite generation or stored along with the sprite data.

Actually, the BLiTTER can generate the mask by itself but it requires as many passes as bitplanes are involved and it requires the transparent colour to be “0”: You fill one buffer with ones and then copy the sprite data of each bitplane into it using the “NOT Source AND Target” OP-mode, so that a mask bit is cleared as soon as any bitplane has a bit set there. The resulting buffer can be copied to the screen in AND-OP-mode for all bitplanes involved.

(Actually, it can be used for transparent colours unequal to zero, but then additional operations might be required).
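What the mask has to contain, and how it is applied, can be sketched for one group of 16 pixels. The names `build_mask` and `draw_masked` are invented for this illustration, and the transparent colour is assumed to be 0.

```python
def build_mask(planes):
    """Inverse silhouette: a bit is 0 wherever any bitplane of the sprite is set."""
    silhouette = 0
    for plane in planes:
        silhouette |= plane & 0xFFFF
    return silhouette ^ 0xFFFF


def draw_masked(background, planes, mask):
    """AND the mask into every background plane, then OR in the sprite data."""
    return [(bg & mask) | (plane & 0xFFFF) for bg, plane in zip(background, planes)]
```

A sprite with a single pixel of colour 5 at the left edge has the planes $8000, $0000, $8000, $0000 and therefore the mask $7FFF.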

There is a slightly restricted but much much faster way of getting a mask to be AND-combined with the background:

If you can afford to restrict your sprite to 8 colours and can arrange the colours in a way that it only uses colours 8 to 15, you get a mask for free. In this case, bitplane 3 is set for every pixel differing from the transparent colour “0” and therefore represents a perfect mask. Simply copy it over the lower 3 bitplanes in the “NOT Source AND Target” OP-Mode to correctly mask your sprite.
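The same group of 16 pixels, this time with bitplane 3 doing the masking. This is a sketch with an invented function name; it assumes the sprite only uses colours 8..15 plus the transparent colour 0, so bitplane 3 is exactly the silhouette.

```python
def draw_with_plane3_mask(background, sprite):
    """background, sprite: four words each, bitplanes 0..3 of one 16-pixel group."""
    silhouette = sprite[3] & 0xFFFF
    result = []
    for plane in range(3):
        # NOT Source AND Target with bitplane 3 as source clears the background,
        # then Source OR Target puts the sprite plane on top.
        cleared = background[plane] & (silhouette ^ 0xFFFF)
        result.append(cleared | (sprite[plane] & 0xFFFF))
    # Bitplane 3 itself is set wherever the sprite is, so a plain OR is enough.
    result.append(background[3] | silhouette)
    return result
```

Drawing one pixel of colour 8 over a background of colour 15 leaves colour 8 at that pixel and colour 15 everywhere else.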

Small sprites can also be masked using the ENDMASK registers. To do so, the appropriate mask for a single line needs to be copied to the ENDMASK registers, then, using an X-Increment of 2, all words for this line for all bitplanes are being masked. Using a suitable negative Y-increment, the BLiTTER can automatically jump back after having manipulated this line. The procedure needs to be repeated for every line (Thanks to RA for this trick!).

3.h.) Sprites of fixed horizontal size (multiples of 16)

This is the easiest case as long as the sprite data is left-aligned and that is usually the case. The first task is to check whether a SKEW-value of 0 can be used or not by AND-combining the x-coordinate of the sprite with 15. If the result is 0, no shifting is needed and the sprite can be copied onto the screen bitplane by bitplane, using either a pregenerated mask or bitplane 3 as mask as described above (Actually, if the SKEW-value is 0, all bitplanes can be copied in one linear go, but the routine will have to be separated completely from the case when the SKEW-value differs from 0, which is kind of odd).

In case the SKEW-value turns out to differ from 0, there are basically two more things that need to be done: Generate decent ENDMASKs and correct the number of words to be copied per line, including setting the NFSR Bit and correcting the Y increments. ENDMASK 1 is easy to generate: Simply shift $FFFF to the right by the SKEW-value (do not use ASR since it might shift in ONEs). Then invert all bits of the result and use that as ENDMASK 3. ENDMASK 2 should be all ones anyhow.

Since your sprite’s size in X is a multiple of 16, setting the NFSR bit is required since bits will be shifted out of the last “copy” as soon as the SKEW value exceeds zero. This also means that there is one more word to be copied per line, so add one to the X-Count register, but since the source is not being incremented after the last copy while the destination address is, correct source and destination Y increments accordingly.

Then, copy the sprite data bitplane by bitplane (which is necessary now since the BLiTTER would mix up bitplane data by shifting if you tried to copy all bitplanes at once for a SKEW value other than zero), using either a pregenerated mask or bitplane 3 if you were using the trick explained in the previous chapter.
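The register values for this case can be derived as follows. It is a minimal sketch with a made-up helper name; it only covers the ENDMASKs, the NFSR bit and the X-Count, and leaves the Y increments out since they depend on the buffer layout.

```python
def fixed_width_setup(width_words, x):
    """Settings for a left-aligned sprite whose width is width_words * 16 pixels."""
    skew = x & 15
    if skew == 0:
        return {"skew": 0, "nfsr": False, "x_count": width_words,
                "endmask1": 0xFFFF, "endmask2": 0xFFFF, "endmask3": 0xFFFF}
    endmask1 = 0xFFFF >> skew            # logical shift, no ones shifted in
    endmask3 = endmask1 ^ 0xFFFF         # inverse of ENDMASK 1
    return {"skew": skew, "nfsr": True, "x_count": width_words + 1,
            "endmask1": endmask1, "endmask2": 0xFFFF, "endmask3": endmask3}
```

For a 32 pixels wide sprite at x = 19 this gives SKEW 3, ENDMASK 1 = $1FFF, ENDMASK 3 = $E000, NFSR set and an X-Count of 3.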

3.i.) Sprites of random horizontal size

Now this is tricky because generating ENDMASKs is quite confusing for this case. While ENDMASK2 can always be set to “All Ones” - it’s only used if there are more than 32 bits to be BLiT and then it covers the middle which is copied fully always - ENDMASK1 and 3 can be separated into 2 major cases:

  • Copy only affects leftmost 16 bits
  • Copy affects more than the leftmost 16 bits

However, it’s not sufficient to simply check the width of the buffer to copy. If the data is being shifted, even a BLiT of 2 bits width can require ENDMASK1 and ENDMASK3 to be set properly if being shifted by 15 bits! Similarly, setting or unsetting NFSR is equally difficult. If, for example, a block of 18 bits width is being copied with a shift count of up to 14, no NFSR is needed - 18 bits always require 2 words to be read which are being covered by ENDMASK1 and 3. But, if the shift count reaches 15, one overlap bit shows up, 3 words need to be written while only 2 are being read, hence, NFSR is required.

Actually, to generate all this in realtime is far too complicated so that packing all information in 2 tables is the best way to deal with it:

  a.) A table for 0..15 shifts for less than 17 bits to be copied,
  b.) a table for 0..15 shifts for more than 16 bits to be copied.

The table is attached to this document for everybody to use.
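How such table entries could be generated is sketched below. This is not the attached table itself: `random_width_setup` and `build_table` are invented names, the source data is assumed to be left-aligned, and the rule that a one-word line is masked by ENDMASK 1 alone is an assumption of this sketch.

```python
def random_width_setup(width_bits, skew):
    """ENDMASKs, X-Count and NFSR for a left-aligned block of width_bits pixels."""
    words_read = (width_bits + 15) // 16
    words_written = (width_bits + skew + 15) // 16
    endmask1 = 0xFFFF >> skew
    rest = (width_bits + skew) % 16          # pixels used in the last word
    endmask3 = 0xFFFF if rest == 0 else (0xFFFF << (16 - rest)) & 0xFFFF
    if words_written == 1:
        endmask1 &= endmask3                 # both edges fall into the same word
    return {"x_count": words_written,
            "nfsr": words_written > words_read,
            "endmask1": endmask1, "endmask2": 0xFFFF, "endmask3": endmask3}


def build_table(width_bits):
    """One entry per shift count 0..15 for the given width."""
    return [random_width_setup(width_bits, skew) for skew in range(16)]
```

For a block of 18 bits, `build_table(18)` has NFSR unset for the shifts 0..14 and set for 15, matching the example above.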

3.j.) Speed issues: Hog- vs. Blit-Mode

As you remember, the blit-register $FFFF8A3C incorporates the bit 6 that decides whether the BLiTTER runs in “hog”-mode (Bit 6 set) or in the so-called “blit”-mode (Bit 6 unset) also known as cooperative mode. What’s the difference and when to use what ? In “hog”-mode, the BLiTTER will not release the bus after having been started until the copy is completely finished. While the BLiTTER will, of course, share the bus with the Shifter (incl. DMA-sound) in “hog”-mode, the CPU has no chance of accessing the bus and will be stalled for as long as the BLiTTER takes to complete the copy.

In “blit”-mode, the BLiTTER will release the bus after 64 cycles and will reserve the bus again after 64 more cycles - which is why this mode is also nicknamed “cooperative” mode. This allows the CPU to get some work done while the BLiTTER is busy but will, naturally, slow down CPU AND BLiTTER visibly.

Which one is better ?

None, but each has its own advantages and disadvantages.

In “hog”-mode, the BLiTTER will not waste cycles reserving the bus (up to 7 cycles) and therefore run as quickly as possible. Using the timing table quoted in the first chapter it is possible to calculate the time after which the BLiTTER will release the bus again. However, during this time, the CPU will not have any access to the bus, not even for highly critical interrupts, potentially making it miss some interrupts that occurred.

In “blit”-mode the CPU will gain access after 64 cycles for 64 cycles, but the BLiTTER will have to reserve the bus afterwards and therefore delay its copying by up to 7 cycles per turn. Nevertheless, the CPU will have a chance of reacting to interrupts with a delay of 64 cycles but the interrupt service routine should finish in less than 64 cycles, otherwise it is potentially stalled by the BLiTTER again.
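To get a feeling for the difference, the numbers quoted above can be put into a rough estimate. The following C sketch is only a model built from the figures in this text (64 cycle slices, a reservation cost passed in as `arb`). The function names are invented for this illustration and the real hardware timing may well deviate.

```c
#include <stdint.h>

/* work: bus cycles the BLiTTER needs for the copy itself
 * arb:  cycles lost per bus reservation (the text quotes up to 7) */
static uint32_t hog_cycles(uint32_t work, uint32_t arb)
{
    return work + arb;                   /* the bus is reserved only once */
}

static uint32_t blit_mode_cycles(uint32_t work, uint32_t arb)
{
    uint32_t turns = (work + 63u) / 64u; /* number of 64 cycle slices     */
    if (turns == 0u)
        return 0u;
    /* every slice pays the reservation, the CPU gets 64 cycles in between */
    return work + turns * arb + (turns - 1u) * 64u;
}
```

For a copy needing 640 bus cycles and the worst case of 7 cycles per reservation, this model yields 647 cycles in “hog”-mode and 1286 cycles in “blit”-mode.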

There is an easy way to speed up the BLiTTER in “Blit mode”.

To understand how, it is necessary to know that the BLiTTER latches the register $FFFF8A3C internally and that any write access to this register will update the latch.

To speed it up in “blit”-mode, do the following:

    or.b    #$80,$FFFF8A3C.w   ; start the BLiTTER
loop:
    bset.b  #7,$FFFF8A3C.w     ; (re)start the BLiTTER
    nop                        ; BLiTTER will need a few cycles
    bne.s   loop               ; Loop if register shows "busy"

The example above is officially advertised by Atari and works the following way: The first “or” instruction starts the BLiTTER more or less immediately. The “bset” instruction tests (i.e. reads) bit 7 of the register, sets the Z-bit in the CCR of the 680x0 processor accordingly and then sets the bit again. If the BLiTTER has been idle internally, meaning that bit 7 of the internal register was not set, this write updates the internal latch and makes the BLiTTER request bus access again. Updating the internal latch as well as requesting the bus takes some time, so the “nop” will be executed in any case. If bit 7 was set, the “bne.s” will branch to the loop label. If bit 7 was unset, the BLiTTER has finished the copy and setting the “busy”-bit again had no effect. The instruction in between the “bset.b” and the “bne.s” does not have to be a NOP of course. It can be any useful instruction that can be executed before the BLiTTER is assigned bus access. The implementation above will speed up the BLiTTER to roughly 90% of “hog”-mode performance and still allow the CPU to react to interrupts within 64 cycles of delay.

In the 1990 revision of its BLiTTER manual, Atari changed the code quoted above as follows:

loop:
    tas     $FFFF8A3C.w        ; (re)start the BLiTTER
    nop                        ; BLITTER will need a few cycles
    bmi.s   loop               ; Loop if register shows "busy"

If, however, an interrupt service routine cannot be completed in 64 cycles and should not be stalled by the BLiTTER again, Atari states that it is possible to “stall” the BLiTTER by writing bit 7 to zero. Atari points out in the official documentation, though, that the register needs to be restored to its original state before end-of-interrupt is reached:

    move.b  $FFFF8A3C.w,-(sp)  ; Write register to stack
    bclr.b  #7,$FFFF8A3C.w     ; unset busy bit
    nop                        ; BLiTTER might interfere here

    move.b  (sp)+,$FFFF8A3C.w  ; restore original register

Unfortunately, this does not work at all and is not mentioned in any other documentation regarding the BLiTTER I have read so far. The problem seems to originate from the internal state handling: The CPU reads a “1” in the BLiT-busy bit when reading out the register, even though internally it is a “0” because the BLiTTER is pausing for 64 cycles. Writing a “0” will not have any effect and will not stop the internal state handler from setting the bit back to “1” internally after 64 cycles. The BLiTTER itself will only write a “0” once it is done copying data (i.e. the line counter reaches “0”). There does not seem to be any other way of stalling the BLiTTER on request.

So, if parts of the code need to be protected from potential BLiTTER interference, there is only one decent way of doing so: Use the BLiTTER in hog mode but only engage small segments at a time. Since the BLiTTER does not reset any of its registers except for the X Count register, the BLiTTER can easily be used for subsequent copies by just setting Y Count as desired (and Line Number if needed) and start the BLiTTER manually to engage the next copy.
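The bookkeeping for such segmented copies is simple. Here is a minimal sketch in C of the slicing logic only. The helper `next_slice` is hypothetical, and writing Y Count and starting the BLiTTER is left to the caller.

```c
/* Returns the Y Count to program for the next slice and updates *remaining.
 * A return value of 0 means the whole copy is done.
 * max_lines is the largest number of lines the CPU is willing to be
 * stalled for in one go and must be greater than 0. */
static unsigned next_slice(unsigned *remaining, unsigned max_lines)
{
    unsigned lines = (*remaining < max_lines) ? *remaining : max_lines;
    *remaining -= lines;
    return lines;
}
```

A caller would loop on `next_slice()`, write each non-zero result to Y Count, start the BLiTTER in hog mode and run the code that must not be disturbed in between two slices.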

3.k.) More speed issues: Wasting time?

It might often not be possible to organize CPU code around BLiTTER operations, but if you do, you might gain a lot of speed by placing the right instructions directly in front of a BLiTTER operation. As the BLiTTER might require a couple of cycles before actually doing anything, the CPU usually has a chance of loading the next instruction before the BLiTTER hogs the bus. If this instruction requires no additional bus access, it will be carried out in parallel to the BLiTTER operation, making the Atari ST/STE a dual-processor system for this short period.

Naturally, the speed gain is higher if the instruction being carried out by the CPU is lengthy, ideally a multiplication or division. If using registers only (which means the instruction does not require additional bus cycles to fetch data), the CPU can carry out this instruction partially or completely in the “background” while the BLiTTER is busy. The code will have to look something like this:

    move.w  dividend,D5        ; load value to be divided
    ext.l   D5                 ; divs expects a 32 bit dividend
    move.w  divisor,D6         ; load value to divide by
    or.b    #$80,$FFFF8A3C.w   ; start the BLiTTER
    divs    D6,D5              ; will run in parallel!

Again, thanks to RA for figuring this one out.

Naturally, on a 68020 or any subsequent 680X0, you can also try to “lock” the CPU into accessing the cache only so that it does not require bus accesses while the BLiTTER is busy. To do so, you need to make sure the cache “locks” the lines containing the code you want to run while the BLiTTER is busy and then branch to exactly this memory region before the BLiTTER is granted bus access.

On a 68020, you will run into the problem that the code being executed must not write any result data to memory because the 68020 has no data cache and will therefore attempt to access the bus. On the 68030, you can buffer some data in its data cache, but then you must make sure the cache has enough “empty” space to store the data. Otherwise, it will try to write back lines to memory, which would stall the cache because the bus is granted to the BLiTTER. The easiest way to do so is to flush the data cache immediately before starting the BLiTTER.

This, on the other hand, is not a suitable way on a 68040 or 68060 as these CPUs have incredibly large caches (probably not by today’s standards, but for their time). Simply flushing the whole data cache might not only stall the CPU for many cycles while the cache is updating itself but might also remove data from the cache that you wanted your code to operate on outside of the BLiTTER context.

3.l.) Official Atari code

As pointed out by ggn, the official Atari BLiTTER programming manual, June 17th 1987, The Atari Corporation, Sunnyvale, California, has sample BLiTTER code from page 17 on. The original document as well as its revision from Jan 25th 1990 can be found under

http://dev-docs.atariforge.org

hosted by Atari Users Network!

The sample code is also attached to this document at the very end. Using a text editor, it should be easy to extract it into an ASCII .S-file to load into any assembly editor.

4.) BLiTTER modified and generated code

Self-modifying and self-generating code are most certainly the queens of demo-effect fundamentals. With one exception though: If the code to be generated or modified becomes too large, the concept fails: The CPU takes too much time modifying/generating bulky code if it, afterwards, needs to run the modified/generated code as well. Usually, the code is therefore only generated/modified for a subset of necessary code, for example for a single line for which the code can quickly be modified or generated, and this code is run multiple times. If this is, however, no longer possible because subsets of the code to be run multiple times can no longer be defined, the BLiTTER can come in handy and generate/modify code.

4.a.) BLiTTER modified code

Let’s say you have a routine to generate a rectangular graphics block and it is completely unrolled and therefore bulky. Now data needs to be filled in for a sequence of instructions that can be read from a predefined block. Naturally, the code contains a bit of “overhead”, for example line break generation and similar, that would need to be skipped when filling in data.

No problem using X and Y increments wisely.

All you need to do is set up a table with the data to be read by the BLiTTER and set X and Y source increments accordingly. Please note that the BLiTTER can’t read bytes, so if you want to copy data in packages of less than 16 bits, you will have to arrange your source table so that it is padded to 16-bit boundaries per copy.

The target for the BLiTTER is the unrolled loop and the X and Y destination increments now need to be selected carefully. Because the 680x0 instructions are minimally 16 bits wide, the BLiTTER can modify each instruction easily and by using appropriate ENDMASKs, it is also easy to ensure that only the desired bits of each instruction are being modified.

An example: Let’s say, you have groups of 8 instructions which are 24 Bytes in total, every 2nd instruction is supposed to be modified, then an instruction block of 8 Bytes needs to be skipped. For such a routine, the destination X increment would be 6 Bytes and the destination Y increment would be 6 + 8 Bytes due to the fact that the BLiTTER does not increment X on the last copy of a line. The X counter would be set to 4 to go through the inner block, the Y counter needs to be set to cover the whole unrolled code.
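To double-check a set of increments before letting the BLiTTER loose on your code, the address stepping can be replayed on paper or in a few lines of C. The sketch below is a plain model of the stepping rule described above (X increment between words, Y increment after the last word of a line). The function name is made up for this illustration.

```c
#include <stddef.h>

/* Writes the byte offsets the BLiTTER would touch into out[] and
 * returns how many were produced (at most max). */
static size_t blit_dest_offsets(int x_inc, int y_inc, int x_count, int y_count,
                                long *out, size_t max)
{
    size_t n = 0;
    long offset = 0;

    for (int y = 0; y < y_count; y++) {
        for (int x = 0; x < x_count; x++) {
            if (n == max)
                return n;
            out[n++] = offset;
            /* X increment between words, Y increment after the last one */
            offset += (x == x_count - 1) ? y_inc : x_inc;
        }
    }
    return n;
}
```

With the values from the example (X increment 6, Y increment 14, X count 4), the first two lines touch the offsets 0, 6, 12, 18 and then 32, 38, 44, 50, which is exactly 24 + 8 bytes further on.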

And there you go. BLiTTER modified code. Simple as that.

4.b.) BLiTTER generated code

Even though it’s rather unlikely to generate code by the BLiTTER, there’s still a fairly special case that might come in handy. Let’s say you have a huge block of code in which some instructions need to be generated on short notice. A compact table defines which instruction out of 16 in total needs to be put into certain places and the target addresses are in some way regular, ideally in a rectangular block of the code to be modified.

The trick is now to load all potential 16 instructions into the halftone pattern, switch the BLiTTER to “halftone only” HOP mode and activate SMUDGE mode. Now set up the source and destination increments as required for reading out the table and writing into the code block and, in case the source table is premultiplied by a constant that is a power of 2, also set the SKEW value accordingly so that the data read in covers the range 0..15 after SKEWing.

Using the SMUDGE mode will now make the BLiTTER read in a word from the source table, shift it down to the range 0..15 and use the result as an index for the halftone pattern. The line being indexed is then written to the target buffer.
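Expressed as a model in C, the lookup described above boils down to very little. This is only a sketch of the behaviour as explained in this text, without FXSR/NFSR handling. The function name and the `code_stride_words` parameter (standing in for the destination increment) are assumptions made for this illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* table[i] holds an instruction index, premultiplied by (1 << skew).
 * halftone[] holds the 16 candidate instruction words. */
static void smudge_generate(const uint16_t halftone[16], const uint16_t *table,
                            size_t count, unsigned skew,
                            uint16_t *code, size_t code_stride_words)
{
    for (size_t i = 0; i < count; i++) {
        unsigned line = (table[i] >> skew) & 15u;  /* SKEW, then bits 3..0 */
        code[i * code_stride_words] = halftone[line];
    }
}
```

Loading the halftone pattern with, say, $4E71 (nop), $4E75 (rts) and $7000 (moveq #0,d0) and feeding a table of indices multiplied by 4 with a SKEW of 2 places the selected opcodes at every n-th word of the code block.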

Even though this method is quite restricted, it makes it fairly easy to generate for example line breaks plus minor overhead code such as correcting address registers into large blocks of screen related code.

5.) Special Use in Chunky Graphics Mode

5.a.) Speedup in 16 (or less) colours

So far, the BLiTTER has mainly been used to manipulate bitplane graphics for which it had been designed in the first place. So how to use the BLiTTER sensibly in chunky graphics mode when not doing saturated adds, increments or decrements ?

Often, it’s not worth the effort: When copying data, meaning that HOP is set to “2” while OP is set to “3”, the BLiTTER takes 8 cycles per word (See table in the first section). When using the 68000 to copy words using “movem.w”, it takes 12 cycles + 4 per register being addressed by the instruction, meaning that it takes 8 cycles per word when using “movem.w” with 3 registers and a mere 5.5 cycles per word when using “movem.w” with 8 registers.

There is, however, a simple trick to speed up the BLiTTER when a maximum of 16 colours per pixel is used: the SMUDGE mode, combined with a well configured halftone pattern, HOP mode “1” and OP mode “3”. The halftone pattern needs to contain all 16 possible target values (most likely values 0..60 in multiples of 4) and the SKEW value needs to be set to the exact number of shifts necessary to scale every pixel value read by the BLiTTER down to cover bits 0..3 (most likely 2). As a result, the BLiTTER will read a word, shift it down by the given number of bits, use the resulting bits 0..3 as a lookup index for the halftone pattern and write the value from the halftone pattern to the destination.

Looking at the clock cycle table in the first section, the BLiTTER requires a mere 4 cycles per copy then!

Naturally, the CPU can still outspeed the BLiTTER if the chunky buffer is byte based because the CPU will copy 2 bytes at once while the BLiTTER, using the smudge mode, will only copy 1 byte effectively. But, if manipulation of the data is needed while copying (such as saturated incrementing/decrementing, shifting or application of ENDMASKs) this concept is impossible to beat by the CPU!

Again, it’s not necessary to have a word-based chunky buffer for this concept. If separated into 2 individual runs, one referring to the high byte and one referring to the low byte (each with its own SKEW, halftone content and ENDMASKs), it can be applied to byte buffers just the same, introducing minor overhead though.

5.b.) The holy grail (?) - BLiTTER C2P

So, can the BLiTTER perform chunky-to-planar conversion ? It can. Can it convert faster than the CPU ? Not necessarily, but it does have its advantages, as will be laid out at the end of this chapter.

Actually, BLiTTER c2p is anything but complicated. Much like the CPU-based c2p uses tables to convert chunky pixels into a fragment of a planar word, the BLiTTER can use the halftone pattern. Again, the colour of the chunky pixel just having been read is being used as an index by employing the SMUDGE mode and an intelligent SKEW.

Let’s go through it step by step.

First of all, we need to prepare the halftone pattern. Since the BLiTTER can only write 16 bits at a time, which is exactly the size of one bitplane word, we need to cover all 4 bitplanes of the Atari ST(E) in 4 separate runs. Naturally, for these separate runs, the halftone pattern needs to be constructed individually:

Index    Plane 0   Plane 1   Plane 2   Plane 3
  0       $0000     $0000     $0000     $0000
  1       $FFFF     $0000     $0000     $0000
  2       $0000     $FFFF     $0000     $0000
  3       $FFFF     $FFFF     $0000     $0000
  4       $0000     $0000     $FFFF     $0000
  5       $FFFF     $0000     $FFFF     $0000
  6       $0000     $FFFF     $FFFF     $0000
  7       $FFFF     $FFFF     $FFFF     $0000      
  8       $0000     $0000     $0000     $FFFF
  9       $FFFF     $0000     $0000     $FFFF
 10       $0000     $FFFF     $0000     $FFFF
 11       $FFFF     $FFFF     $0000     $FFFF
 12       $0000     $0000     $FFFF     $FFFF
 13       $FFFF     $0000     $FFFF     $FFFF
 14       $0000     $FFFF     $FFFF     $FFFF
 15       $FFFF     $FFFF     $FFFF     $FFFF      

Now to make sure the right pixel is being used by the BLiTTER to be converted, the SKEW mode needs to be set accordingly. For example, for a compact nibble buffer with 4 chunky pixels per word:

Bit    15 14 13 12 11 10  9  8    7  6  5  4  3  2  1  0
Pixel   D  D  D  D  C  C  C  C    B  B  B  B  A  A  A  A
       |_________| |_________|   |_________| |_________|
         leftmost     left          right     rightmost
           pixel      pixel         pixel       pixel

So to convert the leftmost pixel in the above example, we need to set SKEW to 12, for the left pixel to 8, for the right pixel to 4 and for the rightmost pixel to 0.

To make sure the BLiTTER will not overwrite previously converted pixels when BLiTTING subsequent runs, the ENDMASKs also need to be set accordingly, using the popular double pixel mode:

Leftmost pixel:   All 3 ENDMASKs to %11000000 00000000
Left pixel:       All 3 ENDMASKs to %00110000 00000000
Right pixel:      All 3 ENDMASKs to %00001100 00000000
Rightmost pixel:  All 3 ENDMASKs to %00000011 00000000

Obviously, this is not sufficient to convert all pixels in the target word, so we need to schedule 4 additional runs using the SKEW values from above but these ENDMASKs:

Leftmost pixel:   All 3 ENDMASKs to %00000000 11000000
Left pixel:       All 3 ENDMASKs to %00000000 00110000
Right pixel:      All 3 ENDMASKs to %00000000 00001100
Rightmost pixel:  All 3 ENDMASKs to %00000000 00000011

And finally, we need to set HOP and OP accordingly. Since we want to write back Halftone content only, HOP needs to be set to “1”. Since we want the 2 bits we set per run to replace existing content in these 2 bits, we need to set OP to “3”.

Now, all the CPU needs to do is to control how the BLiTTER works. Ideally, the BLiTTER steps through each of the 8 target pixel positions (changing SKEW and ENDMASKs) before switching to the next bitplane (loading the next halftone pattern, resetting SKEW and ENDMASKs).
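The whole procedure can be replayed as a small C model. Mind that this is a sketch of the scheme described above, not BLiTTER code: it handles a single destination word and loops over the 8 runs, whereas the real BLiTTER would process the whole buffer once per run. The function names are invented for this illustration.

```c
#include <stdint.h>

/* Halftone line i for bitplane `plane`: all ones if that bit of i is set. */
static uint16_t c2p_halftone(unsigned i, unsigned plane)
{
    return ((i >> plane) & 1u) ? 0xFFFFu : 0x0000u;
}

/* Converts two nibble-packed source words (8 chunky pixels) into one
 * double-pixel destination word of the given bitplane, run by run. */
static uint16_t c2p_word(const uint16_t src[2], unsigned plane)
{
    uint16_t dst = 0;

    for (unsigned run = 0; run < 8; run++) {
        unsigned skew = 12u - 4u * (run % 4u);              /* 12, 8, 4, 0  */
        uint16_t mask = (uint16_t)(0xC000u >> (2u * run));  /* ENDMASKs     */
        unsigned line = (src[run / 4u] >> skew) & 15u;      /* SMUDGE index */

        /* OP 3 with ENDMASK: replace only the two masked bits */
        dst = (uint16_t)((dst & ~mask) | (c2p_halftone(line, plane) & mask));
    }
    return dst;
}
```

Feeding the two source words $1234 and $5678 (chunky pixels 1 to 8) yields $CCCC for plane 0, $3C3C for plane 1, $03FC for plane 2 and $0003 for plane 3.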

Does it pay off ? Using a HOP-mode “1” and an OP-mode “3”, the BLiTTER needs 4 cycles for each BLiT. However, since the BLiTTER needs to write each bitplane individually, it requires 4 x 4 cycles for a single chunky pixel, which makes 16 cycles per pixel.

This is not necessarily faster than the CPU can do, even on a compact nibble buffer where an additional shift of 2 bits and clever usage of the c2p-table (which is then 256KB in size) are needed. Clever usage of movem.w and unrolled loops can bring even this kind of c2p conversion in the range of 15 cycles per pixel.

But the BLiTTER c2p does have advantages:

  • No bulky table needed
  • Compact nibble buffers can easily be used
  • Scalable as 3/2/1 bitplanes can be handled using this algorithm, which then also require only 3/4, 1/2 or 1/4 of the initial runtime.
  • Dither patterns can easily be used through the halftone pattern
  • By loading ENDMASKs properly, the target buffer does not need to be aligned on a word boundary but can be shifted to any even position

Of course, the BLiTTER c2p also comes with a set of disadvantages:

  • Limited to 16 colours per chunky pixel
  • Not necessarily faster than a CPU-based concept
  • No customisation possible which could increase performance
  • Max2p concept not applicable

Well, to be honest, the BLiTTER “could” theoretically also manage 8 bits per chunky pixel, but it would need to run twice per chunky pixel then, severely slowing it down while the CPU-based c2p concept only needs a larger table and 2 movep instructions per 4 pixels, which will make it slower, but not necessarily down to half the speed.

6.) Maximum 2 Planar (max2p) and the BLiTTER

What is the Maximum 2 Planar (max2p) or the recently updated Maximum 2 Planar Extreme (m2px) algorithm and how does it benefit from using the BLiTTER ?

The idea behind max2p (and m2px) is to use otherwise unused bits to store additional pixel information being interpreted while doing the chunky to planar conversion. A fairly conservative approach would be to add 2 “normal resolution” background pixels with a single chunky pixel, for example by locating the bitgroups like this:

Bit   15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
         |_________| |_________| |____________|
          left bgnd   right bgnd  chunky pixel

In the example above, there is also 1 overflow bit reserved so that an algorithm generating the chunky pixels could even add 2 pixel values of 4 bits each without risking to potentially overwrite background information.

And guess which chip is ideal to fill in either bits 11 to 14 or bits 10 to 7 or bits 6 to 2 without touching any other ? The BLiTTER by using SMUDGE, decently set up halftone patterns and decently set up ENDMASKs.

A similar approach is to combine 2 chunky pixels with a different colour information, for example a 2 bit alpha layer per pixel:

Bit   15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
            |_________| |___| |___| |_________|
           left chunky  left  right right chunky
                       alpha  alpha

This way, 2 chunky pixels can easily be converted at once while at the same time, additional colour information can be handled completely for free. And which is the best way to modify the 2 middle bits without touching either the left or right chunky pixel information ? Again, the BLiTTER by decently setting up SKEW, SMUDGE, halftone pattern and endmasks.

Maximum 2 Planar Extreme (m2px) goes one step beyond that. It exploits the fact that in almost all CPU-based c2p algorithms, the 1 byte static offset in indirect indexed addressing for the move/or when converting a pixel is unused:

movem.w      (A0)+,d0-d3    ;fetch 4 chunky word sized pixels
move.l  0(A1,d0.w),d0       ;convert leftmost pixel using table A1
or.l    0(A2,d1.w),d0       ;convert left pixel using table A2
or.l    0(A3,d2.w),d0       ;convert right pixel using table A3
or.l    0(A4,d3.w),d0       ;convert rightmost pixel using table A4
movep.l         d0,X(A5)    ;put to screen buffer

Why waste precious addressing capabilities if you can actually put 2 pixels of 3 bits colour depth in the static offsets ? Usually, c2p code is unrolled and even more often, it’s being generated so that the BLiTTER can easily fill in the static offsets:

movem.w         (A0)+,d0-d3 ;fetch some sort of 4 chunky pixels
move.l  l1r1(A1,d0.w),d0    ;combine with background and convert
or.l    l2r2(A2,d1.w),d0    ;combine with background and convert
or.l    l3r3(A3,d2.w),d0    ;combine with background and convert
or.l    l4r4(A4,d3.w),d0    ;combine with background and convert
movep.l            d0,X(A5) ;put to screen buffer

With l1, l2, l3 and l4 denoting a “left” background pixel while r1, r2, r3 and r4 refer to a “right” background pixel. If these are static, they can easily be placed in the generation of the code, otherwise, the BLiTTER will gladly update them easily, as the static offset is simply the last byte of the instruction and employs the following format:

Bit   7  6  5  4  3  2  1  0
     |______| |______|
   left bgnd. right bgnd.

This approach also leaves the chunky pixel(s) being read completely free to be used in any manner that suits the underlying effect, for example carrying 2 chunky pixels plus 2 bits of additional alpha layer information per pixel as explained above. Naturally, setting up the m2px-tables does involve some thinking then, but it’s possible.

How will BLiTTER code look when updating, for example, the background pixels in the generated m2px-code that looks a lot like the example code listed above:

First of all, to speed things up, the background should exist in the compressed 2 x 3 bit format explained above. Then the ENDMASKs need to be set up to not touch any other part of the instructions, e.g. all 3 to $00FC. If the compressed format is not available, the BLiTTER can create it for you easily - it will merely require 2 runs then, one to generate the left half using ENDMASKs $00E0, and one to generate the right half using ENDMASKs $001C. By cleverly setting up halftone patterns, SKEW and SMUDGE can easily be used to speed up the BLiTTER, too. Then, carefully count the amount of words between the words the BLiTTER needs to modify; this is the destination X increment. Also carefully count the words between the last and the first instruction the BLiTTER can update using the destination X increment calculated above (for example, where the movep.l is located). This is going to be the destination Y increment. Source X and Y increments depend on the source, naturally. NFSR and FXSR are not required but after each line, the BLiTTER will require updates on destination and maybe also source address. But that’s it.

BLiTTER modified code.

Easy as that.
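What the BLiTTER does to each extension word in this setup can be written down as a masked write. The following C sketch only models the bit handling using the $00FC mask quoted above. `m2px_pack` and `m2px_patch` are names invented for this illustration.

```c
#include <stdint.h>

#define M2PX_MASK 0x00FCu  /* ENDMASK quoted above: offset bits 7..2 */

/* Packs two 3 bit background pixels into the 2 x 3 bit format. */
static uint16_t m2px_pack(unsigned left, unsigned right)
{
    return (uint16_t)(((left & 7u) << 5) | ((right & 7u) << 2));
}

/* Masked write as the BLiTTER would perform it on one extension word:
 * only the bits covered by the ENDMASK are replaced. */
static uint16_t m2px_patch(uint16_t ext_word, unsigned left, unsigned right)
{
    return (uint16_t)((ext_word & ~M2PX_MASK)
                      | (m2px_pack(left, right) & M2PX_MASK));
}
```

Patching the extension word $1003 with a left pixel of 5 and a right pixel of 3 gives $10AF: the upper byte and the lowest 2 bits of the offset stay as they were.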

7.) Extremely restricted fixed point increments for the BLiTTER

As far as the explanations went so far, the BLiTTER only handles integer increments and offsets and, often enough, only even values as it is operating on 16 bit words all the time. So what do people refer to when they talk about zooming or maybe even rotozooming a graphics block using the BLiTTER ? Actually, the BLiTTER can sort of handle fixed point increments, simply by shifting the dot. The BLiTTER can only apply source or destination increments that are multiples of 2, but what if the next source pixel wasn’t 2 bytes away but 8 ? Then, an increment of 2 would not advance the source pointer by 1 pixel (or 2 pixels for a byte based buffer) but by 1/4 pixel, making the BLiTTER effectively operate on fixed point numbers. Naturally, this requires the source graphics block to contain every pixel in multiple instances, making it consume far more memory. Still, the precision is very limited and nowhere near, for example, the 8.8 bit fixed point format which is fairly popular on the 68000. Yet, it works well for small buffers.
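A tiny C model may help to picture this. It is a sketch under the assumption that every source pixel is stored `rep` times as a full word (rep = 4 matches the 8 byte example above). Both function names are made up for this illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Stores every source pixel `rep` times, one word each. */
static void expand_pixels(const uint16_t *src, size_t n, size_t rep,
                          uint16_t *out)
{
    for (size_t i = 0; i < n; i++)
        for (size_t r = 0; r < rep; r++)
            out[i * rep + r] = src[i];
}

/* Reads `count` words with a constant increment of `step_words` words,
 * i.e. a zoom step of step_words / rep source pixels per target pixel. */
static void zoom_read(const uint16_t *expanded, size_t step_words,
                      uint16_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = expanded[i * step_words];
}
```

With 3 source pixels expanded 4 times each, a step of 3 words corresponds to 0.75 source pixels per target pixel, so the first source pixel shows up twice in the output before the second and third one follow.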

8.) BLiTTER in true colour effects (thanks to Ray of .tSCc.)

So far, we only used the BLiTTER for effects typically done on the STE, either using bitplanes directly or cleverly arranging c2p-effects to take advantage of the BLiTTER.

Can it do anything useful on, for example, the Falcon in its true colour mode, too ?

Yes, it actually can.

Let’s take a closer look at the Falcon’s true colour pixel format:

Bit    15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
       |____________| |_______________| |____________|
            Red             Green           Blue

Let’s say, using just 4 bits for red, green and blue each would still be sufficient for the desired effect, the pixel format could be reduced to

Bit    15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
       |_________|    |_________|       |_________|
          Red            Green              Blue

This would actually limit the number of colours available to 4096 colours, but still, in a compact manner and all of them on one screen.

Naturally, these 4 bits can now easily be modified in any desired way using halftone patterns and the SMUDGE mode again, for example for fading down by 1 colour value. The halftone pattern could then be set up like this:

Line  0 - %0000 0 0000 00 0000 0 =  0 - 1 =  0 (underflow protect)
Line  1 - %0000 0 0000 00 0000 0 =  1 - 1 =  0
Line  2 - %0001 0 0001 00 0001 0 =  2 - 1 =  1
Line  3 - %0010 0 0010 00 0010 0 =  3 - 1 =  2
...
Line 15 - %1110 0 1110 00 1110 0 = 15 - 1 = 14

In each line, the original value in red, green and blue is decreased by one at the same time. But by using ENDMASKs in an intelligent way, the colour value not due for updating can easily be masked out:

 First BLiTTER run to decrease red   - Set ENDMASKs to $F000
Second BLiTTER run to decrease green - Set ENDMASKs to $0780
 Third BLiTTER run to decrease blue  - Set ENDMASKs to $001E

Also, SKEW values need to be set accordingly:

 First BLiTTER run to decrease red   - Set SKEW to 12
Second BLiTTER run to decrease green - Set SKEW to 7
 Third BLiTTER run to decrease blue  - Set SKEW to 1

And there you go. As each word costs merely 4 cycles to modify, it is still fairly quick (12 cycles for decreasing red, green and blue) and costs basically no memory for any table. It also allows the channels to be modified individually, for example by fading down red, fading up green and keeping blue as it is.
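The three runs can be checked with a small C model of a single pixel. Again, this is only a sketch of the behaviour described above, using the halftone layout, SKEW values and ENDMASKs from the tables. The function names are invented for this illustration.

```c
#include <stdint.h>

/* Halftone line i: (i - 1), clamped at 0, replicated into all 3 channels. */
static uint16_t fade_halftone(unsigned i)
{
    unsigned v = i ? i - 1u : 0u;
    return (uint16_t)((v << 12) | (v << 7) | (v << 1));
}

/* One masked SMUDGE pass over a single pixel word. */
static uint16_t fade_pass(uint16_t pixel, unsigned skew, uint16_t endmask)
{
    unsigned line = (pixel >> skew) & 15u;   /* SKEW, then bits 3..0 */
    return (uint16_t)((pixel & ~endmask) | (fade_halftone(line) & endmask));
}

/* Three passes: red, green and blue. */
static uint16_t fade_pixel(uint16_t pixel)
{
    pixel = fade_pass(pixel, 12, 0xF000u);
    pixel = fade_pass(pixel,  7, 0x0780u);
    pixel = fade_pass(pixel,  1, 0x001Eu);
    return pixel;
}
```

A pixel with red 5, green 0 and blue 15 ($501E) comes out as red 4, green 0 and blue 14 ($401C). Green stays at 0 thanks to the underflow protection in line 0.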

The above concept not only allows fading down/up quickly using the BLiTTER but also leaves 4 bits free to potentially implement a 4-bit alpha layer, right ? This can also be filled easily by the BLiTTER, using SMUDGE, a decently set up halftone pattern and appropriate SKEWing.

Even more obscure, you can even use the segmented pixel format shown above. By using SMUDGE mode and appropriately set up halftone data, the BLiTTER can produce the segmented data easily and by using the right ENDMASKs, the BLiTTER will even fill the segmented bits of the alpha layer into the correct position.

Other pixel modifications can be implemented similarly but as a rule of thumb, the BLiTTER will only outspeed the CPU if SMUDGE mode can be used. Just copying around true colour pixel data using the BLiTTER is usually not faster than doing it using the CPU.

9.) Other applications than graphics

Already in 1999, TAM of T.O.Y.S proposed writing a BLiTTER based MOD-player. It never came into existence, and thinking about it, the BLiTTER by itself would also be unable to re-scale the samples at a sufficient precision. However, spreading sampled data to simulate 4 or 8 DMA-channels would be easy to do by simply using source increments that differ from the destination increments.

It would be possible to have the DMA sound run over a given memory segment over and over again and update the sampled sound data using the BLiTTER directly after the DMA sound has played a certain part of it. This would ease up streaming sampled sound data from a slowish source, such as a disk drive.

d.) Common mistakes

The BLiTTER is usually not that hard to get used to, but a few mistakes are made regularly and, as the BLiTTER works independently of the CPU, it can become tricky to debug code that turns out not to work as planned. Here are a few:

? Starting the BLiTTER does something but it appears to be incomplete when the CPU tries to proceed operating on the data.
! Please note that in cooperative mode, the BLiTTER is not done yet when the CPU gets control over the bus again. Either use the trick proposed by Atari to accelerate the BLiTTER or use it in HOG mode.

? Tried but it still seems corrupted, especially the interrupt services are getting out of sync very often.
! Please note that in BLiT mode, the BLiTTER will always arbitrate for the bus after 64 cycles of idle time, no matter what the CPU is doing. It will interfere with the interrupt service routines. If the timing of the interrupt service routines is critical, the BLiTTER should only be used with great care and in HOG mode where appropriate.

? It seems to work, but after a while, the system seems to crash?
! Sounds like you’re not writing the number of lines correctly. Please note that this is one of those registers you ALWAYS need to update before starting the BLiTTER.

? Wah! My chunky graphics are corrupted by the BLiTTER!
! Please remember that the BLiTTER ONLY operates on even addresses. Even if your chunky graphics are using byte sized pixels, they must be located on an even address if you want the BLiTTER to operate on them correctly, as it does not allow any odd address to be set.

? Sprites work fine at certain positions but are screwed up at others
! Commonly, shifting sprites requires a lot of attention regarding the NFSR and FXSR bits, especially if the SKEW-value is greater than 0.

? Tried. Still looks screwed up.
! Please also note that as soon as NFSR or FXSR are set, the Y increments need to be revised. See explanation above.

? It seems to work, but there’s defective data to the left or right of the BLiTTER sprite every now and then.
! The BLiTTER never clears its internal data buffers so that if you’re using SKEW, NFSR and FXSR, outdated data remains in the buffer and will be displayed if not masked out by the ENDMASKs.

? My SMUDGE-mode effect only seems to use a few lines of the halftone pattern, but not all 16. Why ?
! Please note that the SMUDGE-mode selects the halftone-line by evaluating bits 3..0. You need to make sure the value read by the BLiTTER uses these 4 bits, if necessary, by shifting it down using the SKEW-register. The BLiTTER, by the way, doesn’t really care about bits 15 to 4 when using SMUDGE mode.

? I tried some of the Paradox demos on my PC with an emulator. They look badly screwed up and therefore, they suck. Learn how to code!
! It’s hardly our fault if emulators are not complete regarding the BLiTTER emulation. Ask your favourite emulator programmer to improve the BLiTTER emulation or run the demo on the original machine but don’t blame us.

? I read on de.wikipedia.org that you’re irrelevant.
! Yes, we were surprised about that, too. If the so-called “expert on the matter” - we’d rather call him a “has-been” - considers one of the few groups still doing things irrelevant, it makes you wonder what groups are left to be considered relevant.

e.) Compatibility

Programming the BLiTTER can massively accelerate your code, but what can you do to make sure the machine you program for has one ? Here’s a list:

  • Atari 520/1040ST, STf, STm, STfm, Mega and Mega ST: Commonly, these machines had no BLiTTER. Not even all Mega and Mega ST models had one initially; Atari merely mentioned in the owner’s manual that if the “Extras” menu of the desktop does not offer to activate a “Blitter”, the machine can be fitted with one at the local Atari dealer. Later Atari Mega and Mega ST models were shipped with one.

  • Atari STE and MegaSTE: These machines had a BLiTTER. All of them did. As it’s included in the so-called COMBEL chip, it’s present in all STE and MegaSTE. Please note that in the MegaSTE, BLiTTER timing is slightly different as the BLiTTER requires more idle cycles before arbitrating the bus. This may become critical for very tightly synchronized effects. As a result, starting the BLiTTER on a MegaSTE is delayed by one “nop” instruction of the CPU, which needs to be taken into account. Some senseless techno babble: The BLiTTER requires 4 to 7 CPU cycles to arbitrate the bus on a regular ST or STE, leaving the CPU enough time to finish a bus access. On the MegaSTE, the CPU accesses the bus through a cache, the internals of which Atari has unfortunately never disclosed. However, the cache controller is likely to organize the cache in “lines” of 16 bytes each. To either read or write a complete line in one go, the cache needs to do 8 subsequent accesses, reading or writing 16 bits at once. This implies at least 8 bus cycles, so it’s likely that Atari simply changed the minimum “slot” granted to the CPU/cache from 4 cycles on the ST/STE to 8 cycles on the MegaSTE, making the BLiTTER require 8 to 11 cycles to arbitrate the bus.

  • Atari TT: This model had no BLiTTER. Never.

  • Atari Falcon030: The Falcon030 had one in all configurations and, as a special bonus, the BLiTTER in the Falcon can be switched to either 8MHz or 16MHz by writing a “1” to bit 2 of $FFFF8007 for 16MHz or clearing bit 2 of $FFFF8007 for 8MHz.

  • Atari compatibles such as MedusaT40, Hades040/060 or Milan: None of these had a BLiTTER, as they were all aimed at running more advanced graphics cards which might have internal BLiTTER-like accelerators but were not accessible through a memory mapped interface.

  • Emulators: Most of the emulators focussed on GEM-compatibility such as GEMulator, WinSTon, TOS2WIN, STemulator etc. feature no BLiTTER-emulation, at least not on a hardware level. SainT does not support FXSR and NFSR, and halftone operations also seem to be completely malfunctioning; otherwise, simple blits seem to work. The most popular emulator, STEem, features most parts of the BLiTTER; however, halftone operations are restricted and SMUDGE-mode is not supported at all. Hatari supports all of the BLiTTER features and the timing is accurate, at least if you configure it to 8MHz CPU clock.
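For the Falcon030 entry in the list above, the clock switch boils down to a read-modify-write of one bit. The following C sketch only models the arithmetic on the byte read from $FFFF8007; the function and macro names are invented for this example, and leaving all other bits untouched is an assumption on the grounds that they are likely to control unrelated things.

```c
#include <stdint.h>

/* Bit 2 of the byte at $FFFF8007 selects the BLiTTER clock (see text). */
#define BLITTER_16MHZ_BIT 0x04u

/* current:     the byte as read from $FFFF8007
 * want_16mhz:  non-zero for 16MHz, zero for 8MHz
 * Returns the byte to write back, with only bit 2 changed. */
static uint8_t falcon_blitter_clock(uint8_t current, int want_16mhz)
{
    if (want_16mhz)
        return (uint8_t)(current | BLITTER_16MHZ_BIT);
    return (uint8_t)(current & (uint8_t)~BLITTER_16MHZ_BIT);
}
```

On the real machine the read and the write would of course be done in supervisor mode, as with any other hardware register.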

How do you find out at runtime whether the machine is equipped with a BLiTTER ? By calling XBIOS #64. This function accepts one parameter, the mode:

mode = -1: Do not alter XBIOS blit mode, just return current mode
mode = 0: Deactivate BLiTTER, return previous mode
mode = 1: Activate BLiTTER, return previous mode

The return value merely uses bit 0 to report the XBIOS blit mode and bit 1 to report whether a BLiTTER is available in hardware:

Blitmode = 0: No BLiTTER available and XBIOS is not using one
Blitmode = 1: No BLiTTER available but enabled by XBIOS (impossible)
Blitmode = 2: BLiTTER available but disabled by XBIOS
Blitmode = 3: BLiTTER available and enabled by XBIOS
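The two bits of the return value can be pulled apart as in the following C sketch. It only decodes a value that has already been returned by XBIOS #64; the struct and function names are made up for this example.

```c
#include <stdbool.h>

/* Decoded view of the XBIOS #64 return value (names are illustrative). */
struct blit_status {
    bool xbios_enabled;     /* bit 0: XBIOS blit mode is active       */
    bool hardware_present;  /* bit 1: a BLiTTER exists in the machine */
};

static struct blit_status decode_blitmode(int ret)
{
    struct blit_status s;
    s.xbios_enabled    = (ret & 1) != 0;
    s.hardware_present = (ret & 2) != 0;
    return s;
}
```

For a demo that programs the registers directly, only `hardware_present` is of interest; whether the XBIOS itself uses the BLiTTER does not matter.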

Please note that this function does not exist in TOS revisions before TOS 1.02, so calling XBIOS #64 will fail on TOS 1.00. Therefore, to really make sure,

a.) check the word at address $00000002,
b.) if that is $0100, the machine is equipped with TOS 1.00 and does not have XBIOS #64, so don’t even try. It is also unlikely that the machine has a BLiTTER -> Set “No BLiTTER”
c.) if that exceeds $0100, call XBIOS #64 and check bit 1 of the return value: If that is “1” -> Set “BLiTTER”, if that is “0” -> Set “No BLiTTER”
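The decision procedure a.) to c.) can be sketched in C as follows. This is a host-side model under stated assumptions: the TOS version word is passed in as a parameter instead of being read from $00000002, and the XBIOS call is represented by a hypothetical callback type `blitmode_fn`. The two `fake_blitmode_...` functions are stand-ins for experimenting on a PC and have nothing to do with the real TOS.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for "call XBIOS #64 with this mode". */
typedef int (*blitmode_fn)(int mode);

/* tos_version: the word found at address $00000002
 * blitmode:    callback performing the XBIOS #64 call
 * Returns true if a BLiTTER is reported in hardware. */
static bool blitter_available(uint16_t tos_version, blitmode_fn blitmode)
{
    if (tos_version <= 0x0100u)       /* TOS 1.00: no XBIOS #64, don't call */
        return false;
    return (blitmode(-1) & 2) != 0;   /* mode -1 only queries; test bit 1 */
}

/* Host-side fakes, counting how often the "XBIOS" was called. */
static int xbios_calls = 0;
static int fake_blitmode_with_blitter(int mode)
{
    (void)mode;
    xbios_calls++;
    return 3;                         /* BLiTTER available and enabled */
}
static int fake_blitmode_without_blitter(int mode)
{
    (void)mode;
    xbios_calls++;
    return 0;                         /* no BLiTTER */
}
```

Passing the XBIOS call in as a callback keeps the version check testable: with a version word of $0100 the callback must never be invoked.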

f.) Summary

So this is the BLiTTER. As you have read above, it can do a few funky tricks, it has a couple of restrictions, it can assist with a few other things and it’s most certainly not restricted to merely copying bit-blocks around. In the golden era of Atari ST(E) demos, the BLiTTER was usually ignored. STE demos mainly focussed on hardware scrolling and DMA sound; if the BLiTTER was used at all, it was for a couple of sprites and that’s about it. Now that there are more complex algorithms such as c2p, it was about time to set the record straight and show how the BLiTTER can support these new-school effects. Hopefully, this tutorial has motivated a few people to toy around with the BLiTTER a bit and perhaps add one or two new effects to the list below. If so, this tutorial has served its purpose.

The Paranoid of Paradox 2012.

Appendix: Demo Resource List

BLiTTER sprites: Sonic in “Pacemaker”
BLiTTER fade down: Smearing 3D bobs in “Pacemaker”
BLiTTER melt-o-vision: Northern Lights effect in “Pacemaker”

Random Sized Sprites: Jigsaw Puzzle Zoomer in “again”
ENDMASK management: Twister in “again”

BLiTTER zooming: Radial Blur in “Blue Period”
Max2p: “Alternative Party Invitro” (incl. BLiTTER blur)
M2px: Overlaid Tunnel in “Blue Period”

BLiTTER modified code: Circle interference effect in “Blue Period”
BLiTTER modified code: Zooming double-alpha layer scroll in “SV2011”

Appendix: Original Atari BLiTTER sample code

* (c) 1987 Atari Corporation
*     All rights reserved
*
* BLiTTER BASE ADDRESS

BLiTTER         equ     $FF8A00

* BLiTTER REGISTER OFFSETS

Halftone        equ     0
Src_Xinc        equ     32
Src_Yinc        equ     34
Src_Addr        equ     36
Endmask1        equ     40
Endmask2        equ     42
Endmask3        equ     44
Dst_Xinc        equ     46
Dst_Yinc        equ     48
Dst_Addr        equ     50
X_Count         equ     54
Y_Count         equ     56
HOP             equ     58
OP              equ     59
Line_Num        equ     60
Skew            equ     61

* BLiTTER REGISTER FLAGS

fHOP_Source     equ     1
fHOP_Halftone   equ     0

fSkewFXSR       equ     7
fSkewNFSR       equ     6

fLineBusy       equ     7
fLineHog        equ     6
fLineSmudge     equ     5

* BLiTTER REGISTER MASKS

mHOP_Source     equ     $02
mHOP_Halftone   equ     $01

mSkewFXSR       equ     $80
mSkewNFSR       equ     $40

mLineBusy       equ     $80
mLineHog        equ     $40
mLineSmudge     equ     $20

*                     E n D m A s K   d A t A
*
* These tables are referenced by PC relative instructions.  Thus,
* the labels on these tables must remain within 128 bytes of the
* referencing instructions forever.  Amen.
*
* 0: Destination  1: Source   <<< Invert right end mask data >>>

lf_endmask:
        dc.w    $FFFF
rt_endmask:
        dc.w    $7FFF
        dc.w    $3FFF
        dc.w    $1FFF
        dc.w    $0FFF
        dc.w    $07FF
        dc.w    $03FF
        dc.w    $01FF
        dc.w    $00FF
        dc.w    $007F
        dc.w    $003F
        dc.w    $001F
        dc.w    $000F
        dc.w    $0007
        dc.w    $0003
        dc.w    $0001
        dc.w    $0000

* TiTLE:    BLiT_iT
*
* PuRPoSE:  Transfer a rectangular block of pixels located at an
*           arbitrary X,Y position in the source memory form to
*           another arbitrary X,Y position in the destination memory
*           form using replace mode (boolean operator 3).
*           The source and destination rectangles should not overlap.
*
* iN:
*       a4     pointer to 34 byte input parameter block
*
* NoTe:  This routine must be executed in supervisor mode as access
*        is made to hardware registers in the protected region of the
*        memory map.
*
*
*       I n P u T   p A r A m E t E r   B l O c K   o F f S e T s

SRC_FORM        equ      0 ; Base address of source memory form        .l
SRC_NXWD        equ      4 ; Offset between words in source plane      .w
SRC_NXLN        equ      6 ; Source form width                         .w
SRC_NXPL        equ      8 ; Offset between source planes              .w
SRC_XMIN        equ     10 ; Source blit rectangle minimum X           .w
SRC_YMIN        equ     12 ; Source blit rectangle minimum Y           .w

DST_FORM        equ     14 ; Base address of destination memory form   .l
DST_NXWD        equ     18 ; Offset between words in destination plane .w
DST_NXLN        equ     20 ; Destination form width                    .w
DST_NXPL        equ     22 ; Offset between destination planes         .w
DST_XMIN        equ     24 ; Destination blit rectangle minimum X      .w
DST_YMIN        equ     26 ; Destination blit rectangle minimum Y      .w

WIDTH           equ     28 ; Width of the blit rectangle               .w
HEIGHT          equ     30 ; Height of the blit rectangle              .w
PLANES          equ     32 ; Number of planes to blit                  .w

BLiT_iT:

        lea     BLiTTER,a5         ; a5-> BLiTTER register block

*
* Calculate Xmax coordinates from Xmin coordinates and width
*
        move.w  WIDTH(a4),d6
        subq.w  #1,d6              ; d6<- width-1

        move.w  SRC_XMIN(a4),d0    ; d0<- src Xmin
        move.w  d0,d1
        add.w   d6,d1              ; d1<- src Xmax = src Xmin + width-1

        move.w  DST_XMIN(a4),d2    ; d2<- dst Xmin
        move.w  d2,d3
        add.w   d6,d3              ; d3<- dst Xmax = dst Xmin + width-1

*
* Endmasks are derived from destination Xmin mod 16 and destination Xmax mod 16
*
        moveq.l #$0F,d6            ; d6<- mod 16 mask

        move.w  d2,d4              ; d4<- DST_XMIN
        and.w   d6,d4              ; d4<- DST_XMIN mod 16
        add.w   d4,d4              ; d4<- offset into left end mask tbl
        move.w  lf_endmask(pc,d4.w),d4 ; d4<- left endmask

        move.w  d3,d5              ; d5<- DST_XMAX
        and.w   d6,d5              ; d5<- DST_XMAX mod 16
        add.w   d5,d5              ; d5<- offset into right end mask tbl
        move.w  rt_endmask(pc,d5.w),d5 ; d5<- inverted right end mask
        not.w   d5                 ; d5<- right end mask

*
* Skew value is (destination Xmin mod 16 - source Xmin mod 16)
* & 0x000F.  Three discriminators are used to determine the
* states of FXSR and NFSR flags:
*
*    bit 0           0: Source Xmin mod 16 =< Destination Xmin mod 16
*                    1: Source Xmin mod 16 >  Destination Xmin mod 16
*
*    bit 1           0: SrcXmax/16-SrcXmin/16 <> DstXmax/16-DstXmin/16
*                          Source span              Destination span
*                    1: SrcXmax/16-SrcXmin/16 == DstXmax/16-DstXmin/16
*
*    bit 2           0: multiple word Destination span
*                    1: single word Destination span
*
*    These flags form an offset into a skew flag table yielding
*    correct FXSR and NFSR flag states for the given source and
*    destination alignments
*

        move.w  d2,d7              ; d7<- Dst Xmin
        and.w   d6,d7              ; d7<- Dst Xmin mod 16
        and.w   d0,d6              ; d6<- Src Xmin mod 16
        sub.w   d6,d7              ; d7<- Dst Xmin mod16-Src Xmin mod16
*                                  ; if Sx&F > Dx&F then cy:1 else cy:0
        clr.w   d6                 ; d6<- initial skew flag table index
        addx.w  d6,d6              ; d6[bit0]<- intraword alignment flag

        lsr.w   #4,d0              ; d0<- word offset to src Xmin
        lsr.w   #4,d1              ; d1<- word offset to src Xmax
        sub.w   d0,d1              ; d1<- Src span - 1

        lsr.w   #4,d2              ; d2<- word offset to dst Xmin
        lsr.w   #4,d3              ; d3<- word offset to dst Xmax
        sub.w   d2,d3              ; d3<- Dst span - 1
        bne     set_endmasks       ; 2nd discriminator is one word dst
* When destination spans a single word, both end masks are merged
* into Endmask1.  The other end masks will be ignored by the BLiTTER

        and.w   d5,d4              ; d4<- single word end mask
        addq.w  #4,d6              ; d6[bit2]:1 => single word dst

set_endmasks:

        move.w  d4,Endmask1(a5)    ; left end mask
        move.w  #$FFFF,Endmask2(a5) ; center end mask
        move.w  d5,Endmask3(a5)    ; right end mask

        cmp.w   d1,d3              ; the last discriminator is the
        bne     set_count          ; equality of src and dst spans

        addq.w  #2,d6              ; d6[bit1]:1 => equal spans

set_count:

        move.w  d3,d4
        addq.w  #1,d4              ; d4<- number of words in dst line
        move.w  d4,X_Count(a5)     ; set value in BLiTTER

* Calculate Source Starting address:
*
*   Source Form address              +
*  (Source Ymin * Source Form Width) +
* ((Source Xmin/16) * Source Xinc)

        move.l  SRC_FORM(a4),a0    ; a0-> start of Src form
        move.w  SRC_YMIN(a4),d4    ; d4<- offset in lines to Src Ymin
        move.w  SRC_NXLN(a4),d5    ; d5<- length of Src form line
        mulu    d5,d4              ; d4<- byte offset to (0, Ymin)
        add.l   d4,a0              ; a0-> (0, Ymin)

        move.w  SRC_NXWD(a4),d4    ; d4<- offset between consecutive
        move.w  d4,Src_Xinc(a5)    ;      words in Src plane

        mulu    d4,d0              ; d0<- offset to word containing Xmin
        add.l   d0,a0              ; a0-> 1st src word (Xmin, Ymin)

* Src_Yinc is the offset in bytes from the last word of one Source
* line to the first word of the next Source line

        mulu    d4,d1              ; d1<- width of src line in bytes
        sub.w   d1,d5              ; d5<- value added to pointer at end
        move.w  d5,Src_Yinc(a5)    ;      of line to reach start of next

* Calculate Destination starting address

        move.l  DST_FORM(a4),a1    ; a1-> start of dst form
        move.w  DST_YMIN(a4),d4    ; d4<- offset in lines to dst Ymin
        move.w  DST_NXLN(a4),d5    ; d5<- width of dst form

        mulu    d5,d4              ; d4<- byte offset to (0, Ymin)
        add.l   d4,a1              ; a1-> dst (0, Ymin)

        move.w  DST_NXWD(a4),d4    ; d4<- offset between consecutive
        move.w  d4,Dst_Xinc(a5)    ;      word in dst plane

        mulu    d4,d2              ; d2<- DST_NXWD * (DST_XMIN/16)
        add.l   d2,a1              ; a1-> 1st dst word (Xmin, Ymin)

* Calculate Destination Yinc

        mulu    d4,d3              ; d3<- width of dst line - DST_NXWD
        sub.w   d3,d5              ; d5<- value added to dst pointer at
        move.w  d5,Dst_Yinc(a5)    ;      end of line to reach next line

* The low nibble of the difference in Source and Destination alignment
* is the skew value.  Use the skew flag index to reference FXSR and NFSR
* states in skew flag table.

        and.b   #$0F,d7                ; d7<- isolated skew count
        or.b    skew_flags(pc,d6.w),d7 ; d7<- necessary flags and skew
        move.b  d7,Skew(a5)            ; load Skew register

        move.b  #mHOP_Source,HOP(a5)   ; set HOP to source only
        move.b  #3,OP(a5)              ; set OP to "replace" mode

        lea     Line_Num(a5),a2        ; fast ref to Line_Num register
        move.b  #fLineBusy,d2          ; fast ref to LineBusy flag
        move.w  PLANES(a4),d7          ; d7<- plane counter
        bra     begin

*
*          T h E   s E t T i N g   O f   S k E w    F l A g S
*
*
* QUALIFIERS   ACTIONS           BITBLT DIRECTION: LEFT -> RIGHT
*
* equal Sx&F>
* spans Dx&F FXSR NFSR
* 0     0     0    1 |..ssssssssssssss|ssssssssssssss..|
*                    |......dddddddddd|dddddddddddddddd|dd..............|
*
* 0     1     1    0 |......ssssssssss|ssssssssssssssss|ss..............|
*                    |..dddddddddddddd|dddddddddddddd..|
*
* 1     0     0    0 |..ssssssssssssss|ssssssssssssss..|
*                    |...ddddddddddddd|ddddddddddddddd.|
*
* 1     1     1    1 |...sssssssssssss|sssssssssssssss.|
*                    |..dddddddddddddd|dddddddddddddd..|


skew_flags:

        dc.b    mSkewNFSR          ; Source span < Destination span
        dc.b    mSkewFXSR          ; Source span > Destination span
        dc.b    0                  ; Spans equal Shift Source right
        dc.b    mSkewNFSR+mSkewFXSR ; Spans equal Shift Source left

* When Destination span is but a single word ...

        dc.b    0                  ; Implies a Source span of no words
        dc.b    mSkewFXSR          ; Source span of two words
        dc.b    0                  ; Skew flags aren't set if Source and
        dc.b    0                  ; Destination spans are both one word

next_plane:

        move.l  a0,Src_Addr(a5)    ; load Source pointer to this plane
        move.l  a1,Dst_Addr(a5)    ; load Destination ptr to this plane
        move.w  HEIGHT(a4),Y_Count(a5) ; load the line count

        move.b  #mLineBusy,(a2)    ; <<< start the BLiTTER >>>

        add.w   SRC_NXPL(a4),a0    ; a0-> start of next src plane
        add.w   DST_NXPL(a4),a1    ; a1-> start of next dst plane



* The BLiTTER is usually operated with the HOG flag cleared.
* In this mode the BLiTTER and the ST's CPU share the bus equally,
* each taking 64 bus cycles while the other is halted.  This mode
* allows interrupts to be fielded by the CPU while an extensive
* BitBlt is being processed by the BLiTTER.  There is a drawback in
* that BitBlts in this shared mode may take twice as long as BitBlts
* executed in hog mode.  Ninety percent of hog mode performance is
* achieved while retaining robust interrupt handling via a method
* of prematurely restarting the BLiTTER.  When control is returned
* to the cpu by the BLiTTER, the cpu immediately resets the BUSY
* flag, restarting the BLiTTER after just 7 bus cycles rather than
* after the usual 64 cycles.  Interrupts pending will be serviced
* before the restart code regains control.  If the BUSY flag is
* reset when the Y_Count is zero, the flag will remain clear
* indicating BLiTTER completion and the BLiTTER won't be restarted.
*
* (Interrupt service routines may explicitly halt the BLiTTER
* during execution time critical sections by clearing the BUSY flag.
* The original BUSY flag state must be restored however, before
* termination of the interrupt service routine.)


restart:

        bset.b  d2,(a2)            ; Restart BLiTTER and test the BUSY
        nop                        ; flag state.  The "nop" is executed
        bne     restart            ; prior to the BLiTTER restarting.
*                                  ; Quit if the BUSY flag was clear.

begin:  dbra    d7,next_plane

        rts
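
For readers who find the register shuffling in BLiT_iT hard to follow, the end mask and Skew calculation of the listing above can be modelled in a few lines of C. This is a sketch for checking values on a PC, not a replacement for the routine: the names `plan_blit` and `blit_plan` are invented here, addresses and Y increments are left out, and `width` is assumed to be at least 1.

```c
#include <stdint.h>

enum { SKEW_NFSR = 0x40, SKEW_FXSR = 0x80 };   /* mSkewNFSR, mSkewFXSR */

struct blit_plan {
    uint16_t endmask1;   /* left end mask (merged mask for one-word blits) */
    uint16_t endmask3;   /* right end mask                                 */
    uint16_t x_count;    /* destination words per line                     */
    uint8_t  skew;       /* Skew register value incl. FXSR/NFSR            */
};

/* All coordinates are in pixels; width must be >= 1. */
static struct blit_plan plan_blit(unsigned src_xmin, unsigned dst_xmin,
                                  unsigned width)
{
    /* Same contents and order as the skew_flags table in the listing. */
    static const uint8_t skew_flags[8] = {
        SKEW_NFSR, SKEW_FXSR, 0, SKEW_NFSR | SKEW_FXSR,
        0,         SKEW_FXSR, 0, 0
    };
    unsigned src_xmax = src_xmin + width - 1;
    unsigned dst_xmax = dst_xmin + width - 1;

    /* lf_endmask[n] = $FFFF >> n, rt_endmask[n] = $FFFF >> (n+1), inverted */
    uint16_t left  = (uint16_t)(0xFFFFu >> (dst_xmin & 15u));
    uint16_t right = (uint16_t)~(0xFFFFu >> ((dst_xmax & 15u) + 1u));

    unsigned src_span = (src_xmax >> 4) - (src_xmin >> 4);  /* words - 1 */
    unsigned dst_span = (dst_xmax >> 4) - (dst_xmin >> 4);  /* words - 1 */

    unsigned index = ((src_xmin & 15u) > (dst_xmin & 15u)) ? 1u : 0u;
    if (dst_span == 0) {           /* single word: merge both end masks */
        left &= right;
        index |= 4u;
    }
    if (src_span == dst_span)      /* equal spans */
        index |= 2u;

    struct blit_plan p;
    p.endmask1 = left;
    p.endmask3 = right;
    p.x_count  = (uint16_t)(dst_span + 1u);
    p.skew     = (uint8_t)(((dst_xmin - src_xmin) & 15u) | skew_flags[index]);
    return p;
}
```

Feeding it the first case from the skew flag diagram (source starting at pixel 2, destination at pixel 6, 28 pixels wide) gives three destination words, a Skew of 4 and NFSR set, which matches the "0 0 -> FXSR 0, NFSR 1" row.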

Appendix: Sophisticated ENDMASK, NFSR/FXSR/Skew tables

These two tables,

bl_shifttab for copies exceeding 16 pixels per line and

bl_shifttabs for copies up to 16 pixels per line

can make getting started with the BLiTTER a lot easier.
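
If you would rather compute an entry than look it up, the following C sketch reproduces the rows of bl_shifttab as listed below. Please treat it as an observation, not as the author's generator: the formulas were inferred from the printed rows and only checked against those, and the names `shifttab_offset`, `shifttab_row` and `shift_row` are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

struct shift_row {
    uint16_t endmask1;   /* Word 0 */
    uint16_t endmask3;   /* Word 1 */
    uint16_t skew;       /* Word 2: shift count, plus $40 (NFSR) on spill */
    uint16_t overflow;   /* Word 3: 15, or 31 if the line spills over     */
};

/* Byte offset of a row: high nibble = shifts AND 15, low nibble = size
 * AND 15, premultiplied by 8 because each row is 8 bytes long. */
static size_t shifttab_offset(unsigned shifts, unsigned size)
{
    return (size_t)((((shifts & 15u) << 4) | (size & 15u)) * 8u);
}

/* Recomputes one row. A size index of 0 stands for "16/16 pixels". */
static struct shift_row shifttab_row(unsigned shifts, unsigned size)
{
    unsigned s = shifts & 15u;
    unsigned z = size & 15u;
    if (z == 0)
        z = 16;

    unsigned last   = (s + z - 1u) & 15u;  /* last used bit in final word */
    int      spills = (s + z) > 16u;       /* pixels run into another word */

    struct shift_row r;
    r.endmask1 = (uint16_t)(0xFFFFu >> s);
    r.endmask3 = (uint16_t)~(0xFFFFu >> (last + 1u));
    r.skew     = (uint16_t)(s | (spills ? 0x40u : 0u));
    r.overflow = (uint16_t)(spills ? 31 : 15);
    return r;
}
```

For example, `shifttab_row(2, 15)` yields $3FFF, $8000, $42, 31, the same values as the "size 15/16 pixels" row in the "2 Shifts" group.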

bl_shifttab:
;
; This fairly complex table is meant to be used with an unsigned word offset:
;   High nibble 0..15 = Number of shifts   AND 15
;   Low nibble  0..15 = Size of the sprite AND 15
;     premultiplied by 8 because each line consists of 8 Bytes
;
; Word 0: ENDMASK 1, only depends on the number of shifts
; Word 1: ENDMASK 3, depends on both number of shifts and size
; Word 2: SKEW,      depends primarily on number of shifts and combination of both
; Word 3: Overflow,  depends on size+shifts, multiple of 8 plus rounding upwards
;                    If desired, it can help calculating the increment from line
;                    to line before dividing by 16-pixels-per-BLiT depending on
;                    NFSR and FXSR as defined in the word at offset 4
;
                DC.W $FFFF,$FFFF,$00,15 ;    size 16/16 pixels ; 0 Shifts
                DC.W $FFFF,$8000,$00,15 ;    size  1/16 pixels
                DC.W $FFFF,$C000,$00,15 ;    size  2/16 pixels
                DC.W $FFFF,$E000,$00,15 ;    size  3/16 pixels
                DC.W $FFFF,$F000,$00,15 ;    size  4/16 pixels
                DC.W $FFFF,$F800,$00,15 ;    size  5/16 pixels
                DC.W $FFFF,$FC00,$00,15 ;    size  6/16 pixels
                DC.W $FFFF,$FE00,$00,15 ;    size  7/16 pixels
                DC.W $FFFF,$FF00,$00,15 ;    size  8/16 pixels
                DC.W $FFFF,$FF80,$00,15 ;    size  9/16 pixels
                DC.W $FFFF,$FFC0,$00,15 ;    size 10/16 pixels
                DC.W $FFFF,$FFE0,$00,15 ;    size 11/16 pixels
                DC.W $FFFF,$FFF0,$00,15 ;    size 12/16 pixels
                DC.W $FFFF,$FFF8,$00,15 ;    size 13/16 pixels
                DC.W $FFFF,$FFFC,$00,15 ;    size 14/16 pixels
                DC.W $FFFF,$FFFE,$00,15 ;    size 15/16 pixels

                DC.W $7FFF,$8000,$41,31 ;    size 16/16 pixels ; 1 Shifts
                DC.W $7FFF,$C000,$01,15 ;    size  1/16 pixels
                DC.W $7FFF,$E000,$01,15 ;    size  2/16 pixels
                DC.W $7FFF,$F000,$01,15 ;    size  3/16 pixels
                DC.W $7FFF,$F800,$01,15 ;    size  4/16 pixels
                DC.W $7FFF,$FC00,$01,15 ;    size  5/16 pixels
                DC.W $7FFF,$FE00,$01,15 ;    size  6/16 pixels
                DC.W $7FFF,$FF00,$01,15 ;    size  7/16 pixels
                DC.W $7FFF,$FF80,$01,15 ;    size  8/16 pixels
                DC.W $7FFF,$FFC0,$01,15 ;    size  9/16 pixels
                DC.W $7FFF,$FFE0,$01,15 ;    size 10/16 pixels
                DC.W $7FFF,$FFF0,$01,15 ;    size 11/16 pixels
                DC.W $7FFF,$FFF8,$01,15 ;    size 12/16 pixels
                DC.W $7FFF,$FFFC,$01,15 ;    size 13/16 pixels
                DC.W $7FFF,$FFFE,$01,15 ;    size 14/16 pixels
                DC.W $7FFF,$FFFF,$01,15 ;    size 15/16 pixels

                DC.W $3FFF,$C000,$42,31 ;    size 16/16 pixels ; 2 Shifts
                DC.W $3FFF,$E000,$02,15 ;    size  1/16 pixels
                DC.W $3FFF,$F000,$02,15 ;    size  2/16 pixels
                DC.W $3FFF,$F800,$02,15 ;    size  3/16 pixels
                DC.W $3FFF,$FC00,$02,15 ;    size  4/16 pixels
                DC.W $3FFF,$FE00,$02,15 ;    size  5/16 pixels
                DC.W $3FFF,$FF00,$02,15 ;    size  6/16 pixels
                DC.W $3FFF,$FF80,$02,15 ;    size  7/16 pixels
                DC.W $3FFF,$FFC0,$02,15 ;    size  8/16 pixels
                DC.W $3FFF,$FFE0,$02,15 ;    size  9/16 pixels
                DC.W $3FFF,$FFF0,$02,15 ;    size 10/16 pixels
                DC.W $3FFF,$FFF8,$02,15 ;    size 11/16 pixels
                DC.W $3FFF,$FFFC,$02,15 ;    size 12/16 pixels
                DC.W $3FFF,$FFFE,$02,15 ;    size 13/16 pixels
                DC.W $3FFF,$FFFF,$02,15 ;    size 14/16 pixels
                DC.W $3FFF,$8000,$42,31 ;    size 15/16 pixels

                DC.W $1FFF,$E000,$43,31 ;    size 16/16 pixels ; 3 Shifts
                DC.W $1FFF,$F000,$03,15 ;    size  1/16 pixels
                DC.W $1FFF,$F800,$03,15 ;    size  2/16 pixels
                DC.W $1FFF,$FC00,$03,15 ;    size  3/16 pixels
                DC.W $1FFF,$FE00,$03,15 ;    size  4/16 pixels
                DC.W $1FFF,$FF00,$03,15 ;    size  5/16 pixels
                DC.W $1FFF,$FF80,$03,15 ;    size  6/16 pixels
                DC.W $1FFF,$FFC0,$03,15 ;    size  7/16 pixels
                DC.W $1FFF,$FFE0,$03,15 ;    size  8/16 pixels
                DC.W $1FFF,$FFF0,$03,15 ;    size  9/16 pixels
                DC.W $1FFF,$FFF8,$03,15 ;    size 10/16 pixels
                DC.W $1FFF,$FFFC,$03,15 ;    size 11/16 pixels
                DC.W $1FFF,$FFFE,$03,15 ;    size 12/16 pixels
                DC.W $1FFF,$FFFF,$03,15 ;    size 13/16 pixels
                DC.W $1FFF,$8000,$43,31 ;    size 14/16 pixels
                DC.W $1FFF,$C000,$43,31 ;    size 15/16 pixels

                DC.W $0FFF,$F000,$44,31 ;    size 16/16 pixels ; 4 Shifts
                DC.W $0FFF,$F800,$04,15 ;    size  1/16 pixels
                DC.W $0FFF,$FC00,$04,15 ;    size  2/16 pixels
                DC.W $0FFF,$FE00,$04,15 ;    size  3/16 pixels
                DC.W $0FFF,$FF00,$04,15 ;    size  4/16 pixels
                DC.W $0FFF,$FF80,$04,15 ;    size  5/16 pixels
                DC.W $0FFF,$FFC0,$04,15 ;    size  6/16 pixels
                DC.W $0FFF,$FFE0,$04,15 ;    size  7/16 pixels
                DC.W $0FFF,$FFF0,$04,15 ;    size  8/16 pixels
                DC.W $0FFF,$FFF8,$04,15 ;    size  9/16 pixels
                DC.W $0FFF,$FFFC,$04,15 ;    size 10/16 pixels
                DC.W $0FFF,$FFFE,$04,15 ;    size 11/16 pixels
                DC.W $0FFF,$FFFF,$04,15 ;    size 12/16 pixels
                DC.W $0FFF,$8000,$44,31 ;    size 13/16 pixels
                DC.W $0FFF,$C000,$44,31 ;    size 14/16 pixels
                DC.W $0FFF,$E000,$44,31 ;    size 15/16 pixels

                DC.W $07FF,$F800,$45,31 ;    size 16/16 pixels ; 5 Shifts
                DC.W $07FF,$FC00,$05,15 ;    size  1/16 pixels
                DC.W $07FF,$FE00,$05,15 ;    size  2/16 pixels
                DC.W $07FF,$FF00,$05,15 ;    size  3/16 pixels
                DC.W $07FF,$FF80,$05,15 ;    size  4/16 pixels
                DC.W $07FF,$FFC0,$05,15 ;    size  5/16 pixels
                DC.W $07FF,$FFE0,$05,15 ;    size  6/16 pixels
                DC.W $07FF,$FFF0,$05,15 ;    size  7/16 pixels
                DC.W $07FF,$FFF8,$05,15 ;    size  8/16 pixels
                DC.W $07FF,$FFFC,$05,15 ;    size  9/16 pixels
                DC.W $07FF,$FFFE,$05,15 ;    size 10/16 pixels
                DC.W $07FF,$FFFF,$05,15 ;    size 11/16 pixels
                DC.W $07FF,$8000,$45,31 ;    size 12/16 pixels
                DC.W $07FF,$C000,$45,31 ;    size 13/16 pixels
                DC.W $07FF,$E000,$45,31 ;    size 14/16 pixels
                DC.W $07FF,$F000,$45,31 ;    size 15/16 pixels

                DC.W $03FF,$FC00,$46,31 ;    size 16/16 pixels ; 6 Shifts
                DC.W $03FF,$FE00,$06,15 ;    size  1/16 pixels
                DC.W $03FF,$FF00,$06,15 ;    size  2/16 pixels
                DC.W $03FF,$FF80,$06,15 ;    size  3/16 pixels
                DC.W $03FF,$FFC0,$06,15 ;    size  4/16 pixels
                DC.W $03FF,$FFE0,$06,15 ;    size  5/16 pixels
                DC.W $03FF,$FFF0,$06,15 ;    size  6/16 pixels
                DC.W $03FF,$FFF8,$06,15 ;    size  7/16 pixels
                DC.W $03FF,$FFFC,$06,15 ;    size  8/16 pixels
                DC.W $03FF,$FFFE,$06,15 ;    size  9/16 pixels
                DC.W $03FF,$FFFF,$06,15 ;    size 10/16 pixels
                DC.W $03FF,$8000,$46,31 ;    size 11/16 pixels
                DC.W $03FF,$C000,$46,31 ;    size 12/16 pixels
                DC.W $03FF,$E000,$46,31 ;    size 13/16 pixels
                DC.W $03FF,$F000,$46,31 ;    size 14/16 pixels
                DC.W $03FF,$F800,$46,31 ;    size 15/16 pixels

                DC.W $01FF,$FE00,$47,31 ;    size 16/16 pixels ; 7 Shifts
                DC.W $01FF,$FF00,$07,15 ;    size  1/16 pixels
                DC.W $01FF,$FF80,$07,15 ;    size  2/16 pixels
                DC.W $01FF,$FFC0,$07,15 ;    size  3/16 pixels
                DC.W $01FF,$FFE0,$07,15 ;    size  4/16 pixels
                DC.W $01FF,$FFF0,$07,15 ;    size  5/16 pixels
                DC.W $01FF,$FFF8,$07,15 ;    size  6/16 pixels
                DC.W $01FF,$FFFC,$07,15 ;    size  7/16 pixels
                DC.W $01FF,$FFFE,$07,15 ;    size  8/16 pixels
                DC.W $01FF,$FFFF,$07,15 ;    size  9/16 pixels
                DC.W $01FF,$8000,$47,31 ;    size 10/16 pixels
                DC.W $01FF,$C000,$47,31 ;    size 11/16 pixels
                DC.W $01FF,$E000,$47,31 ;    size 12/16 pixels
                DC.W $01FF,$F000,$47,31 ;    size 13/16 pixels
                DC.W $01FF,$F800,$47,31 ;    size 14/16 pixels
                DC.W $01FF,$FC00,$47,31 ;    size 15/16 pixels

                DC.W $FF,$FF00,$48,31 ;    size 16/16 pixels ; 8 Shifts
                DC.W $FF,$FF80,$08,15 ;    size  1/16 pixels
                DC.W $FF,$FFC0,$08,15 ;    size  2/16 pixels
                DC.W $FF,$FFE0,$08,15 ;    size  3/16 pixels
                DC.W $FF,$FFF0,$08,15 ;    size  4/16 pixels
                DC.W $FF,$FFF8,$08,15 ;    size  5/16 pixels
                DC.W $FF,$FFFC,$08,15 ;    size  6/16 pixels
                DC.W $FF,$FFFE,$08,15 ;    size  7/16 pixels
                DC.W $FF,$FFFF,$08,15 ;    size  8/16 pixels
                DC.W $FF,$8000,$48,31 ;    size  9/16 pixels
                DC.W $FF,$C000,$48,31 ;    size 10/16 pixels
                DC.W $FF,$E000,$48,31 ;    size 11/16 pixels
                DC.W $FF,$F000,$48,31 ;    size 12/16 pixels
                DC.W $FF,$F800,$48,31 ;    size 13/16 pixels
                DC.W $FF,$FC00,$48,31 ;    size 14/16 pixels
                DC.W $FF,$FE00,$48,31 ;    size 15/16 pixels

                DC.W $7F,$FF80,$49,31 ;    size 16/16 pixels ; 9 Shifts
                DC.W $7F,$FFC0,$09,15 ;    size  1/16 pixels
                DC.W $7F,$FFE0,$09,15 ;    size  2/16 pixels
                DC.W $7F,$FFF0,$09,15 ;    size  3/16 pixels
                DC.W $7F,$FFF8,$09,15 ;    size  4/16 pixels
                DC.W $7F,$FFFC,$09,15 ;    size  5/16 pixels
                DC.W $7F,$FFFE,$09,15 ;    size  6/16 pixels
                DC.W $7F,$FFFF,$09,15 ;    size  7/16 pixels
                DC.W $7F,$8000,$49,31 ;    size  8/16 pixels
                DC.W $7F,$C000,$49,31 ;    size  9/16 pixels
                DC.W $7F,$E000,$49,31 ;    size 10/16 pixels
                DC.W $7F,$F000,$49,31 ;    size 11/16 pixels
                DC.W $7F,$F800,$49,31 ;    size 12/16 pixels
                DC.W $7F,$FC00,$49,31 ;    size 13/16 pixels
                DC.W $7F,$FE00,$49,31 ;    size 14/16 pixels
                DC.W $7F,$FF00,$49,31 ;    size 15/16 pixels

                DC.W $3F,$FFC0,$4A,31 ;    size 16/16 pixels ; 10 Shifts
                DC.W $3F,$FFE0,$0A,15 ;    size  1/16 pixels
                DC.W $3F,$FFF0,$0A,15 ;    size  2/16 pixels
                DC.W $3F,$FFF8,$0A,15 ;    size  3/16 pixels
                DC.W $3F,$FFFC,$0A,15 ;    size  4/16 pixels
                DC.W $3F,$FFFE,$0A,15 ;    size  5/16 pixels
                DC.W $3F,$FFFF,$0A,15 ;    size  6/16 pixels
                DC.W $3F,$8000,$4A,31 ;    size  7/16 pixels
                DC.W $3F,$C000,$4A,31 ;    size  8/16 pixels
                DC.W $3F,$E000,$4A,31 ;    size  9/16 pixels
                DC.W $3F,$F000,$4A,31 ;    size 10/16 pixels
                DC.W $3F,$F800,$4A,31 ;    size 11/16 pixels
                DC.W $3F,$FC00,$4A,31 ;    size 12/16 pixels
                DC.W $3F,$FE00,$4A,31 ;    size 13/16 pixels
                DC.W $3F,$FF00,$4A,31 ;    size 14/16 pixels
                DC.W $3F,$FF80,$4A,31 ;    size 15/16 pixels

                DC.W $1F,$FFE0,$4B,31 ;    size 16/16 pixels ; 11 Shifts
                DC.W $1F,$FFF0,$0B,15 ;    size  1/16 pixels
                DC.W $1F,$FFF8,$0B,15 ;    size  2/16 pixels
                DC.W $1F,$FFFC,$0B,15 ;    size  3/16 pixels
                DC.W $1F,$FFFE,$0B,15 ;    size  4/16 pixels
                DC.W $1F,$FFFF,$0B,15 ;    size  5/16 pixels
                DC.W $1F,$8000,$4B,31 ;    size  6/16 pixels
                DC.W $1F,$C000,$4B,31 ;    size  7/16 pixels
                DC.W $1F,$E000,$4B,31 ;    size  8/16 pixels
                DC.W $1F,$F000,$4B,31 ;    size  9/16 pixels
                DC.W $1F,$F800,$4B,31 ;    size 10/16 pixels
                DC.W $1F,$FC00,$4B,31 ;    size 11/16 pixels
                DC.W $1F,$FE00,$4B,31 ;    size 12/16 pixels
                DC.W $1F,$FF00,$4B,31 ;    size 13/16 pixels
                DC.W $1F,$FF80,$4B,31 ;    size 14/16 pixels
                DC.W $1F,$FFC0,$4B,31 ;    size 15/16 pixels

                DC.W $0F,$FFF0,$4C,31 ;    size 16/16 pixels ; 12 Shifts
                DC.W $0F,$FFF8,$0C,15 ;    size  1/16 pixels
                DC.W $0F,$FFFC,$0C,15 ;    size  2/16 pixels
                DC.W $0F,$FFFE,$0C,15 ;    size  3/16 pixels
                DC.W $0F,$FFFF,$0C,15 ;    size  4/16 pixels
                DC.W $0F,$8000,$4C,31 ;    size  5/16 pixels
                DC.W $0F,$C000,$4C,31 ;    size  6/16 pixels
                DC.W $0F,$E000,$4C,31 ;    size  7/16 pixels
                DC.W $0F,$F000,$4C,31 ;    size  8/16 pixels
                DC.W $0F,$F800,$4C,31 ;    size  9/16 pixels
                DC.W $0F,$FC00,$4C,31 ;    size 10/16 pixels
                DC.W $0F,$FE00,$4C,31 ;    size 11/16 pixels
                DC.W $0F,$FF00,$4C,31 ;    size 12/16 pixels
                DC.W $0F,$FF80,$4C,31 ;    size 13/16 pixels
                DC.W $0F,$FFC0,$4C,31 ;    size 14/16 pixels
                DC.W $0F,$FFE0,$4C,31 ;    size 15/16 pixels

                DC.W $07,$FFF8,$4D,31 ;    size 16/16 pixels ; 13 Shifts
                DC.W $07,$FFFC,$0D,15 ;    size  1/16 pixels
                DC.W $07,$FFFE,$0D,15 ;    size  2/16 pixels
                DC.W $07,$FFFF,$0D,15 ;    size  3/16 pixels
                DC.W $07,$8000,$4D,31 ;    size  4/16 pixels
                DC.W $07,$C000,$4D,31 ;    size  5/16 pixels
                DC.W $07,$E000,$4D,31 ;    size  6/16 pixels
                DC.W $07,$F000,$4D,31 ;    size  7/16 pixels
                DC.W $07,$F800,$4D,31 ;    size  8/16 pixels
                DC.W $07,$FC00,$4D,31 ;    size  9/16 pixels
                DC.W $07,$FE00,$4D,31 ;    size 10/16 pixels
                DC.W $07,$FF00,$4D,31 ;    size 11/16 pixels
                DC.W $07,$FF80,$4D,31 ;    size 12/16 pixels
                DC.W $07,$FFC0,$4D,31 ;    size 13/16 pixels
                DC.W $07,$FFE0,$4D,31 ;    size 14/16 pixels
                DC.W $07,$FFF0,$4D,31 ;    size 15/16 pixels

                DC.W $03,$FFFC,$4E,31 ;    size 16/16 pixels ; 14 Shifts
                DC.W $03,$FFFE,$0E,15 ;    size  1/16 pixels
                DC.W $03,$FFFF,$0E,15 ;    size  2/16 pixels
                DC.W $03,$8000,$4E,31 ;    size  3/16 pixels
                DC.W $03,$C000,$4E,31 ;    size  4/16 pixels
                DC.W $03,$E000,$4E,31 ;    size  5/16 pixels
                DC.W $03,$F000,$4E,31 ;    size  6/16 pixels
                DC.W $03,$F800,$4E,31 ;    size  7/16 pixels
                DC.W $03,$FC00,$4E,31 ;    size  8/16 pixels
                DC.W $03,$FE00,$4E,31 ;    size  9/16 pixels
                DC.W $03,$FF00,$4E,31 ;    size 10/16 pixels
                DC.W $03,$FF80,$4E,31 ;    size 11/16 pixels
                DC.W $03,$FFC0,$4E,31 ;    size 12/16 pixels
                DC.W $03,$FFE0,$4E,31 ;    size 13/16 pixels
                DC.W $03,$FFF0,$4E,31 ;    size 14/16 pixels
                DC.W $03,$FFF8,$4E,31 ;    size 15/16 pixels

                DC.W $01,$FFFE,$4F,31 ;    size 16/16 pixels ; 15 Shifts
                DC.W $01,$FFFF,$0F,15 ;    size  1/16 pixels
                DC.W $01,$8000,$4F,31 ;    size  2/16 pixels
                DC.W $01,$C000,$4F,31 ;    size  3/16 pixels
                DC.W $01,$E000,$4F,31 ;    size  4/16 pixels
                DC.W $01,$F000,$4F,31 ;    size  5/16 pixels
                DC.W $01,$F800,$4F,31 ;    size  6/16 pixels
                DC.W $01,$FC00,$4F,31 ;    size  7/16 pixels
                DC.W $01,$FE00,$4F,31 ;    size  8/16 pixels
                DC.W $01,$FF00,$4F,31 ;    size  9/16 pixels
                DC.W $01,$FF80,$4F,31 ;    size 10/16 pixels
                DC.W $01,$FFC0,$4F,31 ;    size 11/16 pixels
                DC.W $01,$FFE0,$4F,31 ;    size 12/16 pixels
                DC.W $01,$FFF0,$4F,31 ;    size 13/16 pixels
                DC.W $01,$FFF8,$4F,31 ;    size 14/16 pixels
                DC.W $01,$FFFC,$4F,31 ;    size 15/16 pixels

bl_shifttabs:
;
; This fairly complex table is meant to be indexed with an unsigned word offset:
;   High nibble 0..15 = Number of shifts   AND 15
;   Low nibble  0..15 = Size of the sprite AND 15 (a size of 16 becomes 0)
;     premultiplied by 8 because each entry consists of 8 bytes (4 words)
;
; Word 0: ENDMASK 1, depends on both number of shifts and size
; Word 1: ENDMASK 3, depends on both number of shifts and size,
;                    0 if the sprite fits into a single word
; Word 2: SKEW,      low nibble is the number of shifts, NFSR ($40) is set
;                    whenever shifts+size spill into a second word
; Word 3: Overflow,  shifts+size rounded up to a multiple of 16, minus 1
;                    (15 or 31). If desired, it can help calculating the
;                    increment from line to line before dividing by
;                    16-pixels-per-BLiT depending on NFSR and FXSR as defined
;                    in the SKEW word (byte offset 4 of the entry)
;
                DC.W $FFFF,$00,$00,15 ;    size 16/16 pixels ; 0 Shifts
                DC.W $8000,$00,$00,15 ;    size  1/16 pixels
                DC.W $C000,$00,$00,15 ;    size  2/16 pixels
                DC.W $E000,$00,$00,15 ;    size  3/16 pixels
                DC.W $F000,$00,$00,15 ;    size  4/16 pixels
                DC.W $F800,$00,$00,15 ;    size  5/16 pixels
                DC.W $FC00,$00,$00,15 ;    size  6/16 pixels
                DC.W $FE00,$00,$00,15 ;    size  7/16 pixels
                DC.W $FF00,$00,$00,15 ;    size  8/16 pixels
                DC.W $FF80,$00,$00,15 ;    size  9/16 pixels
                DC.W $FFC0,$00,$00,15 ;    size 10/16 pixels
                DC.W $FFE0,$00,$00,15 ;    size 11/16 pixels
                DC.W $FFF0,$00,$00,15 ;    size 12/16 pixels
                DC.W $FFF8,$00,$00,15 ;    size 13/16 pixels
                DC.W $FFFC,$00,$00,15 ;    size 14/16 pixels
                DC.W $FFFE,$00,$00,15 ;    size 15/16 pixels

                DC.W $7FFF,$8000,$41,31 ;    size 16/16 pixels ; 1 Shift
                DC.W $4000,$00,$01,15 ;    size  1/16 pixels
                DC.W $6000,$00,$01,15 ;    size  2/16 pixels
                DC.W $7000,$00,$01,15 ;    size  3/16 pixels
                DC.W $7800,$00,$01,15 ;    size  4/16 pixels
                DC.W $7C00,$00,$01,15 ;    size  5/16 pixels
                DC.W $7E00,$00,$01,15 ;    size  6/16 pixels
                DC.W $7F00,$00,$01,15 ;    size  7/16 pixels
                DC.W $7F80,$00,$01,15 ;    size  8/16 pixels
                DC.W $7FC0,$00,$01,15 ;    size  9/16 pixels
                DC.W $7FE0,$00,$01,15 ;    size 10/16 pixels
                DC.W $7FF0,$00,$01,15 ;    size 11/16 pixels
                DC.W $7FF8,$00,$01,15 ;    size 12/16 pixels
                DC.W $7FFC,$00,$01,15 ;    size 13/16 pixels
                DC.W $7FFE,$00,$01,15 ;    size 14/16 pixels
                DC.W $7FFF,$00,$01,15 ;    size 15/16 pixels

                DC.W $3FFF,$C000,$42,31 ; size 16/16 pixels ; 2 Shifts
                DC.W $2000,$00,$02,15 ;    size  1/16 pixels
                DC.W $3000,$00,$02,15 ;    size  2/16 pixels
                DC.W $3800,$00,$02,15 ;    size  3/16 pixels
                DC.W $3C00,$00,$02,15 ;    size  4/16 pixels
                DC.W $3E00,$00,$02,15 ;    size  5/16 pixels
                DC.W $3F00,$00,$02,15 ;    size  6/16 pixels
                DC.W $3F80,$00,$02,15 ;    size  7/16 pixels
                DC.W $3FC0,$00,$02,15 ;    size  8/16 pixels
                DC.W $3FE0,$00,$02,15 ;    size  9/16 pixels
                DC.W $3FF0,$00,$02,15 ;    size 10/16 pixels
                DC.W $3FF8,$00,$02,15 ;    size 11/16 pixels
                DC.W $3FFC,$00,$02,15 ;    size 12/16 pixels
                DC.W $3FFE,$00,$02,15 ;    size 13/16 pixels
                DC.W $3FFF,$00,$02,15 ;    size 14/16 pixels
                DC.W $3FFF,$8000,$42,31 ; size 15/16 pixels

                DC.W $1FFF,$E000,$43,31 ; size 16/16 pixels ; 3 Shifts
                DC.W $1000,$00,$03,15 ;    size  1/16 pixels
                DC.W $1800,$00,$03,15 ;    size  2/16 pixels
                DC.W $1C00,$00,$03,15 ;    size  3/16 pixels
                DC.W $1E00,$00,$03,15 ;    size  4/16 pixels
                DC.W $1F00,$00,$03,15 ;    size  5/16 pixels
                DC.W $1F80,$00,$03,15 ;    size  6/16 pixels
                DC.W $1FC0,$00,$03,15 ;    size  7/16 pixels
                DC.W $1FE0,$00,$03,15 ;    size  8/16 pixels
                DC.W $1FF0,$00,$03,15 ;    size  9/16 pixels
                DC.W $1FF8,$00,$03,15 ;    size 10/16 pixels
                DC.W $1FFC,$00,$03,15 ;    size 11/16 pixels
                DC.W $1FFE,$00,$03,15 ;    size 12/16 pixels
                DC.W $1FFF,$00,$03,15 ;    size 13/16 pixels
                DC.W $1FFF,$8000,$43,31 ; size 14/16 pixels
                DC.W $1FFF,$C000,$43,31 ; size 15/16 pixels

                DC.W $0FFF,$F000,$44,31 ; size 16/16 pixels ; 4 Shifts
                DC.W $0800,$00,$04,15 ;    size  1/16 pixels
                DC.W $0C00,$00,$04,15 ;    size  2/16 pixels
                DC.W $0E00,$00,$04,15 ;    size  3/16 pixels
                DC.W $0F00,$00,$04,15 ;    size  4/16 pixels
                DC.W $0F80,$00,$04,15 ;    size  5/16 pixels
                DC.W $0FC0,$00,$04,15 ;    size  6/16 pixels
                DC.W $0FE0,$00,$04,15 ;    size  7/16 pixels
                DC.W $0FF0,$00,$04,15 ;    size  8/16 pixels
                DC.W $0FF8,$00,$04,15 ;    size  9/16 pixels
                DC.W $0FFC,$00,$04,15 ;    size 10/16 pixels
                DC.W $0FFE,$00,$04,15 ;    size 11/16 pixels
                DC.W $0FFF,$00,$04,15 ;    size 12/16 pixels
                DC.W $0FFF,$8000,$44,31 ; size 13/16 pixels
                DC.W $0FFF,$C000,$44,31 ; size 14/16 pixels
                DC.W $0FFF,$E000,$44,31 ; size 15/16 pixels

                DC.W $07FF,$F800,$45,31 ; size 16/16 pixels ; 5 Shifts
                DC.W $0400,$00,$05,15 ;    size  1/16 pixels
                DC.W $0600,$00,$05,15 ;    size  2/16 pixels
                DC.W $0700,$00,$05,15 ;    size  3/16 pixels
                DC.W $0780,$00,$05,15 ;    size  4/16 pixels
                DC.W $07C0,$00,$05,15 ;    size  5/16 pixels
                DC.W $07E0,$00,$05,15 ;    size  6/16 pixels
                DC.W $07F0,$00,$05,15 ;    size  7/16 pixels
                DC.W $07F8,$00,$05,15 ;    size  8/16 pixels
                DC.W $07FC,$00,$05,15 ;    size  9/16 pixels
                DC.W $07FE,$00,$05,15 ;    size 10/16 pixels
                DC.W $07FF,$00,$05,15 ;    size 11/16 pixels
                DC.W $07FF,$8000,$45,31 ; size 12/16 pixels
                DC.W $07FF,$C000,$45,31 ; size 13/16 pixels
                DC.W $07FF,$E000,$45,31 ; size 14/16 pixels
                DC.W $07FF,$F000,$45,31 ; size 15/16 pixels

                DC.W $03FF,$FC00,$46,31 ; size 16/16 pixels ; 6 Shifts
                DC.W $0200,$00,$06,15 ;    size  1/16 pixels
                DC.W $0300,$00,$06,15 ;    size  2/16 pixels
                DC.W $0380,$00,$06,15 ;    size  3/16 pixels
                DC.W $03C0,$00,$06,15 ;    size  4/16 pixels
                DC.W $03E0,$00,$06,15 ;    size  5/16 pixels
                DC.W $03F0,$00,$06,15 ;    size  6/16 pixels
                DC.W $03F8,$00,$06,15 ;    size  7/16 pixels
                DC.W $03FC,$00,$06,15 ;    size  8/16 pixels
                DC.W $03FE,$00,$06,15 ;    size  9/16 pixels
                DC.W $03FF,$00,$06,15 ;    size 10/16 pixels
                DC.W $03FF,$8000,$46,31 ; size 11/16 pixels
                DC.W $03FF,$C000,$46,31 ; size 12/16 pixels
                DC.W $03FF,$E000,$46,31 ; size 13/16 pixels
                DC.W $03FF,$F000,$46,31 ; size 14/16 pixels
                DC.W $03FF,$F800,$46,31 ; size 15/16 pixels

                DC.W $01FF,$FE00,$47,31 ; size 16/16 pixels ; 7 Shifts
                DC.W $0100,$00,$07,15 ;    size  1/16 pixels
                DC.W $0180,$00,$07,15 ;    size  2/16 pixels
                DC.W $01C0,$00,$07,15 ;    size  3/16 pixels
                DC.W $01E0,$00,$07,15 ;    size  4/16 pixels
                DC.W $01F0,$00,$07,15 ;    size  5/16 pixels
                DC.W $01F8,$00,$07,15 ;    size  6/16 pixels
                DC.W $01FC,$00,$07,15 ;    size  7/16 pixels
                DC.W $01FE,$00,$07,15 ;    size  8/16 pixels
                DC.W $01FF,$00,$07,15 ;    size  9/16 pixels
                DC.W $01FF,$8000,$47,31 ; size 10/16 pixels
                DC.W $01FF,$C000,$47,31 ; size 11/16 pixels
                DC.W $01FF,$E000,$47,31 ; size 12/16 pixels
                DC.W $01FF,$F000,$47,31 ; size 13/16 pixels
                DC.W $01FF,$F800,$47,31 ; size 14/16 pixels
                DC.W $01FF,$FC00,$47,31 ; size 15/16 pixels

                DC.W $FF,$FF00,$48,31 ; size 16/16 pixels ; 8 Shifts
                DC.W $80,$00,$08,15 ;    size  1/16 pixels
                DC.W $C0,$00,$08,15 ;    size  2/16 pixels
                DC.W $E0,$00,$08,15 ;    size  3/16 pixels
                DC.W $F0,$00,$08,15 ;    size  4/16 pixels
                DC.W $F8,$00,$08,15 ;    size  5/16 pixels
                DC.W $FC,$00,$08,15 ;    size  6/16 pixels
                DC.W $FE,$00,$08,15 ;    size  7/16 pixels
                DC.W $FF,$00,$08,15 ;    size  8/16 pixels
                DC.W $FF,$8000,$48,31 ; size  9/16 pixels
                DC.W $FF,$C000,$48,31 ; size 10/16 pixels
                DC.W $FF,$E000,$48,31 ; size 11/16 pixels
                DC.W $FF,$F000,$48,31 ; size 12/16 pixels
                DC.W $FF,$F800,$48,31 ; size 13/16 pixels
                DC.W $FF,$FC00,$48,31 ; size 14/16 pixels
                DC.W $FF,$FE00,$48,31 ; size 15/16 pixels

                DC.W $7F,$FF80,$49,31 ; size 16/16 pixels ; 9 Shifts
                DC.W $40,$00,$09,15 ;    size  1/16 pixels
                DC.W $60,$00,$09,15 ;    size  2/16 pixels
                DC.W $70,$00,$09,15 ;    size  3/16 pixels
                DC.W $78,$00,$09,15 ;    size  4/16 pixels
                DC.W $7C,$00,$09,15 ;    size  5/16 pixels
                DC.W $7E,$00,$09,15 ;    size  6/16 pixels
                DC.W $7F,$00,$09,15 ;    size  7/16 pixels
                DC.W $7F,$8000,$49,31 ; size  8/16 pixels
                DC.W $7F,$C000,$49,31 ; size  9/16 pixels
                DC.W $7F,$E000,$49,31 ; size 10/16 pixels
                DC.W $7F,$F000,$49,31 ; size 11/16 pixels
                DC.W $7F,$F800,$49,31 ; size 12/16 pixels
                DC.W $7F,$FC00,$49,31 ; size 13/16 pixels
                DC.W $7F,$FE00,$49,31 ; size 14/16 pixels
                DC.W $7F,$FF00,$49,31 ; size 15/16 pixels

                DC.W $3F,$FFC0,$4A,31 ; size 16/16 pixels ; 10 Shifts
                DC.W $20,$00,$0A,15 ;    size  1/16 pixels
                DC.W $30,$00,$0A,15 ;    size  2/16 pixels
                DC.W $38,$00,$0A,15 ;    size  3/16 pixels
                DC.W $3C,$00,$0A,15 ;    size  4/16 pixels
                DC.W $3E,$00,$0A,15 ;    size  5/16 pixels
                DC.W $3F,$00,$0A,15 ;    size  6/16 pixels
                DC.W $3F,$8000,$4A,31 ; size  7/16 pixels
                DC.W $3F,$C000,$4A,31 ; size  8/16 pixels
                DC.W $3F,$E000,$4A,31 ; size  9/16 pixels
                DC.W $3F,$F000,$4A,31 ; size 10/16 pixels
                DC.W $3F,$F800,$4A,31 ; size 11/16 pixels
                DC.W $3F,$FC00,$4A,31 ; size 12/16 pixels
                DC.W $3F,$FE00,$4A,31 ; size 13/16 pixels
                DC.W $3F,$FF00,$4A,31 ; size 14/16 pixels
                DC.W $3F,$FF80,$4A,31 ; size 15/16 pixels

                DC.W $1F,$FFE0,$4B,31 ; size 16/16 pixels ; 11 Shifts
                DC.W $10,$00,$0B,15 ;    size  1/16 pixels
                DC.W $18,$00,$0B,15 ;    size  2/16 pixels
                DC.W $1C,$00,$0B,15 ;    size  3/16 pixels
                DC.W $1E,$00,$0B,15 ;    size  4/16 pixels
                DC.W $1F,$00,$0B,15 ;    size  5/16 pixels
                DC.W $1F,$8000,$4B,31 ; size  6/16 pixels
                DC.W $1F,$C000,$4B,31 ; size  7/16 pixels
                DC.W $1F,$E000,$4B,31 ; size  8/16 pixels
                DC.W $1F,$F000,$4B,31 ; size  9/16 pixels
                DC.W $1F,$F800,$4B,31 ; size 10/16 pixels
                DC.W $1F,$FC00,$4B,31 ; size 11/16 pixels
                DC.W $1F,$FE00,$4B,31 ; size 12/16 pixels
                DC.W $1F,$FF00,$4B,31 ; size 13/16 pixels
                DC.W $1F,$FF80,$4B,31 ; size 14/16 pixels
                DC.W $1F,$FFC0,$4B,31 ; size 15/16 pixels

                DC.W $0F,$FFF0,$4C,31 ; size 16/16 pixels ; 12 Shifts
                DC.W $08,$00,$0C,15 ;    size  1/16 pixels
                DC.W $0C,$00,$0C,15 ;    size  2/16 pixels
                DC.W $0E,$00,$0C,15 ;    size  3/16 pixels
                DC.W $0F,$00,$0C,15 ;    size  4/16 pixels
                DC.W $0F,$8000,$4C,31 ; size  5/16 pixels
                DC.W $0F,$C000,$4C,31 ; size  6/16 pixels
                DC.W $0F,$E000,$4C,31 ; size  7/16 pixels
                DC.W $0F,$F000,$4C,31 ; size  8/16 pixels
                DC.W $0F,$F800,$4C,31 ; size  9/16 pixels
                DC.W $0F,$FC00,$4C,31 ; size 10/16 pixels
                DC.W $0F,$FE00,$4C,31 ; size 11/16 pixels
                DC.W $0F,$FF00,$4C,31 ; size 12/16 pixels
                DC.W $0F,$FF80,$4C,31 ; size 13/16 pixels
                DC.W $0F,$FFC0,$4C,31 ; size 14/16 pixels
                DC.W $0F,$FFE0,$4C,31 ; size 15/16 pixels

                DC.W $07,$FFF8,$4D,31 ; size 16/16 pixels ; 13 Shifts
                DC.W $04,$00,$0D,15 ;    size  1/16 pixels
                DC.W $06,$00,$0D,15 ;    size  2/16 pixels
                DC.W $07,$00,$0D,15 ;    size  3/16 pixels
                DC.W $07,$8000,$4D,31 ; size  4/16 pixels
                DC.W $07,$C000,$4D,31 ; size  5/16 pixels
                DC.W $07,$E000,$4D,31 ; size  6/16 pixels
                DC.W $07,$F000,$4D,31 ; size  7/16 pixels
                DC.W $07,$F800,$4D,31 ; size  8/16 pixels
                DC.W $07,$FC00,$4D,31 ; size  9/16 pixels
                DC.W $07,$FE00,$4D,31 ; size 10/16 pixels
                DC.W $07,$FF00,$4D,31 ; size 11/16 pixels
                DC.W $07,$FF80,$4D,31 ; size 12/16 pixels
                DC.W $07,$FFC0,$4D,31 ; size 13/16 pixels
                DC.W $07,$FFE0,$4D,31 ; size 14/16 pixels
                DC.W $07,$FFF0,$4D,31 ; size 15/16 pixels

                DC.W $03,$FFFC,$4E,31 ; size 16/16 pixels ; 14 Shifts
                DC.W $02,$00,$0E,15 ;    size  1/16 pixels
                DC.W $03,$00,$0E,15 ;    size  2/16 pixels
                DC.W $03,$8000,$4E,31 ; size  3/16 pixels
                DC.W $03,$C000,$4E,31 ; size  4/16 pixels
                DC.W $03,$E000,$4E,31 ; size  5/16 pixels
                DC.W $03,$F000,$4E,31 ; size  6/16 pixels
                DC.W $03,$F800,$4E,31 ; size  7/16 pixels
                DC.W $03,$FC00,$4E,31 ; size  8/16 pixels
                DC.W $03,$FE00,$4E,31 ; size  9/16 pixels
                DC.W $03,$FF00,$4E,31 ; size 10/16 pixels
                DC.W $03,$FF80,$4E,31 ; size 11/16 pixels
                DC.W $03,$FFC0,$4E,31 ; size 12/16 pixels
                DC.W $03,$FFE0,$4E,31 ; size 13/16 pixels
                DC.W $03,$FFF0,$4E,31 ; size 14/16 pixels
                DC.W $03,$FFF8,$4E,31 ; size 15/16 pixels

                DC.W $01,$FFFE,$4F,31 ; size 16/16 pixels ; 15 Shifts
                DC.W $01,$00,$0F,15 ;    size  1/16 pixels
                DC.W $01,$8000,$4F,31 ; size  2/16 pixels
                DC.W $01,$C000,$4F,31 ; size  3/16 pixels
                DC.W $01,$E000,$4F,31 ; size  4/16 pixels
                DC.W $01,$F000,$4F,31 ; size  5/16 pixels
                DC.W $01,$F800,$4F,31 ; size  6/16 pixels
                DC.W $01,$FC00,$4F,31 ; size  7/16 pixels
                DC.W $01,$FE00,$4F,31 ; size  8/16 pixels
                DC.W $01,$FF00,$4F,31 ; size  9/16 pixels
                DC.W $01,$FF80,$4F,31 ; size 10/16 pixels
                DC.W $01,$FFC0,$4F,31 ; size 11/16 pixels
                DC.W $01,$FFE0,$4F,31 ; size 12/16 pixels
                DC.W $01,$FFF0,$4F,31 ; size 13/16 pixels
                DC.W $01,$FFF8,$4F,31 ; size 14/16 pixels
                DC.W $01,$FFFC,$4F,31 ; size 15/16 pixels
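
;
; Lookup sketch (not part of the original listing): one possible way to
; fetch an entry from bl_shifttabs as described in the table header.
; The routine name bl_get_shifttab and the register usage are assumptions
; made for this example only; adapt them to your own code.
;
; In:  D0.w = X position of the sprite (only bits 0..3 are used)
;      D1.w = width of the sprite in pixels, 1..16 (trashed)
; Out: D0.w = byte offset into the table (0..2040)
;      A0   = pointer to the 8-byte entry
;
; Worked example: 3 shifts, 14 pixels -> index $3E -> offset $01F0
;                 -> entry $1FFF,$8000,$43,31
;
bl_get_shifttab:
                AND.W   #15,d0          ; number of shifts AND 15
                LSL.W   #4,d0           ; ... moved into the high nibble
                AND.W   #15,d1          ; size AND 15 (16 pixels -> 0)
                OR.W    d1,d0           ; combined index 0..255
                LSL.W   #3,d0           ; 8 bytes per entry
                LEA     bl_shifttabs,a0
                ADDA.W  d0,a0           ; A0 -> word 0 (ENDMASK 1) of the entry
                RTS
;
; The caller could then feed the BLiTTER along these lines (again only a
; sketch; the addresses are the usual BLiTTER register locations):
;
;               MOVE.W  (a0)+,$FFFF8A28.w   ; word 0 -> ENDMASK 1
;               MOVE.W  (a0)+,$FFFF8A2C.w   ; word 1 -> ENDMASK 3
;               MOVE.B  1(a0),$FFFF8A3D.w   ; low byte of word 2 -> SKEW (incl. NFSR)
;               MOVE.W  2(a0),d2            ; word 3 -> overflow (15 or 31)
;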