How ca65 works

SNES game development, continued…


Just one more subject before we can actually start writing our SNES program: using the assembler. You should have read some of the 6502 tutorials, and read up on 65816 assembly basics, before going any further.

First, we need to write our program in a text editor. I use Notepad++. You can use any similar app that can save a plain text file. We will save our files as .s or .asm. It might help if you add the path to the ca65 “bin” folder to your environment variables, so Windows can find it. You can also just type the line below in the command prompt, which tells the system to also look for ca65 and ld65 in a “bin” folder that sits one level up from the current directory.

set path=%path%;..\bin\

ca65 is a command line tool. If you just double click ca65, a box will open and then close. To run it, you first need to open a command prompt (terminal). To open a command prompt in Windows 10, click on the address bar of the folder window, type CMD, and press Enter. A black box should appear. You would type something like…

ca65 main.asm -g

for each assembly file. The -g means include the debugging symbols. If it assembles correctly, you should have a .o file (an object file) of the same name for each one. Then you use another program, ld65 (the linker), to put them all together, using a .cfg file as a map of how all the pieces go together.

ld65 -C lorom256k.cfg -o program1.sfc main.o -Ln labels.txt

The -C is to indicate the .cfg filename (lorom256k.cfg). The -o indicates the output filename (program1.sfc). Then it lists all the object files (there is only 1, main.o). Finally, the -Ln labels.txt outputs the addresses of all the labels (for debugging purposes).

I use a batch file to automate the writes to the command line. Instead of opening a command prompt box, I just double click on the compile.bat file. I don’t want to go into detail about writing batch files, but mostly you will just need to add a ca65 line for each assembly file (unless they are “included” in the main assembly file, in which case they become part of that asm file). Then edit the ld65 line to include all object files.

Here are some links to the ca65 and ld65 documents.

The example code (I will post it next time) has a .cfg file and some basic assembly files just to get to square one. There is some initial code, which zeroes the RAM and the hardware registers back to a standard state. We don’t want to touch that code. It works. Then there is a header section of the ROM so that emulators will know what kind of SNES file we have.

.segment "SNESHEADER"

.byte "ABCDEFGHIJKLMNOPQRSTU" ;rom name 21 chars
.byte $30 ;LoROM FastROM
.byte $00 ; extra chips in cartridge, 00: no extra RAM; 02: RAM with battery
.byte $08 ; ROM size (2^# kByte, 8 = 256kB)
.byte $00 ; backup RAM size
.byte $01 ;US
.byte $33 ; publisher id
.byte $00 ; ROM revision number
.word $0000 ; $FFFF minus checksum (the checksum complement comes first)
.word $0000 ; checksum of all bytes

The checksum isn’t actually important. If it’s wrong, nothing bad will happen. The important line is the one that says “LoROM FastROM” after it.

And there are VECTORS here. The vectors are part of how the 65816 chip works. It is a table of addresses of important program areas. The reset vector is where we jump when the SNES is first turned on, or if the user presses RESET. There are some interrupt vectors like NMI and IRQ which we can discuss later. The important thing is that our reset vector points to the start of our init code, and that the end of the init code jumps to our main code. Also, our reset code MUST be in bank 00.

;ffe4 – native mode vectors
COP (not used)
BRK
ABORT (not used)
NMI
RESET (not used in native mode)
IRQ

;fff4 – emulation mode vectors
COP (not used)
(not used)
ABORT (not used)
NMI
RESET (yes!)
IRQ/BRK


Let’s talk about the basic terminology of assembly files.


Constants

Foo = 62

Constants look like this. Foo is just a symbol that the assembler will convert to a number at assembly time. It should go above the code that uses it.

LDA #Foo …becomes… LDA #62
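A constant can also be built from other constants. Here is a small sketch of that; the names SCREEN_HEIGHT and HALF_HEIGHT are made up for illustration and are not part of the example code.

```asm
.p816
.smart

SCREEN_HEIGHT = 224                 ; hypothetical constant
HALF_HEIGHT = SCREEN_HEIGHT / 2     ; worked out by the assembler, not at run time

.segment "CODE"
    sep #$20                        ; 8-bit A
    lda #HALF_HEIGHT                ; assembles the same as lda #112
```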



Variables

There are 2 types of variables: BSS (standard) and Zero Page. On the SNES we call them Direct Page, but the assembler still calls them zero page. You have to put their definitions in a zeropage segment, which our linker file will specifically define as zeropage type.

.segment "ZEROPAGE"

temp1: .res 2

This reserves 2 bytes for the variable “temp1”.

.segment "BSS"

pal_buffer: .res 512

This reserves 512 bytes for a palette buffer. The linker .cfg file shown later places the BSS segment in the $0100-$1EFF range.

Our code will go in a ROM / read-only type segment.

.segment "CODE"

LDA temp1

STA pal_buffer



Labels

Main:
  LDA #1
  STA $100

Main: is a label. It should be flush left in the line. To the assembler, Main is a number, an address in the ROM file. We could then jump to Main…
jmp Main
or branch to Main…
bra Main

One assembly file may not know the value of a label in another file. So we might need a .export Main in the file where Main lives, and a .import Main in the other file.
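As a rough sketch of how that could look across two files (the file names main.asm and init.asm and the label Init are invented for this example, and the real example code may be organized differently):

```asm
; ---- main.asm (hypothetical file) ----
.p816
.smart

.export Main            ; let other files see the label Main

.segment "CODE"
Main:
    bra Main            ; placeholder main loop

; ---- init.asm (hypothetical file) ----
.p816
.smart

.import Main            ; Main lives in another file, the linker fills in its address

.segment "CODE"
Init:
    ; ...init code would go here...
    jmp Main            ; the end of the init code jumps to the main code
```

Each file gets its own ca65 line, and both .o files are then listed on the ld65 line.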



Instructions

Also called opcodes. These are 3 letter mnemonics that the assembler converts to machine code. Some assemblers require whitespace to be on the left of the instructions (such as a tab or 2-3 spaces). I don’t believe ca65 requires this, but you might as well follow that standard practice.

  LDA cats
  AND #1
  ADC #$23
  STA cats
  JSR sleep



Comments

Use a semicolon ; to start a comment. The assembler will ignore anything after the semicolon. In the linker .cfg file, use # to start a comment.



Directives

These are commands that the assembler will understand.

.segment "blah"
.p816
.smart
.a16
.a8
.i16
.i8
.byte $12
.word $1234

.segment tells the assembler that everything below this should go in the “blah” segment. .p816 tells it that we are using a 65816 CPU. .smart means automatically set the assembler to 8-bit or 16-bit depending on SEP and REP instructions. .a16 sets the assembler to generate 16-bit accumulator instructions, .a8 for 8-bit. .i16 sets the assembler to generate 16-bit index instructions, .i8 for 8-bit. .byte inserts an 8-bit value into the ROM ($12 in this example). .word inserts a 16-bit value into the ROM ($34 then $12 in this example).
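Put together, the top of a source file might look something like this. It is only a sketch: the label Numbers is made up, and the RODATA and CODE segment names are assumed to exist in the linker .cfg file.

```asm
.p816                   ; assemble for the 65816
.smart                  ; follow REP/SEP to track register sizes

.segment "RODATA"
Numbers:                ; hypothetical label
    .byte $12           ; one byte: $12
    .word $1234         ; two bytes: $34 then $12 (low byte first)

.segment "CODE"
.a8                     ; from here on, assume an 8-bit A
.i16                    ; ...and 16-bit X and Y
    lda #$12            ; assembled with a 1-byte immediate value
    ldx #$1234          ; assembled with a 2-byte immediate value
```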

There are many other directives. Here are some important ones…

.include "filename.asm"

to include an assembly language file in another file.

.incbin "filename.chr"

to include a binary (i.e. data) file in an assembly file. This example, CHR, is a graphics file.
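A common pattern is to put a label before and after the included data, so the assembler can work out its size. A minimal sketch, assuming a RODATA segment in the .cfg file and using the made-up names Tiles, Tiles_End and TILES_SIZE:

```asm
.segment "RODATA"
Tiles:
    .incbin "filename.chr"          ; the raw bytes of the file are copied into the ROM
Tiles_End:

TILES_SIZE = Tiles_End - Tiles      ; size of the data in bytes
```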


65816 specific precautions

The most important thing to be careful with is register size. Your code needs REP and SEP commands to change the register size (I use macros called A8, A16, XY8, XY16, AXY8, and AXY16). If you have .smart at the top of the code, the assembler will automatically adjust the assembly to the correct register size when it sees a REP or SEP that affects the register size flags… but it is a good idea to put the explicit directives in at the top of each function. We need to make sure that the function above it doesn’t leave the assembler assuming the wrong register sizes. Those directives are .a8 .i8 .a16 and .i16.
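Macros like those could be defined along these lines. This is a sketch based only on what each name suggests (a SEP or REP with the matching flag bits); the definitions in the actual example code may differ.

```asm
.macro A8
    sep #$20            ; set the M flag: 8-bit A
.endmacro

.macro A16
    rep #$20            ; clear the M flag: 16-bit A
.endmacro

.macro XY8
    sep #$10            ; set the X flag: 8-bit X and Y
.endmacro

.macro XY16
    rep #$10            ; clear the X flag: 16-bit X and Y
.endmacro

.macro AXY8
    sep #$30            ; 8-bit A, X and Y
.endmacro

.macro AXY16
    rep #$30            ; 16-bit A, X and Y
.endmacro
```

Because each macro expands to a real SEP or REP instruction, .smart still sees it and adjusts the assembly to match.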

Just to clarify– .a8 is an assembler directive to change the assembly output. A8 is a macro that will output a SEP #$20, which (when executed) will set the CPU into 8-bit Accumulator mode. .smart will see the SEP #$20 and automatically set the assembly output to 8-bit. But there are still possible errors, for example, something like this…

  A16 ;set A to 16 bit mode
  lda controller1
  and #KEY_B
  beq Next_Bit
  A8 ;set A to 8 bit mode
  lda #2
  sta some_variable

What do you think would happen? The assembler will think everything below A8 has the A register in 8 bit mode, including everything below Next_Bit, even though the beq could branch there with the processor still in 16 bit mode. This could crash or create unusual bugs. So, you should put an A16 directly after the Next_Bit label, to ensure registers are in a consistent size.
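A corrected version of that snippet might look like this. It reuses the same names as the snippet (controller1, KEY_B, some_variable and the A8 / A16 macros), which are assumed to be defined elsewhere.

```asm
    A16                 ; set A to 16 bit mode
    lda controller1
    and #KEY_B
    beq Next_Bit
    A8                  ; set A to 8 bit mode
    lda #2
    sta some_variable
Next_Bit:
    A16                 ; both paths now continue with a 16 bit A
    lda controller1     ; ...and the assembler knows it too
```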

Also, you might want to bookend many of your subroutines with php (at the start) and plp (at the end) if the subroutine changes the processor size in any way. This will ensure that it returns safely from the subroutine with the exact processor size that it arrived with.
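A minimal sketch of that bookending follows. The subroutine name Clear_Flag_Byte and the address $0100 are placeholders for illustration.

```asm
.p816
.smart

.segment "CODE"
Clear_Flag_Byte:        ; hypothetical subroutine
    php                 ; save the caller's register sizes (and the other flags)
    sep #$20            ; this routine wants an 8-bit A
    lda #0
    sta $0100           ; hypothetical variable address
    plp                 ; put the register sizes back the way they were
    rts
```

Note that the assembler does not track what plp restores, so whatever follows a plp should still state its register sizes explicitly.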

Alternatively, you could try to keep a consistent register size for most of your code. For example, keep the A register 8 bit and the XY registers 16 bit… or perhaps keep all registers 16 bit for 90% of the code. An approach like that would reduce REP / SEP changes and leave fewer potential register size bugs.

If the subroutine changes any other registers (such as the data bank register B) you should also push that to the stack at the beginning of the subroutine and restore it at the end.

It is common to have data, and the code that manages that data, in the same bank. An easy way to set the data bank register to the same bank that the code is executing in is PHK (push program bank) then PLB (pull data bank). I have seen code that jumps to another bank do this, to save/restore the original data bank settings…

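PHB ;save the current data bank
PHK
PLB ;data bank = program bank
JSR code
PLB ;restore the original data bank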

But, maybe we don’t need to do that at EVERY subroutine. The overhead would be quite tedious and slow.


Let’s review the linker file. lorom256k.cfg.

# Physical areas of memory
MEMORY {
    ZEROPAGE: start = $000000, size = $0100;
    BSS: start = $000100, size = $1E00;
    BSS7E: start = $7E2000, size = $E000;
    BSS7F: start = $7F0000, size = $10000;
    ROM0: start = $808000, size = $8000, fill = yes;
    ROM1: start = $818000, size = $8000, fill = yes;
    ROM2: start = $828000, size = $8000, fill = yes;
    ROM3: start = $838000, size = $8000, fill = yes;
    ROM4: start = $848000, size = $8000, fill = yes;
    ROM5: start = $858000, size = $8000, fill = yes;
    ROM6: start = $868000, size = $8000, fill = yes;
    ROM7: start = $878000, size = $8000, fill = yes;
}

# Logical areas code/data can be put into.
SEGMENTS {
    # Read-only areas for main CPU
    CODE: load = ROM0, align = $100;
    RODATA: load = ROM0, align = $100;
    SNESHEADER: load = ROM0, start = $80FFC0;
    CODE1: load = ROM1, align = $100, optional=yes;
    RODATA1: load = ROM1, align = $100, optional=yes;
    CODE2: load = ROM2, align = $100, optional=yes;
    RODATA2: load = ROM2, align = $100, optional=yes;
    CODE3: load = ROM3, align = $100, optional=yes;
    RODATA3: load = ROM3, align = $100, optional=yes;
    CODE4: load = ROM4, align = $100, optional=yes;
    RODATA4: load = ROM4, align = $100, optional=yes;
    CODE5: load = ROM5, align = $100, optional=yes;
    RODATA5: load = ROM5, align = $100, optional=yes;
    CODE6: load = ROM6, align = $100, optional=yes;
    RODATA6: load = ROM6, align = $100, optional=yes;
    CODE7: load = ROM7, align = $100, optional=yes;
    RODATA7: load = ROM7, align = $100, optional=yes;

    # Areas for variables for main CPU
    ZEROPAGE: load = ZEROPAGE, type = zp, define=yes;
    BSS: load = BSS, type = bss, align = $100, optional=yes;
    BSS7E: load = BSS7E, type = bss, align = $100, optional=yes;
    BSS7F: load = BSS7F, type = bss, align = $100, optional=yes;
}


The memory area defines several RAM areas. Then it defines 8 ROM areas: ROM0, ROM1, etc. Notice they all start at xx8000 and are all $8000 bytes (32kB). This is typical for LoROM mapping. In LoROM, the ROM is always mapped to the $8000-FFFF area of a bank. The $0000-7FFF area of those banks almost always mirrors the following…

$0-1FFF LoRAM (mirror of 7e0000-7e1fff)

$2000-$4FFF Hardware registers

In LoROM, we have access to these almost all the time with regular addressing modes.

The alternative is called HiROM, which can have ROM banks extend from $0000-FFFF. This doubles the maximum size of ROM, but makes access to LoRAM and Hardware Registers more awkward. This tutorial won’t be using HiROM.

You might notice that the bank is $80 instead of $00. $80 is a mirror of $00 (they access the same memory), but ROM accesses in banks $80 and up can be faster than in bank $00. (You also need to change a hardware setting in the $420d register, and should indicate FastROM type in the SNES header.) The game will reset into the $00 bank, and we need to jump long to the $80 bank to speed it up slightly.
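In code, that switch could look roughly like this. It is a sketch under a few assumptions: the labels Reset and Fast_Start are made up, the CODE segment is placed in bank $80 by the .cfg file, and the real init code in the example files may do things in a different order.

```asm
.p816
.smart

.segment "CODE"
Reset:                  ; the CPU arrives here in bank $00, in emulation mode
    sei                 ; no interrupts during setup
    clc
    xce                 ; switch to native mode
    jml Fast_Start      ; long jump: the linker gives this label a bank $80 address
Fast_Start:
    sep #$20            ; 8-bit A
    lda #$01
    sta $420d           ; bit 0 set = fast ROM access in banks $80 and up
```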

On a side note, a 256kB ROM size is actually unusually small. 512 and 1024 would be more standard, and you should be able to double the size of the test ROMs with no trouble. Just double the number of ROM banks. LoROM should be able to go up to 2048 kB also (I believe HiROM goes up to 4096 kB).


Ok. Some real code next time.

What you need, SNESdev

Before we start actually programming for the SNES, you will need a few things.

  1. An assembler
  2. A tile editor
  3. Photoshop or GIMP
  4. a text editor
  5. a good debugging emulator
  6. a tile arranging program
  7. a music tracker

65816 Assembler

I use ca65. It was designed for 6502, but it can assemble 65816 also. I am very familiar with it, and that is the main reason I use it. There is also WLA (which some other code examples and libraries use) and ASAR (which the people at SMWcentral use). For spc700 (which is another assembly language entirely) you could use the BASS assembler, by byuu.

(Click on Windows snapshot)

Why not use the cc65 C compiler? It doesn’t produce 65816 assembly. The code generated is totally inappropriate. There is the tcc816 C compiler, which works with PVSnesLib. It compiles to assembly for the WLA assembler. Frankly, I just didn’t feel like learning these tools. But they are here, if you are interested.

Link to the bass assembler.

These are command line tools. If you are not familiar with using command line tools, check out this link to get up to speed. In Windows 10, I have to click the address bar and type CMD (press enter) to open up a command line prompt. Watch a few of these tutorials to get the basics.

You might notice that I use batch files (compile.bat) to automate command line writes. You could use these or makefiles (which are a bit more complicated), to simplify the assembly process.


Tile Editor

I prefer YY-CHR for most of my graphics editing. For 16 color SNES, change the graphic format to “4bpp SNES…”. For 4 color SNES, change the graphic format to “2bpp GB”. The Game Boy uses the same bitplane format as the SNES.

The .NET version of YY-CHR has been improved, and can even do 8bpp SNES formats. Here’s the current link for the better version.

Another very good app is called Superfamiconv. It is a command line tool for converting indexed PNG (with no compression) to CHR files (SNES graphic formats). It also makes palettes and map files. I don’t understand everything it can do, but you could use it to convert your pictures to SNES format without needing YY-CHR.

The command line options are a bit complex, but it really does a fantastic job.


Photoshop or GIMP

GIMP is sort of a free image tool like Photoshop. You can use any similar tool. If you convert to indexed color mode, reduce to 16 colors. You can cut and paste directly to YY-CHR in 4bpp SNES format. YY-CHR frequently screws up the indexing order, so likely you would need to use the color replace button to fix that.


Also, for 2bpp graphics, mode/indexed to 4 colors. Cut and paste to YY-CHR in 2bpp GB format.

Alternatively, you can save an indexed file to PNG (with NO compression), and then process it with Superfamiconv, which can also make maps and palettes. The palette in YY-CHR (3 byte RGB) is not at all like the SNES system palette (2 byte BGR), and they won’t work interchangeably (see M1TE below). I actually think Superfamiconv does a much better job than any other method.

Or you could draw the graphics in a tile editor and skip GIMP altogether.


Text Editor

I use Notepad++. You could use any text editor, even plain old Notepad. You need to write your assembly source code with a text editor.


Debugging Emulator

I have used several emulators in the past. This year (2020) the emulator to use is MESEN-S. It is brand new, but it blows the other emulators away in terms of useful tools. There isn’t an official website yet, but download the most recent release from here…

It has a Debugger with disassembly and breakpoints. Event viewer. Hex editor / memory viewer. Register viewer. Trace logger. Assembler. Performance Profiler. Script Window. Tilemap Viewer. Tile Viewer. Sprite Viewer. Palette Viewer. SPC debugger. I might write an entire page just on this emulator. It’s cool.

One note, for a developer. Make sure when you rebuild a file, that you don’t select it from the picture that pops up when you open MESEN-S, but rather always select the file from File/Open. Otherwise, it will auto-load the savestate, which is the old file before it was reassembled.

I also like to change the keyboard input settings. For some reason the default has MULTIPLE settings mapped at the same time, and none of them are exactly what I would choose. So go to Option/Input/Setup, click on each Key Setup and clear them all (clear key bindings button), and then manually set a keyboard key for each button. I like ASZX for the YXBA buttons and the arrow keys for the direction pad.


Tile Arranger

If this was the NES tutorial, I would point you to download NES Screen Tool. Nothing like that exists for the SNES, so I have been trying to make my own tools. They are not 100% finished (still only 8×8 tiles), but they are at least usable. M1TE (mode 1 tile editor) is for creating background maps (and palettes and tiles). SPEZ is for creating meta sprites that work with my own code system (Easy SNES). You may not need SPEZ, but definitely download M1TE.

One main benefit of M1TE is palette editing and conversions. It can load a YY-CHR style palette and output a SNES format palette. And the reverse. Remember not to give the SNES palette file the same name as your CHR file with a .pal extension, or YY-CHR will auto-load it as an RGB palette, and fail.



Perhaps in the future I will make other modes… Mode 3 or Mode 7.

I also use Tiled Map Editor for creating data for games. You might find it useful, even if this tutorial won’t cover that.


Music Tracker

I have been working with the SNES GSS tracker and system written by Shiru. I have been told there was a bug in the code that causes games to freeze. You might want to download the tracker from my repo, which has been patched to fix the bug. (it’s the snesgssP.exe file).

and use the music.asm file here, since the original was written to work with tcc-816 and WLA.

I think I got the original from here.

…although it was written by Shiru, and not this gentleman.


You may want to use another music system. This one can NOT use BRR samples from any other source. It can only make its own samples from 16 bit mono WAV files. I haven’t tried any other trackers / music drivers yet, so I’m not an expert.


I think that’s enough for today. Next time, we can discuss using the ca65 assembler.


Further in 65816

I wrote some 6502 ASM tutorials a while back.

26. ASM Basics

Feel free to check them out (5 pages total). You can test various things with this online 6502 emulator…

All the information here will transfer perfectly toward 65816 programming. Stay on this until you understand it, before moving on to any more.

Or, if you prefer video tutorials…


Opcodes References 6502


Quick Explanation of 65816 ASM

I will just cover some basics, and then mention the differences between 6502 and 65816.

Data transfer.

You need to load data to a register to move it. Any of the registers can do this.

LDA $1000 ; load A from address $1000
STA $800 ; store/copy A to address $800

LDX $1000 ; load X from address $1000
STX $800 ; store/copy X to address $800

LDY $1000 ; load Y from address $1000
STY $800 ; store/copy Y to address $800

Depending on the register size, each of these would move 1 byte or 2. If it moved 2 bytes, it would get the lower byte from $1000 and the upper byte from $1001.
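Here is the same load and store done both ways, as a small sketch. The addresses are the ones used above, and the data bank is assumed to be one where they point at RAM.

```asm
.p816
.smart

.segment "CODE"
    sep #$20            ; 8-bit A
    lda $1000           ; reads 1 byte, from $1000
    sta $0800           ; writes 1 byte, to $0800

    rep #$20            ; 16-bit A
    lda $1000           ; reads 2 bytes: low byte from $1000, high byte from $1001
    sta $0800           ; writes 2 bytes: low byte to $0800, high byte to $0801
```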

Note, you can write comments in ASM with a ; semicolon. Everything after the semicolon is ignored by the assembler.


Addressing modes.

Depending on how the LDA is written in assembly, you can perform multiple kinds of operations.

Direct Page

(similar to the zero page from 6502)

LDA $12 – load A from the direct page address $12. If direct page register is $0000 this will load A from $000012 (direct page is always in the $00 bank).


Absolute

LDA $1234 – loads A from the address $1234, in the bank defined by the Data Bank Register. If the Data Bank is $00… will load A from $001234.

Absolute Long

LDA $123456 – loads A from address $3456 in the $12 bank.


Immediate

LDA #$12 – loads A with the value $12. Always needs a preceding #. Might be an 8 bit or a 16 bit value depending on the mode of A.

Direct Page Indexed

Indexed modes are for arrays of bytes, using index registers. Direct page is always in bank zero.

LDA $12, X – same as direct page, but the X register is added to the address number. If X is $10, this would load A from the address $22.

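LDX $12, Y – same, but the Y register is added to the address number. (Only LDX and STX have a direct page, Y mode. There is no LDA version; if you write LDA $12, Y an assembler will either reject it or use absolute, Y instead.)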

(X and Y are NOT restricted to 8 bit, and can extend $ffff bytes forward without wraparound, except that the final address bank will be $00 for direct page)

Absolute Indexed

LDA $1234, X – same as absolute, but the X register is added to the address number.

LDA $1234, Y – same as absolute, but the Y register is added to the address number.

(X and Y don’t wrap, and if address + X > $ffff it will temporarily increase the data bank byte to extend into the next bank. This is true of every indexed mode except for the direct page indexed.)

Absolute Indexed Long 

LDA $123456, X – same as absolute long, but the X register is added to the address number. (only X can do this mode)


Indirect

This is how pointers work on the 6502 (65816) CPU. The pointer is stored in 2 consecutive direct page addresses.

LDA ($12) – $12 is an address in the Direct Page. It takes a byte from $12 (lower byte) and $13 (upper byte) to construct an address, then the bank byte from data bank register, and then loads from that address. If $12 = $00 and $13 = $80, then this would load A with the value at address $018000 (if the data bank is $01).

Indirect Long

Like Indirect, but 3 consecutive bytes are stored in the Direct Page to construct a long address. Low byte, High byte, then Bank byte.

LDA [$12] – If $12 = $00 and $13 = $80 and $14 = $02, loads A from the value at address $028000.

Indirect, Y

LDA ($12), Y – same as Indirect, but the Y register is added to the indirect address to get the final address that A is loaded from.

Indirect Long, Y

LDA [$12], Y – same as Indirect Long, but the Y register is added to the indirect long address to get the final address that A is loaded from.

Indirect, X

This is for an array of pointers. Each pointer (2 bytes each) is in the Direct Page, and you will need to increase X by 2 to switch between them.

LDA ($12, X) – Let’s say X is 2, so we don’t want to look at $12 and $13, but rather $14 and $15. $14 = 00, $15 is $80, and the data bank is $01. This will load A with the value at address $018000.
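To make the pointer idea concrete, here is a sketch that sets up a 2-byte pointer in the direct page and reads through it with Indirect, Y. The names ptr and Table are invented, and it assumes the data bank register is set to the bank that Table is in.

```asm
.p816
.smart

.segment "ZEROPAGE"
ptr: .res 2                 ; hypothetical 2-byte pointer

.segment "RODATA"
Table:                      ; hypothetical data
    .byte 10, 20, 30, 40

.segment "CODE"
    rep #$30                ; 16-bit A and X/Y
    lda #.loword(Table)     ; the low 16 bits of the table's address
    sta ptr                 ; low byte goes to ptr, high byte to ptr+1
    sep #$20                ; 8-bit A, since the data is bytes
    ldy #2
    lda (ptr), y            ; loads the third byte of Table (30)
```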



Changes in the 65816 (from 6502)

** If you don’t understand all these things, don’t worry. You can always come back to it later, as these things come up. I frequently have to check the WDC manual to be reminded of all the details of each instruction, and I’ve been doing this 10 years. **

Zero page has been replaced with direct page, which is movable by changing the DP register. Just keep it $0000 for most purposes.

The hardware stack is no longer fixed. It can be any address in the zero bank. (On the SNES, it should be set to $1fff at the start of the program.)

The A, X, and Y registers can be 8 or 16 bits. See SEP / REP below.

Many operations can now be 8 or 16 bits depending on the size of the A register. ADC, AND, ASL, BIT, CMP, DEC, EOR, INC, LDA, LSR, ORA, PHA, PLA, ROL, ROR, SBC, STA, STZ, TRB, and TSB… are all dependent on the size of the A register.

BRK has its own vector. Could be used for software purposes or debugging.



Long addressing

(can’t be Y)
ADC long
ADC long, X
AND long
AND long, X
CMP long
CMP long, X
EOR long
EOR long, X
JMP long aka JML
JSR long aka JSL (also RTL return long)
LDA long
LDA long, X
ORA long
ORA long, X
SBC long
SBC long, X
STA long
STA long, X

Store Zero

Stores zero at an address without changing A. (1 or 2 bytes depending on size of A)
STZ dp
STZ dp, X
STZ absolute
STZ absolute, X
(can’t do long)


BRA branch always
BRL branch always long (2 bytes, signed)
(don’t use BRL, just do JMP. BRL is for a system that might load a program anywhere in the RAM, relocatable code. Not really for the SNES.)

also new…

JMP (absolute, X) . for an array of function pointers (a jump table), using X to switch between the different indirect jump addresses. The table is read from the current program bank, not from the direct page.

JMP [absolute] . jump indirect long. Like the JMP (absolute) indirect jump instruction, but 3 bytes long to make a long address to jump to.
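A jump table with JMP (absolute, X) might be sketched like this. All the names (game_state, Run_State, Jump_Table and the three State_ routines) are made up for the example, and the .loword() is there only to be explicit that a 16-bit address is wanted.

```asm
.p816
.smart

.segment "BSS"
game_state: .res 2              ; hypothetical variable holding 0, 1 or 2

.segment "CODE"
Run_State:
    rep #$30                    ; 16-bit A and X/Y
    lda game_state
    asl a                       ; each table entry is 2 bytes
    tax
    jmp (.loword(Jump_Table), x)

Jump_Table:                     ; must be in the same bank as the jmp
    .addr State_Title, State_Play, State_Pause

State_Title:
    rts
State_Play:
    rts
State_Pause:
    rts
```

Since this uses jmp, the rts at the end of each state routine returns to whoever called Run_State with jsr.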


INC and DEC are now available for the A register.

dec A . is the same as A = A – 1
inc A . is the same as A = A + 1

Indirect with or without Y Index

(dp means that the pointer needs to be located in the direct page)
ADC (dp) . . ADC (dp), Y
AND (dp) . . AND (dp), Y
CMP (dp) . . CMP (dp), Y
EOR (dp) . . EOR (dp), Y
LDA (dp) . . LDA (dp), Y
ORA (dp) . . ORA (dp), Y
SBC (dp) . . SBC (dp), Y
STA (dp) . . STA (dp), Y

Indirect Long and Indirect Long Indexed

With or without Y indexing

ADC [dp] . . ADC [dp], Y
AND [dp] . . AND [dp], Y
CMP [dp] . . CMP [dp], Y
EOR [dp] . . EOR [dp], Y
LDA [dp] . . LDA [dp], Y
ORA [dp] . . ORA [dp], Y
STA [dp] . . STA [dp], Y
SBC [dp] . . SBC [dp], Y


To set register size, we use REP or SEP (reset processor flag, set processor flag).
REP #$20 set A 16 bit
SEP #$20 set A 8 bit
REP #$10 set XY 16 bit
SEP #$10 set XY 8 bit
or combine them…
REP #$30 set AXY 16 bit
SEP #$30 set AXY 8 bit

(REP and SEP can be used to change other processor status flags).

(note the # for immediate addressing)

Transfers between registers.

now include
TXY – transfer x to y
TYX – transfer y to x
TCS – transfer A register to stack pointer
TSC – transfer stack pointer to A register

Size mismatch from transfers between A and index registers X or Y. Think about the destination size, that will tell you how many bytes will transfer.
A8 -> X16 or Y16 transfers 2 bytes. Remember that when A is 8 bit, its high byte still exists.
A16 -> X8 or Y8 transfers 1 byte
X8 or Y8 -> A16 transfers 2 bytes, and the upper byte of A is zeroed. XY in 8 bit always have zero as their upper byte.
X16 or Y16 -> A8 transfers 1 byte, the upper byte of A unchanged
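Two of those cases, written out as a sketch with made-up values so you can follow what ends up where:

```asm
.p816
.smart

.segment "CODE"
    rep #$30            ; 16-bit A, 16-bit X/Y
    lda #$1234
    sep #$20            ; 8-bit A: A reads as $34, the hidden upper byte still holds $12
    tax                 ; the destination X is 16 bit, so X = $1234

    rep #$20            ; 16-bit A
    sep #$10            ; 8-bit X/Y (their upper bytes become zero)
    lda #$ABCD
    ldx #$56
    txa                 ; the destination A is 16 bit, so A = $0056
```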

Stack Relative

You would push variables to the stack before calling a jsr or jsl.
The stack pointer always points to 1 less than the last value pushed, so start counting from 1. If you JSR to a function, then add 2 more. If you JSL to a function, then add 3 more.

ADC sr, S
AND sr, S
CMP sr, S
EOR sr, S
LDA sr, S
ORA sr, S
SBC sr, S
STA sr, S

Example… STA 1, S
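Here is a sketch of passing one 16-bit argument on the stack to a subroutine called with JSR. The names Caller and Add_One are hypothetical.

```asm
.p816
.smart

.segment "CODE"
Caller:
    rep #$30            ; 16-bit A and X/Y
    pea $1234           ; push a 16-bit argument
    jsr Add_One         ; this pushes a 2-byte return address on top of it
    pla                 ; pull the (now changed) argument: A = $1235
    rts

Add_One:                ; hypothetical subroutine
.a16
    lda 3, s            ; 1,s and 2,s are the return address, so the argument starts at 3,s
    inc a
    sta 3, s            ; write the result back over the argument
    rts
```

With JSL instead of JSR, the return address would be 3 bytes, and the argument would start at 4, S.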

Stack Relative Indirect

Push a pointer to an array to the stack. Index that array with Y.
ADC (sr, S), Y
AND (sr, S), Y
CMP (sr, S), Y
EOR (sr, S), Y
LDA (sr, S), Y
ORA (sr, S), Y
SBC (sr, S), Y
STA (sr, S), Y

Block Moves

To copy a chunk of bytes from one memory area to another. MVN Block Move Next and MVP Block Move Previous.

MVN copies forward, starting from the first byte of the block, and MVP copies backward, starting from the last byte. This only matters if the source and destination overlap: use MVN when the destination is at a lower address than the source, and MVP when the destination is at a higher address. For MVN, X holds the start address of src and Y holds the start address of dest, and A (always 16 bit, regardless of size of A) holds the # of bytes to transfer minus 1. For MVP, X holds the end address of the src block and Y holds the end address of the dest block.

Just use MVN, it’s easier to use.

The byte order in the binary is opposite of what the standard syntax indicates, so I tend to use a macro to handle this, because it’s confusing. And there was a change in ca65 source code which reverses the order, so code will break if you use the wrong version of ca65 (grumble).

MVN src bank, dest bank
MVP src bank, dest bank

The registers should be 16 bit before using MVN or MVP. Also, they have an annoying issue, where they will overwrite the data bank register, so it is probably a good idea to push that register to the stack before MVN/MVP and restore it (pull it from the stack) after the MVN/MVP procedure.
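One way to sidestep the operand order problem is a macro that writes the instruction bytes directly. The sketch below does that; the macro name BLOCK_MOVE_NEXT, the subroutine name and the addresses are all made up for illustration, and this is not necessarily the macro used in the example code.

```asm
.p816
.smart

; Emits MVN by hand: opcode $54, then the DEST bank, then the SOURCE bank.
.macro BLOCK_MOVE_NEXT src_bank, dst_bank
    .byte $54, dst_bank, src_bank
.endmacro

.segment "CODE"
Copy_Block:                     ; hypothetical subroutine
    php                         ; save the register sizes
    phb                         ; MVN changes the data bank register
    rep #$30                    ; 16-bit A and X/Y
    ldx #$2000                  ; source address (inside bank $7E)
    ldy #$0000                  ; destination address (inside bank $7F)
    lda #$0100 - 1              ; number of bytes to copy, minus 1
    BLOCK_MOVE_NEXT $7E, $7F    ; copy 256 bytes from $7E2000 to $7F0000
    plb                         ; restore the data bank register
    plp
    rts
```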

Push to stack

PEA address
PEI (dp)
PER relative-address

PEA is called push effective “address”, but it really just pushes a 16 bit value to the stack without using a register. It doesn’t have to be an address. It is very useful for any 16 bit immediate push to the stack. You don’t need to change a register size either, it always pushes a 16 bit value.

PEI pushes a value stored on the direct page (16 bit) to the stack.

PER was designed for a computer system that can load a program anywhere in the RAM and run it… relocatable code. It isn’t really useful for SNES. It pushes an address from the same bank, at a 16 bit relative distance from this instruction. After pushing the value or address to the stack, you could use stack relative addressing, or pull it into a register.

Pushing / pulling the new registers

PHB – push data bank register to stack
PHD – push direct page register to stack
PHK – push program bank register to stack
PHX – push X register to stack
PHY – push Y register to stack
PLB – pull from stack to data bank register
PLD – pull from stack to direct page register
PLX – pull from stack to X register
PLY – pull from stack to Y register

Transfers with A

(always copies 16 bits regardless of size of A)
TCD – transfer from A to direct page register
TCS – transfer from A to stack pointer
TDC – transfer from direct page register to A
TSC – transfer from stack pointer to A

Test and Set Bits / Test and Reset Bits

TRB dp
TRB address
TSB dp
TSB address

TRB, test and reset bits. A register (8 or 16 bits) has the bits to change. If a bit in A is 1 it will be zeroed at the address location. If a bit in A is 0 it remains unchanged.

TSB, test and set bits. A register (8 or 16 bits) has the bits to change. If a bit in A is 1 it will be set (1) at the address location. If a bit in A is 0 it remains unchanged.

There is also a testing operation, as if the value in A was ANDed with the value at the address: the z flag is set if A AND the (original) value at the address would equal zero. This is unrelated to the setting or resetting operation.
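A short sketch of both instructions in use. The variable name flags and the label Bit_Was_Clear are hypothetical, and the data bank is assumed to be one where the BSS variable is reachable.

```asm
.p816
.smart

.segment "BSS"
flags: .res 1           ; hypothetical variable

.segment "CODE"
    sep #$20            ; 8-bit A
    lda #%00000101
    tsb flags           ; set bits 0 and 2 of flags, leave the other bits alone
    lda #%00000001
    trb flags           ; clear bit 0 of flags, leave the other bits alone
    beq Bit_Was_Clear   ; z is set only if bit 0 was already 0 before the trb
    ; ...code for when bit 0 had been set...
Bit_Was_Clear:
```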


COP – jump to COP vector (for a coprocessor routine)
XBA – swap high and low bytes of A (works even if A is 8 bit)
XCE – exchange the carry flag with the emulation flag (to switch the CPU between emulation and native modes)
STP – stops the CPU, only reset will start it again. Don’t use this.
WAI – wait till interrupt, halts the CPU until IRQ or NMI trigger.
WDM # – does nothing, but useful for debugging. Followed by a number, which could be used to locate where you are in the code (in a debugger).


Some more links, to other descriptions of 65816 ASM

And the WDC manual, again, for reference. It’s very big, and explains everything.




65816 Basics

Programming the SNES in assembly, using the ca65 assembler.

Assembly Language Basics

Assembly is a low level programming language. We have to think at the basic level that the CPU processes the binary code. Let’s review binary, and hexadecimal numbers.

Number Systems

Binary. Under the hood, all computers process binary numbers. A series of 1s and 0s. In the binary system, each column is 2x the value of the number to the right.

0001 = 1
0010 = 2
0100 = 4
1000 = 8

You then add all the 1’s up

0011 = 2+1 = 3
0101 = 4+1 = 5
0111 = 4+2+1 = 7
1111 = 8+4+2+1 = 15

Each of these digits is called a bit. Typically, there are 8 bits in a byte. So you can have numbers from
0000 0000 = 0
1111 1111 = 255

Since it is difficult to read binary, we will use hexadecimal instead. Hexadecimal is a base 16 numbering system. Every digit is 16x the number to the right. We use the normal numbers from 0-9 and then letters A-F for the values 10,11,12,13,14,15. In many assembly languages, we use $ to indicate hex numbers.

$0 = 0
$1 = 1
$2 = 2
$3 = 3
$4 = 4
$5 = 5
$6 = 6
$7 = 7
$8 = 8
$9 = 9
$A = 10
$B = 11
$C = 12
$D = 13
$E = 14
$F = 15

$F is the same as binary 1111.

The next column of numbers is multiples of 16.

$00 = 16*0 = 0       $80 = 16*8 = 128
$10 = 16*1 = 16      $90 = 16*9 = 144
$20 = 16*2 = 32      $A0 = 16*10 = 160
$30 = 16*3 = 48      $B0 = 16*11 = 176
$40 = 16*4 = 64      $C0 = 16*12 = 192
$50 = 16*5 = 80      $D0 = 16*13 = 208
$60 = 16*6 = 96      $E0 = 16*14 = 224
$70 = 16*7 = 112     $F0 = 16*15 = 240

$F0 is the same as binary 1111 0000.
add that to $0F (0000 1111) to get
$FF = 1111 1111

So you see, you can represent 8 bit binary numbers with 2 hex digits. From $00 to $FF (0 – 255).

To get the assembler to output the value 100 you could write…

.byte 100

or

.byte $64


16 bit numbers

Typically (on retro systems) you use 16 bit numbers for memory addresses. Memory addresses are locations where pieces of information can be stored and read later. So, you could write a byte of data to address $1000, and later read from $1000 to get that data.

The registers on the SNES can be set to either 8 bit or 16 bit modes. 16 bit mode means it can move information 16 bits at a time, and process the information 16 bits at a time. 16 bit registers means that it will read a byte from an address, and another from the address+1. Same with writing 16 bits. It will write (low order byte) to the address and (high order byte) to address+1.
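To make the low byte / high byte order concrete, here is a rough sketch in C (not 65816 code). The memory array and the write16 / read16 names are made up for illustration, and the array simply wraps at $FFFF, which is a simplification.

```c
#include <stdint.h>

/* Hypothetical 64 kB array standing in for one bank of memory. */
static uint8_t memory[0x10000];

/* Write a 16 bit value: low byte to addr, high byte to addr+1. */
static void write16(uint16_t addr, uint16_t value)
{
    memory[addr] = (uint8_t)(value & 0xFF);
    /* the cast keeps addr+1 inside this toy array (a simplification) */
    memory[(uint16_t)(addr + 1)] = (uint8_t)(value >> 8);
}

/* Read a 16 bit value back from addr and addr+1. */
static uint16_t read16(uint16_t addr)
{
    return (uint16_t)(memory[addr] | (memory[(uint16_t)(addr + 1)] << 8));
}
```

Writing $1234 to $1000 with write16 leaves $34 at $1000 and $12 at $1001, and read16 puts the two bytes back together.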

In binary, a 16 bit value can go from
0000 0000 0000 0000 = 0
1111 1111 1111 1111 = 65535

In hex values, that’s $0000 to $FFFF.

Let’s say we have the value $1234. The 12 is the most significant byte (MSB), and the 34 is the least significant byte (LSB). To calculate its value by hand, we multiply each digit by its power of 16 (1, 16, 256, 4096).

4 x 1 = 4
3 x 16 = 48
2 x 256 = 512
1 x 4096 = 4096
4096 + 512 + 48 + 4 = 4660
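The same calculation can be sketched in C. This is only an illustration of the arithmetic, and hex_digit and hex_to_value are names I made up. Multiplying the running total by 16 at each step is what gives the columns their weights.

```c
#include <stdint.h>

/* Value of one hex digit character, or -1 if it is not a hex digit. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Convert a string such as "1234" (written without the $) to its value.
   Stops at the first character that is not a hex digit. */
static uint32_t hex_to_value(const char *s)
{
    uint32_t total = 0;
    while (hex_digit(*s) >= 0) {
        total = total * 16 + (uint32_t)hex_digit(*s);
        s++;
    }
    return total;
}
```

For example, hex_to_value("1234") works out to 4660, the same as the hand calculation.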

To output a 16 bit value $ABCD, you could write

.word $ABCD
(outputs $cd then $ab, little endian style)

Don’t forget the $.

We can also get the lower byte or upper byte of a 16 bit value using the < and > symbols before the value.

Let’s say label List2 is at address $1234

.byte >List2
will output a $12 (the MSB)

.byte <List2
will output a $34 (the LSB).


24 bit numbers

We can now access addresses beyond $ffff. There is a byte above the 16 bit address called the “bank byte”. We can access values in any bank either by using long 24 bit addressing modes, or by changing the data bank register and then using regular 16 bit addressing. Here is an example of a 24 bit operation.

LDA f:$7F0000
will read a byte from address $0000 of the $7F bank (part of the WRAM).

In ca65, the f: is to force 24 bit values from the symbol / label. The assembler will calculate the correct values. (to force 16 bit you use a: and to force 8 bit you use z:)

JML $018000
will jump to address $8000 in bank $01.


To output a 24 bit value
.faraddr $123456
(outputs $56…$34…$12)

Or you could do this, to output a byte at a time.
.byte ^$123456
(outputs $12)
.byte >$123456
(outputs $34)
.byte <$123456
(outputs $56)
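If it helps, the three operators behave like shifts and masks. Here is a small C sketch of that idea (the function names are mine, not part of ca65).

```c
#include <stdint.h>

/* like <value : the low byte */
static uint8_t low_byte(uint32_t v)  { return (uint8_t)(v & 0xFF); }

/* like >value : the high byte */
static uint8_t high_byte(uint32_t v) { return (uint8_t)((v >> 8) & 0xFF); }

/* like ^value : the bank byte */
static uint8_t bank_byte(uint32_t v) { return (uint8_t)((v >> 16) & 0xFF); }
```

With the value $123456, these give $56, $34 and $12, matching the .byte examples above.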

But we don’t want to write our program entirely using byte statements. That would be crazy. We will use assembly language, and the assembler will convert our three letter mnemonics into bytes for us.

LDA #$12
(load the A register with the value $12)

will be converted by the assembler into this machine code that the 65816 CPU can execute…

$A9 $12


65816 CPU Details

There are 3 registers to work with

A (the accumulator) for most calculations and purposes

X and Y (index registers) for accessing arrays and counting loops.

A, X, and Y can be set to either 8 bit or 16 bit. The accumulator is sometimes called C when it is in 16 bit mode. Setting the Accumulator to 8 bit does not destroy the upper byte, you can access it with XBA (swap high and low bytes). However, setting the Index registers to 8 bit will clear the upper bytes of X and Y to zero.

There is a 16-bit stack pointer (SP or S) for the hardware stack. If you call a function (subroutine) it will store the return address on the stack, and when the function ends, it will pop the return address back to continue the main program. The stack always exists on bank zero (00). The stack grows downward, as things are added to it.

Processor Status Flags (P) are used to determine if a value is negative, zero, greater/lesser/equal to, etc. They are used to control the flow of the program, like if/then statements. The register sizes (8 bit or 16 bit) are also set/reset as status flags. *(see below)

There is a 16-bit direct page (DP) register, which is like the zero page on the 6502 system, except that it is now movable. Typically, people leave it set to $0000 so that it works the same as the 6502. Zero page is a way to reduce ROM size, by only using 1 byte to refer to an address. The DP always exists on bank zero (00).

The Program Bank Register (PBR or K) is the bank byte (highest byte) of the 24 bit address of where the program is running. Together, with the program counter (PC) the CPU will execute the program at this location. The PBR does NOT increment when the PC overflows from FFFF to 0000, so you can’t have code that flows from one bank to another. You can’t directly set the PBR, but jumping long will change it, and you can push it to the stack to be used by the…

Data Bank Register (DBR or B) is the bank byte (highest byte) of the 24 bit address of where absolute addressing (16 bit) reads and writes. Usually you want to set it to the same as where your program is running. You do it with this…

PHK (push program bank to stack)
PLB (pull from stack to data bank)

But you can also set it to another bank, to use absolute addressing to access that bank’s addresses.
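Here is a rough C sketch of how a bank register and a 16 bit address combine into a 24 bit address, and why code can’t flow from one bank into the next. It only models the address arithmetic, and the function names are made up.

```c
#include <stdint.h>

/* Combine a bank byte (DBR or PBR) with a 16 bit address. */
static uint32_t full_address(uint8_t bank, uint16_t addr)
{
    return ((uint32_t)bank << 16) | addr;
}

/* Advance the program counter by n bytes. Only the 16 bit PC changes.
   The program bank stays the same, so $FFFF wraps around to $0000. */
static uint32_t advance_pc(uint8_t pbr, uint16_t pc, uint16_t n)
{
    return full_address(pbr, (uint16_t)(pc + n));
}
```

So with the data bank set to $7F, an absolute read of $0000 lands on $7F0000. And stepping one byte past $01FFFF gives $010000, not $020000.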

There is also a hidden switch to change the processor from Native Mode (all 65816 functions) to Emulation Mode (compatibility for legacy 6502 software, with direct page fixed to $0000-00ff, stack fixed to $0100-01ff, registers fixed to 8 bit only). The CPU powers on in Emulation Mode, so you will usually see

CLC (clear the carry flag)
XCE (transfer carry flag to CPU mode)

near the start, to put it in Native Mode. That’s what we want, native mode.


Status Flags

N V M X D I Z C
– – – B – – – – (emulation mode only)

N negative flag, set if an operation sets the highest bit of a register
V overflow flag, for signed math operations
M Accumulator size, set for 8-bit, zero for 16-bit
X Index register size, set for 8-bit, zero for 16-bit
D decimal flag, for decimal (BCD) math instead of binary math
I IRQ disable flag, set to block IRQ interrupts
Z zero flag, set if an operation resets a register to zero
. . . . or if a comparison is equal
C carry flag, for addition/subtraction overflow

B break flag, if software break BRK used.
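The flags are just bits in the P register, N at the top (bit 7) down to C at the bottom (bit 0). The 65816 has two instructions for changing them in bulk: SEP sets the bits you name, and REP resets them. Here is a small C model of that (the C names are mine, for illustration only).

```c
#include <stdint.h>

enum {
    FLAG_C = 0x01,  /* carry */
    FLAG_Z = 0x02,  /* zero */
    FLAG_I = 0x04,  /* IRQ disable */
    FLAG_D = 0x08,  /* decimal */
    FLAG_X = 0x10,  /* index size, set = 8 bit */
    FLAG_M = 0x20,  /* accumulator size, set = 8 bit */
    FLAG_V = 0x40,  /* overflow */
    FLAG_N = 0x80   /* negative */
};

/* SEP: set the flags named in mask. */
static uint8_t sep(uint8_t p, uint8_t mask) { return (uint8_t)(p | mask); }

/* REP: reset (clear) the flags named in mask. */
static uint8_t rep(uint8_t p, uint8_t mask) { return (uint8_t)(p & ~mask); }

static int accumulator_is_8bit(uint8_t p) { return (p & FLAG_M) != 0; }
static int index_is_8bit(uint8_t p)       { return (p & FLAG_X) != 0; }
```

For example, rep(p, 0x30) clears both M and X, which is what REP #$30 does to make A, X and Y all 16 bit.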


Where does the program start? It always boots in bank zero, in emulation mode, and reads an address (vector) from the Emulation Mode Reset Vector located at $00FFFC and $00FFFD, then jumps to that address (always jumping to bank zero). Your program should set it to Native Mode, after which these are the important vectors.

IRQ $00FFEE-00FFEF (interrupt vector)
NMI $00FFEA-00FFEB (non-maskable interrupt vector)

If an interrupt happens, it will jump to the address located here (always jumping to bank zero).

There is no Reset Vector in Native Mode. Hitting reset will automatically put it back into Emulation Mode, and it will use that Reset Vector.

But more on those later.

I highly recommend you learn more about 6502 assembly before continuing. Here are some links that are helpful.

and 65816 assembly reference here.

and for the very bold, the really really big detailed book on the subject. You might want to download it just for reference.

Click to access wdc_65816_programming_manual.pdf


SNES Overview


The Super Nintendo first came out in 1991 (1990 in Japan as the Super Famicom). It was one of the best game consoles of all time, with an amazing library of games. It definitely has a special place in my heart. Before we get to programming for it, let’s take a look and see what’s under the hood.

It contains a 65816 clone chip called the Ricoh 5A22 (3.58 MHz), which is a direct descendant of the 6502 chip that powers the original Nintendo. They call it a super-set because essentially all the 6502 instructions work the same on both chips. The chip has both 8 bit and 16 bit modes, and has a full 24 bit addressing space (0 to 0xFFFFFF).

(borrowed this image from )


It has 128kB internal WRAM.

This unit has 2 PPU chips. Later models had 1 chip (marked as “1chip”) and are slightly better, but both function the same, other than picture quality. The PPU is what generates the video image. It has its own 64 kB of VRAM (arranged as 32 k addresses of 16 bits each). The VRAM holds our graphics and background maps.

Background tiles can be 8×8 or 16×16. Background maps can be 32×32 up to 64×64 tiles. Multiply that by the tile size, and you get anything from a 256×256 pixel map to a 1024×1024 pixel map. Scrolling games will typically change the map, a column or row at a time, as you move through the level.

There is also a memory chip for color palettes (CGRAM) of 512 bytes, and a memory chip for sprite attributes (OAM) of 544 bytes, arranged as a low table of 512 and a high table of 32. Color palettes are 15 bits per color, BGR 0bbbbbgggggrrrrr (2 bytes each), so 512 bytes gives us a total of 256 colors.
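Here is a small C sketch of packing a color into that 0bbbbbgggggrrrrr layout. The function names are made up, and the second function assumes you are starting from ordinary 8 bit per channel colors and simply dropping the low 3 bits.

```c
#include <stdint.h>

/* Pack 5 bit red, green and blue (0-31 each) into 0bbbbbgggggrrrrr. */
static uint16_t snes_color(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((b & 0x1F) << 10) | ((g & 0x1F) << 5) | (r & 0x1F));
}

/* Convert an 8 bit per channel color by dropping the low 3 bits. */
static uint16_t snes_color_from_rgb888(uint8_t r, uint8_t g, uint8_t b)
{
    return snes_color((uint8_t)(r >> 3), (uint8_t)(g >> 3), (uint8_t)(b >> 3));
}
```

Pure red comes out as $001F, pure green as $03E0, pure blue as $7C00, and white as $7FFF.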

Sprites are 4 bytes (and 2 extra bits) each, so 512 bytes / 4 gives us 128 sprites displayable at once. The 2 extra bits per sprite are what fill the 32 byte high table (128 x 2 bits = 32 bytes). Sprites can be various sizes, from 8×8 to 64×64. Typically they would be 8×8 or 16×16.
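To show how the two OAM tables fit together, here is a C sketch that finds where a given sprite’s data lives. The function names are mine. It assumes the usual layout, where each high table byte holds the 2 extra bits for 4 sprites, lowest numbered sprite in the lowest bits.

```c
/* Offset of a sprite's 4 bytes in the low table (sprite 0-127). */
static unsigned oam_low_offset(unsigned sprite)
{
    return sprite * 4;
}

/* Offset (from the start of OAM) of the high table byte
   that holds this sprite's 2 extra bits. */
static unsigned oam_high_offset(unsigned sprite)
{
    return 512 + sprite / 4;
}

/* How far to shift that byte right to get this sprite's 2 bits. */
static unsigned oam_high_shift(unsigned sprite)
{
    return (sprite % 4) * 2;
}
```

The last sprite (127) uses bytes 508-511 of the low table and the top 2 bits of byte 543, which is the very end of the 544 bytes.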

The CIC is the security chip. Each cartridge will need a matching chip to run.

The cartridge itself could contain battery backed SRAM for saving games.

On the other side of the motherboard are the audio chips. Together they are called S-SMP. It is a system that runs independently from the CPU. It has its own processor, the SPC700. It has its own 64 kB of RAM where you load the audio program, the song data, and the audio samples.

The audio program will then process the song and set the 8 channels (DSP, digital signal processor) to play the different audio samples at different rates to generate different tones. The samples are compressed in a format called BRR.

When the SNES is switched on, the main program will have to load everything to the audio ram. This is actually a very slow process, and if you notice that games show a black screen for a few seconds when the game is first run, it is probably due to loading the sound RAM. Once the audio program is loaded, it will run automatically. The audio program can also get signals from the game (for example, to change songs, or to play a sound effect).


Graphics Modes

If you are familiar with the original Nintendo, the Super Nintendo works very similarly, except that everything is changeable. Backgrounds can have small tiles, large tiles, small maps, large maps, 1 layer, 2 layers, 3 layers, 4 layers, all moving independently.

You can rearrange the VRAM any way you want. You can put maps first then BG tiles then sprite tiles. Or, BG tiles first then sprite tiles then BG maps. Or, you can have one giant 256 color BG that fills the entire VRAM and no sprite tiles. Any way you want.

Here’s a quick look at the different background modes.

Mode 0

4 layers of 4 colors per tile.

This mode is not as colorful. I generally only see it used for screens of text. The main advantages are the extra layer, and the graphics files are half the usual size. This mode is the most like the original Nintendo (or maybe Gameboy Color), and a game ported from there to the Super Nintendo (with no improvements) might use mode 0.

Mode 1

2 layers of 16 colors per tile.

1 layer of 4 colors per tile.

This is the most used mode. Nearly every game uses mode 1 most of the time. Typically, the first 16 color layer is used for the foreground and the other for the background. Then the 4 color layer is used for text boxes or HUD / score display.


Mode 2

2 layers of 16 colors. Each tile can be shifted. See Tetris Attack, how the play area scrolls upward. This mode is very rarely used. Yoshi’s Island uses it for 1 level (Touch Fuzzy Get Dizzy).


Mode 3

1 layer of 256 colors

1 layer of 16 colors

This mode is very colorful, but the graphics files are twice as big as usual, so games typically only used it (if at all) for title screens. Like this…

(screenshot: Aero the Acro-Bat)

Mode 4

1 layer of 256 colors

1 layer of 4 colors

Similar to mode 2, tiles can be shifted. Rarely used mode.

Mode 5

1 layer of 16 colors

1 layer of 4 colors

This mode is a high resolution mode. It is rarely used. RPM Racing uses it.


Mode 6

1 layer of 16 colors

Also a high resolution mode. Also can shift tiles like mode 2 or 4. I’ve never seen this used outside of demos.

Mode 7

This is the mode that really set Super Nintendo apart from other consoles. This mode is one layer that can zoom in and out and rotate the background. This mode is completely different from the other modes and has its own special graphics format that is hard to explain. There is one big tilemap of 128×128, 1 layer of 256 colors.


Interestingly, mode 7 does not naturally do perspective. It can stretch and rotate only. But, games like F-Zero change the stretching parameters line by line to simulate perspective. As the PPU generates the image sent to the TV, it renders each horizontal line, one at a time, and each line of the BG image is zoomed by a different amount.

Really, all the modes can change parameters line by line. One fairly common technique is to change the main background color to create a gradient effect. It uses HDMA to do this, which automatically sends data to the PPU registers (in this case, to CGRAM) line by line during the active screen time.



For all the modes, the way sprites work is always the same. Sprites are always 16 colors (actually 15, since the 0th color is always transparent). Also, sprites have different size modes (8×8, 16×16, 32×32, 64×64), and different priorities (which work like layers). You can flip them horizontally or vertically.

Like the NES, there is a limit to how many sprites can be displayed on each horizontal line, and the calculation for that is a bit complicated. Only 32 sprites can appear on a line. Also, if you split each sprite into 8×1 slices, only 34 slices fit on a horizontal line, and any beyond that will be invisible. Generally this isn’t a problem, but you should be aware of it.

All the characters on the screen are sprites. Mario. Koopas. Etc. Because sprites can be large and you can fill the screen with them, and move them around easily, sprites can be used for background elements and be considered as another background layer. Title screens often have the title written in sprites, for example. And, mode 7 games use sprites as a second layer.


Other possibilities

You can change modes and settings mid screen.

You can change scrolling mid screen, and create a shifting sine wave pattern, perhaps for a fire or underwater scene.

You can do color transparencies. One layer added to another (or subtracted from another).

You can use “windows” to cut a hole in the screen, or narrow the screen from the sides, or even adjust the windows line by line with HDMA and draw shapes on the screen.

All these effects are more advanced, and we shouldn’t worry too much about that stuff yet.

I plan to write some basic tutorials for getting a very simple game working on the Super Nintendo. But, we need to take it very slowly.




SNES Programming Guide

(by the way, M1TE has been updated, ver. 1.6 and SPEZ is now ver 1.2. May 27,2020. Links below.)

SNES programming pages:

SNES Overview

65816 Basics

Further in 65816

What you need, SNESdev

How ca65 works

SNES Example 1

DMA palette


Layers / Priority


Controllers and NMI


North to the Future. The next big thing is SNESdev.

But the tools I have become used to don’t exist… so I am making my own. In place of NES screen tool, I have made 2 similar apps for the SNES.

M1TE (mode 1 tile editor) for backgrounds and SPEZ for sprites. *** THESE ARE NOT ROM HACKING TOOLS, THEY WILL NOT WORK WELL AT MODIFYING EXISTING SNES / SFC GAMES ***




And instead of neslib for cc65, I have made easySNES library for ca65 (assembly only).

See all these projects here…

(example code with easySNES)

Other projects that are worth looking into… Optiroc.

(converts images to SNES tile and map and palette format)

(another SNES development library)

(the ultimate debugging SNES emulator)




My projects will likely improve over time, so keep an eye out for updates.

I plan to write some examples, and eventually make an SNES homebrew game. Stay tuned.




24. Advanced Mapper – MMC3

MMC3 is the 2nd most popular mapper. Many of the most popular games were MMC3. Several of the Megaman games. Super Mario Bros 2 + 3. This photo is borrowed from bootgod. The little chip at the top says MMC3B.


MMC3 PRG banks are $2000 in size. Max size is 512k (64 of those PRG banks).

MMC3 CHR banks are $400 in size. Max size is 256k (256 of those CHR banks). This extra small size makes it possible to change 1/4 of a tileset.

Although, only one of the tilesets can do this. The other can only change 1/2 at a time. Which one gets which is optional, but I would prefer the BG to have the smaller banks, so that it is easier to animate the background without wasting too much memory.
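The bank numbers above are easy to check. Here is a small C sketch of the arithmetic. The names are made up, and it assumes the banks are simply laid out one after another in the ROM, bank 0 first.

```c
enum {
    PRG_BANK_SIZE = 0x2000,  /* 8k per PRG bank */
    CHR_BANK_SIZE = 0x400    /* 1k per CHR bank */
};

/* Byte offset of a PRG bank from the start of the PRG ROM. */
static unsigned long prg_bank_offset(unsigned bank)
{
    return (unsigned long)bank * PRG_BANK_SIZE;
}

/* Byte offset of a CHR bank from the start of the CHR ROM. */
static unsigned long chr_bank_offset(unsigned bank)
{
    return (unsigned long)bank * CHR_BANK_SIZE;
}
```

64 PRG banks end at offset 512k, and 256 CHR banks end at offset 256k, which is where the max sizes come from. A full 4k tileset is 4 of the $400 banks, so swapping one bank changes 1/4 of it.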

MMC3 can have WRAM, sometimes with a battery for saving the game.

MMC3 can change between Horizontal and Vertical mirroring, freely.

Best of all, MMC3 has a scanline counter hardware IRQ. That means you can very exactly time mid-frame changes, such as scrolling changes. You can do parallax scrolling, like the train stage on Ninja Gaiden 2.

Since IRQ code needs to be programmed in Assembly, instead of having you, the casual C user, try to write your own… I made an automated system, that can do most anything you would need it to do mid-screen. (I will discuss a little later)

I don’t want to get into all the details of how the hardware works, which you can find here, if you want to read them.

There are multiple screen splits (timed by IRQ) and I’m animating the gears by swapping CHR banks. Also, the music is in a swappable bank.
Let’s go over everything that had to be changed…


I made several $2000 byte swappable banks. The C code and all the libraries need to go in a fixed bank. I decided to keep $a000-ffff fixed to the last 3 banks. That means all the swappable banks will go in the $8000 slot… which is why we have all those banks start at $8000. Technically, MMC3 has the ability to swap the $a000-bfff bank, but I’m adapting the code from the MMC1 example, which only expects 1 swappable region, so let’s pretend that you can’t change $a000.

The startup code (reset code) needs to be in the very last $e000-ffff area, because this is the only bank that, by default, we know the position of. Vectors also go here.

Again, this mapper can have WRAM at $6000, so I made that segment to test it. You would need to change the BSS name to XRAM (what I called this segment) to declare a variable or array here.

In the Assembly code, just inserting a .segment like .segment "CODE" makes everything below that point map into that bank.

In the C code, you need to use a pragma, like
#pragma bss-name(push, "ZEROPAGE")
#pragma bss-name(pop)


#pragma rodata-name ("BANK0")
#pragma code-name ("BANK0")

to direct the various types of things into the correct bank.

At the bottom of MMC3_128_128.cfg, it defines some symbols, such as NES_MAPPER=4 and the size of each ROM chip (NES_PRG_BANKS=8). MMC3 can be expanded to 512k PRG and 256k CHR, if you need.



This is the reset code and where most .s files are included. For example, mmc3_code.asm is included here. All in the “STARTUP” segment, so we know it is mapped to $e000-ffff, the default fixed bank.

I inserted a basic MMC3 initialization, which puts all the banks in place (and bank 0 into the $8000 slot). It also, importantly, does a CLI to allow IRQs to work on the CPU chip.

Also, I defined “SOUND_BANK 12”, and put the music data there, in BANK 12. All the music code will swap that bank in place, to play songs.


MMC3 folder


is all the hidden code that you shouldn’t have to worry about. Except if you want to change the A12_INVERT from $80 to 0. This determines which tileset will get the smaller bank $400 size and which will get the larger $800. I have it $80 so that background has the smaller tile banks, so that swapping BG CHR banks (to animate the background) uses less space.

If you prefer to have smaller sprite banks, change this number to 0. This also changes which CHR bank mode you need to change BG vs sprite tiles.

;if invert bit is 0
;mode 0 changes $0000-$07FF
;mode 1 changes $0800-$0FFF

;mode 2 changes $1000-$13FF
;mode 3 changes $1400-$17FF
;mode 4 changes $1800-$1BFF
;mode 5 changes $1C00-$1FFF

;if invert bit is $80
;mode 0 changes $1000-$17FF
;mode 1 changes $1800-$1FFF

;mode 2 changes $0000-$03FF
;mode 3 changes $0400-$07FF
;mode 4 changes $0800-$0BFF
;mode 5 changes $0C00-$0FFF
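The chart follows a simple pattern, which can be sketched in C like this. This is just my summary of the chart, not code from the library, and the function names are made up. Modes 0 and 1 are the $800 sized banks, modes 2-5 are the $400 sized banks, and the invert bit swaps the two halves of the pattern tables.

```c
#include <stdint.h>

/* First PPU address changed by CHR mode 0-5.
   a12_invert is 0 or 0x80, as in the chart. */
static uint16_t chr_mode_start(uint8_t mode, uint8_t a12_invert)
{
    uint16_t start;
    if (mode < 2) {
        start = (uint16_t)(mode * 0x800);
    } else {
        start = (uint16_t)(0x1000 + (mode - 2) * 0x400);
    }
    if (a12_invert & 0x80) {
        start = (uint16_t)(start ^ 0x1000);  /* swap the two tilesets */
    }
    return start;
}

/* How many bytes of PPU address space that mode covers. */
static uint16_t chr_mode_size(uint8_t mode)
{
    return (uint16_t)(mode < 2 ? 0x800 : 0x400);
}
```

For example, with the invert bit at $80, mode 5 starts at $0C00 and covers $400 bytes, so it changes $0C00-$0FFF.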


is the bank swapping code. This is the same as the MMC1 example.
You only need the banked_call() function. Anytime you call a function that is in a swappable bank, you use..

banked_call(unsigned char bankId, void (*method)(void))

(which bank is it in, the name of the function)
Don’t nest more than 10 function calls deep this way, or it will crash.


Here is the new stuff that you need to know.

set_prg_8000(unsigned char bank_id);

this changes which bank is mapped to $8000. I prefer that you use the banked_call() function, which automatically does this for you. You could use this to read data from a certain bank.


returns the number of the bank at $8000


don’t use this, for how I have it set up. It would change the bank mapped at $a000. Currently, we have code mapped here, which could crash if it was missing.

set_chr_mode_0(unsigned char chr_id);
set_chr_mode_1(unsigned char chr_id);
set_chr_mode_2(unsigned char chr_id);
set_chr_mode_3(unsigned char chr_id);
set_chr_mode_4(unsigned char chr_id);
set_chr_mode_5(unsigned char chr_id);

these change which parts of the CHR ROM are mapped to the tilesets.
See the chart above.

set_mirroring(unsigned char mirroring)

to change the PPU mirroring layout. Horizontal or Vertical.


this could turn the WRAM on and off. I turned it ON by default.


turns off the irq, and redirects the IRQ SYSTEM to point to 0xff (end).

set_irq_ptr(char * address)

turns on the IRQ SYSTEM, and points it to a char array.



So I should talk about this IRQ stuff. An IRQ is a hardware interrupt.
The MMC3 has a scanline counter. Once it is set, it counts down each line that is drawn. When it reaches zero, it jumps to the IRQ code. The standard use for the scanline IRQ is to make mid-screen scrolling changes, such as parallax scrolling.

The IRQ code would need to be written in Assembly. Instead of expecting you to learn this to make it work, I wrote an automated system that will parse a set of instructions (char array) and execute the necessary code behind the scenes. Here’s how it works.

If a value is < 0xf0, it’s a scanline count
zero is valid, it triggers an IRQ at the end of the current line

if >= 0xf0…
f0 = 2000 write, next byte is write value
f1 = 2001 write, next byte is write value
f2-f4 unused – future TODO ?
f5 = 2005 write, next byte is H Scroll value
f6 = 2006 write, next 2 bytes are write values

f7 = change CHR mode 0, next byte is write value
f8 = change CHR mode 1, next byte is write value
f9 = change CHR mode 2, next byte is write value
fa = change CHR mode 3, next byte is write value
fb = change CHR mode 4, next byte is write value
fc = change CHR mode 5, next byte is write value

fd = very short wait, no following byte
fe = short wait, next byte is quick loop value
(for fine tuning timing of things)

ff = end of data set
Once it sees a scanline value or an 0xff, it exits the parser.
Small example…
irq_array[0] = 47; // wait 48 lines

irq_array[1] = 0xf5; // 2005, H scroll change
irq_array[2] = 0; // change it to zero
irq_array[3] = 0xff; // end of data

This will cause it to, at the very top of the screen, set the scanline counter for 47. The screen will draw the top 48 lines (it’s usually 1 more than the number) then an IRQ will fire.

It will read the next lines, and change the Horizontal scroll to zero, then it will see the 0xff, and exit.
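To make the format easier to follow, here is a C sketch that walks through a command array the same way, one pass at a time. It is NOT the real parser (that one is written in assembly and actually writes to the registers). It only skips over the right number of bytes. The function names are made up, and I’m assuming the unused f2-f4 values take no following byte.

```c
#include <stddef.h>

/* How many bytes follow each command byte, per the list above. */
static int irq_operand_count(unsigned char cmd)
{
    if (cmd == 0xf6) return 2;                  /* 2006 takes 2 bytes */
    if (cmd == 0xfd) return 0;                  /* very short wait */
    if (cmd >= 0xf2 && cmd <= 0xf4) return 0;   /* unused (assumption) */
    return 1;                                   /* f0, f1, f5, f7-fc, fe */
}

/* Walk one pass, starting at index i.
   If it finds a scanline count, *scanline gets the count and the
   return value is the index where the next pass would start.
   If it finds 0xff, *scanline gets -1 and it returns the index
   of the 0xff. */
static size_t irq_walk(const unsigned char *cmds, size_t i, int *scanline)
{
    for (;;) {
        unsigned char c = cmds[i];
        if (c < 0xf0) { *scanline = c; return i + 1; }
        if (c == 0xff) { *scanline = -1; return i; }
        i += 1 + (size_t)irq_operand_count(c);
    }
}
```

With the small example array, the first pass stops right away at the 47 and returns index 1. The second pass skips over the 0xf5 and its value, then stops at the 0xff at index 3.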

If it saw another scanline value (any # < 0xf0) it would instead set another scanline counter. And so on. Let’s look at the example.


There are 4 IRQ splits here (red lines). I’m doing quite a lot here. This is sort of an extreme example. Let’s go over the array, and see everything that’s going on.


Anything put BEFORE the first scanline count will happen in v-blank, and affect the entire screen.

irq_array[0] = 0xfc; // CHR mode 5, change the 0xc00-0xfff tiles
irq_array[1] = 8; // top of the screen, static value
irq_array[2] = 47; // value < 0xf0 = scanline count, 1 less than #

0xfc, change some tiles, mode 5, use CHR bank 8. Every frame this will always be 8 at the top of the screen. If you look at the video or the ROM, you see the top gear is not spinning. I did this on purpose to show the tile change mid-screen better.

It sees the 47, which is less than 0xf0, so it sets the scanline counter and exits.

At line 48 of the screen, an IRQ fires, the IRQ parser reads the next command.

irq_array[3] = 0xf5; // H scroll change, do first for timing
temp = scroll2 >> 8;
irq_array[4] = temp; // scroll value
irq_array[5] = 0xfc; // CHR mode 5, change the 0xc00-0xfff tiles
irq_array[6] = 8 + char_state; // value = 8,9,10,or 11
irq_array[7] = 29; // scanline count

First, it changes the Horizontal scroll. This code actually takes up an entire scanline, to try to time it near the end of a scanline so it’s not so visible. Note, you can’t change the Vertical scroll this way, only the Horizontal.

Then, it changes the background tiles, and cycles them through 4 options. This causes the gear to spin. The entire screen below this point is affected.

Then, it sets another scanline count and exits. 30 scanlines later another IRQ fires. It then reads this…

irq_array[8] = 0xf5; // H scroll change
temp = scroll3 >> 8;
irq_array[9] = temp; // scroll value
irq_array[10] = 0xf1; // $2001 test changing color emphasis
irq_array[11] = 0xfe; // value COL_EMP_DARK 0xe0 + 0x1e
irq_array[12] = 30; // scanline count

We change the H scroll again. Just to show what ELSE we can do, it’s writing a value to the $2001 register, setting all the color emphasis bits, which darkens the screen. If you move the sprite guy down, you see that he is also affected.

Another scanline count is set, for 30. It takes 31 lines, probably.
irq_array[16] = 0xf5; // H scroll change
irq_array[17] = 0; // to zero, to set the fine X scroll
irq_array[18] = 0xf6; // 2 writes to 2006 shifts screen
irq_array[19] = 0x20; // need 2 values…
irq_array[20] = 0x00; // PPU address $2000 = top of screen
irq_array[21] = 0xff; // end of data

And just to be fancy, I’m showing an example of using the $2006 PPU register mid-screen. But, to make it work, I also needed to set the $2005 register to zero with the 0xf5 command. The 0xf6 command is the only one that is expecting 2 values after it. High byte, then Low byte, of a PPU address.

Writing to $2006 mid-frame causes the scroll to realign to that address. In this example, the address $2000 is the top of the nametable, which means that the top of the nametable is drawn AGAIN, below that point.

Instead of seeing a value < 0xf0, it sees an 0xff, and exits without setting another scanline count. The parser does nothing until the top of the next frame.


In the code, a double buffering system was added so that you aren’t editing the same array that the system is currently reading. Every irq_array[] reference above has been replaced with double_buffer[]. Then, at the end of the frame logic, it waits until the IRQ System is done, using while(!is_irq_done() ).

is_irq_done() returns zero if it’s not done, and 0xff if it is done. The preceding ! not operator flips the result, so it becomes a “while this function returns zero, do nothing”.

Then, it copies the double_buffer array to the irq_array.
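The end-of-frame step looks roughly like this. This is a sketch that can run on a PC, so is_irq_done_stub() is a made-up stand-in for the real is_irq_done(), and the array size of 32 is just an assumption for illustration.

```c
#include <string.h>

#define IRQ_ARRAY_SIZE 32  /* assumed size, for illustration */

static unsigned char irq_array[IRQ_ARRAY_SIZE];      /* read by the IRQ system */
static unsigned char double_buffer[IRQ_ARRAY_SIZE];  /* edited by the game logic */

/* Stand-in for is_irq_done(). It always reports "done" (0xff) here,
   so the sketch can run without NES hardware. */
static unsigned char is_irq_done_stub(void)
{
    return 0xff;
}

/* Call at the end of the frame logic. */
static void finish_frame(void)
{
    while (!is_irq_done_stub()) {
        /* do nothing, wait for the IRQ system to finish */
    }
    memcpy(irq_array, double_buffer, sizeof(irq_array));
}
```

The game logic only ever writes to double_buffer, and the IRQ system only ever reads irq_array, so a half-edited array is never parsed mid-frame.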


Swappable PRG banks

These work the same as the MMC1 code, except that the bank size is $2000.

To make the code / data go in BANK 0, use these directives

#pragma rodata-name ("BANK0")
#pragma code-name ("BANK0")

The function_bank0() function is compiled into bank 0. To call it, we use

banked_call(0, function_bank0);

0 is the bank, function_bank0 is the name of the function.

This jumps to banked_call(), which is in the fixed bank, it automatically swaps bank 0 in place, and then jumps to function_bank0() with a function pointer. Then it swaps the previous bank back in place, and returns.

As you see, this means that I can’t pass arguments to function_bank0().
So, we have to pass arguments via global variables. arg1, arg2, for example. Obviously, this isn’t safe practice. You could immediately pass those values to static local variables, to avoid errors.
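Here is a sketch of the whole idea that can run on a PC. Everything ending in _sim is a made-up stand-in, not the library’s code, and the saved-bank stack of 10 entries is only my guess at why the nesting limit is 10. The real set_prg_8000() writes to the MMC3, where this one just remembers a number.

```c
#define MAX_BANK_DEPTH 10  /* mirrors the "not more than 10 deep" warning */

static unsigned char current_bank;  /* which bank is "mapped" at $8000 */
static unsigned char bank_stack[MAX_BANK_DEPTH];
static unsigned char bank_depth;

/* Stand-in for set_prg_8000(). */
static void set_prg_8000_sim(unsigned char bank_id)
{
    current_bank = bank_id;
}

/* Save the current bank, swap in the new one, call the function,
   then put the previous bank back. */
static void banked_call_sim(unsigned char bank_id, void (*method)(void))
{
    if (bank_depth >= MAX_BANK_DEPTH) {
        return;  /* this sketch just refuses to nest any deeper */
    }
    bank_stack[bank_depth++] = current_bank;
    set_prg_8000_sim(bank_id);
    method();
    set_prg_8000_sim(bank_stack[--bank_depth]);
}

/* Arguments and results travel through global variables. */
static unsigned char arg1, arg2, result, bank_seen;

/* Pretend this function lives in bank 0. */
static void add_in_bank0(void)
{
    static unsigned char a, b;  /* copy the globals right away */
    a = arg1;
    b = arg2;
    bank_seen = current_bank;
    result = (unsigned char)(a + b);
}
```

To use it, set arg1 and arg2, then call banked_call_sim(0, add_in_bank0). While the function runs, bank 0 is in place, and afterward the previous bank is back.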

If you call a function in the SAME swapped bank that you are already in, you can safely use regular function calls.

You can call a function from one swapped bank to another swapped bank. But, don’t keep going from one to another swapped bank too much, not more than 10 deep.

Other important notes.

For the IRQ scanline counter to work, the background must use tileset #0 (which is the default) and sprites must use tileset #1, bank_spr(1). And, the screen must be on, ppu_on_all().

Here’s the link to the example code.

Future plans… maybe some time, I will make a UxROM (perhaps UNROM 512) code example. This would require VRAM, which would involve compression. Otherwise, the PRG works the same as the MMC1, for which we already have working example code, so it wouldn’t be too difficult to convert.