Linking Zig to Pokemon Decomps

zig, modding

Tested with Zig 0.13.0-dev.35+e8f28cda9. Source code on GitHub.

The pret group maintains a number of decompilations of Pokemon games.

These repositories hold C and assembly code targeting the CPUs of the GBA and DS: the ARM7TDMI and ARM946E-S, respectively. Dozens of contributors have put a lot of work into translating the original machine code into readable C.

Zig is a new programming language that offers an amazing toolchain and powerful metaprogramming capabilities while boasting great compatibility with C codebases. Today’s project will focus on adding Zig support to the FireRed decompilation with a small display of comptime functionality.

To start, we need to choose between two separate C compilers to build the project.

  1. agbcc (named after “AGB”, the codename of the CPU inside the GBA) is a fork of GCC with patches that ensure the compiled C code matches the original machine code 1-to-1. The match is exact: the SHA-1 hashes of the original ROM and the recompiled ROM are the same!
  2. devkitARM is a toolkit which includes the arm-none-eabi-gcc compiler, another fork of GCC. This compiler doesn’t emit machine code that matches up with the original ROM, and is referred to as the “modern” compiler/build.

I’ll be using devkitARM. Zig supports cross-compilation out of the box, and we can compile Zig source to our arm7tdmi target.

zig build-obj input.zig -femit-bin=output.o -target thumb-freestanding-eabi -mcpu arm7tdmi

We want to plug this into pret’s build system for pokefirered. The sources are split into a few different directories.

  • src/ contains the game engine and is in C
  • asm/ contains assembly macros
  • data/ contains all kinds of data for the game and scripts in assembly & includes files

These are compiled and combined using a Makefile which calls a bunch of specialized tools in the tools/ dir. There are a few patterns these directories follow that we can copy for a new zig/ directory.

+ ZIG := zig

...

  C_BUILDDIR = $(OBJ_DIR)/$(C_SUBDIR)
+ ZIG_BUILDDIR = $(OBJ_DIR)/$(ZIG_SUBDIR)
  ASM_BUILDDIR = $(OBJ_DIR)/$(ASM_SUBDIR)
  DATA_ASM_BUILDDIR = $(OBJ_DIR)/$(DATA_ASM_SUBDIR)
  SONG_BUILDDIR = $(OBJ_DIR)/$(SONG_SUBDIR)

...

  C_SRCS := $(wildcard $(C_SUBDIR)/*.c)
  C_OBJS := $(patsubst $(C_SUBDIR)/%.c,$(C_BUILDDIR)/%.o,$(C_SRCS))
+ ZIG_SRCS := $(wildcard $(ZIG_SUBDIR)/*.zig)
+ ZIG_OBJS := $(patsubst $(ZIG_SUBDIR)/%.zig,$(ZIG_BUILDDIR)/%.o,$(ZIG_SRCS))

...

- OBJS := $(C_OBJS) $(C_ASM_OBJS) $(ASM_OBJS) $(DATA_ASM_OBJS) $(SONG_OBJS) $(MID_OBJS)
+ OBJS := $(C_OBJS) $(C_ASM_OBJS) $(ASM_OBJS) $(DATA_ASM_OBJS) $(SONG_OBJS) $(MID_OBJS) $(ZIG_OBJS)

The C code is a little abnormal - it’s run through the standard C preprocessor, a custom preprocessor, and then is compiled to assembly. The assembly has some lines appended to it and is assembled to an object file.

Our Zig code will not be doing any of that!

$(C_BUILDDIR)/%.o : $(C_SUBDIR)/%.c $$(c_dep)
	@$(CPP) $(CPPFLAGS) $< -o $(C_BUILDDIR)/$*.i
	@$(PREPROC) $(C_BUILDDIR)/$*.i charmap.txt | $(CC1) $(CFLAGS) -o $(C_BUILDDIR)/$*.s
	@echo -e ".text\n\t.align\t2, 0 @ Don't pad with nop\n" >> $(C_BUILDDIR)/$*.s
	$(AS) $(ASFLAGS) -o $@ $(C_BUILDDIR)/$*.s

$(ZIG_BUILDDIR)/%.o : $(ZIG_SUBDIR)/%.zig
	$(ZIG) build-obj $< -femit-bin=$@ -target thumb-freestanding-eabi -mcpu arm7tdmi

That custom preprocessing step will show up later!

To let C code call Zig, we need to create some header files that declare the Zig symbols present. Zig currently doesn’t have a way to generate these, so for every Zig file we will need to manually write the corresponding header file.

To start I’ll add a custom string that we can see at the start of the game, during Professor Oak’s speech about the controls. We can put this code in zig/.

// zig/oak.zig (broken)
pub const testZigString = "Zig";

To make this visible from C, we need to declare the extern string.

// include/zig/oak.h
extern const unsigned char testZigString[];

This is then accessible from the C source.

+ #include "zig/oak.h"

...

- TopBarWindowPrintString(gText_ABUTTONNext_BBUTTONBack, 0, TRUE);
+ TopBarWindowPrintString(testZigString, 0, TRUE);

This won’t work because the resulting Zig object file won’t have our testZigString in it! Unused top level declarations aren’t present in the final object. This results in a linking error. To ensure it is usable from C, we can use the export keyword.

// `.*` dereferences the literal so the array itself is exported, not a pointer to it
pub export const testZigString = "Zig".*;

export is syntactic sugar for calling @export to export a symbol at comptime.

// export syntax
export fn publicName() void {}

// is the same as:
comptime {
    @export(internalName, .{ .name = "publicName", .linkage = .strong });
}

fn internalName() callconv(.C) void {}

This implies the .C calling convention which matches the C ABI for the target. There are 17 other calling conventions you can specify on functions, which seem to match up to the calling conventions LLVM offers (though some are missing from this list like AMD GPU / NVPTX / SPIR-V .Kernel and vulkan-only .Fragment / .Vertex, but those might be covered by numbered conventions?).
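To make the pairing of an exported symbol and its hand-written header concrete, here is a minimal sketch. zigClampLevel is a hypothetical helper I made up for illustration; it is not part of pokefirered, and the prototype in the comment assumes the decomp's u8 typedef is in scope.

```zig
const std = @import("std");

/// Hypothetical helper for illustration; not part of pokefirered.
/// `export` gives the function an unmangled symbol name and the C calling convention.
export fn zigClampLevel(level: u8) u8 {
    return if (level > 100) 100 else level;
}

// A matching hand-written prototype for a header under include/zig/,
// assuming the decomp's u8 typedef is available there:
//
//   u8 zigClampLevel(u8 level);
```

The same function stays callable from other Zig code under its normal name, so it can be unit tested on the host before linking it into the ROM.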

Now everything compiles but we get some garbage data:

[Screenshot: the injected Zig string renders as garbage.]

This is because the Pokemon string renderer doesn’t use ASCII! It has its own character map! This is defined in charmap.txt and contains all kinds of chars and multibyte sequences for placeholders.

' '         = 00
'À'         = 01
'Á'         = 02
'Â'         = 03
'Ç'         = 04
'È'         = 05
'É'         = 06
'Ê'         = 07
'Ë'         = 08

...

'\''        = B4
'♂'         = B5
'♀'         = B6
'„'         = B7
','         = B8
'×'         = B9
'/'         = BA
'A'         = BB
'B'         = BC
'C'         = BD
'D'         = BE
'E'         = BF
'F'         = C0

...

'ä'         = F4
'ö'         = F5
'ü'         = F6
@ Arrows at F7-FA are duplicates of 79-7C. Unused?
TALL_PLUS   = FC 0C FB
'$'         = FF

@ Hiragana
'あ' = 01
'い' = 02
'う' = 03
'え' = 04
'お' = 05

...

STRING = FD

@ string placeholders
PLAYER         = FD 01
STR_VAR_1      = FD 02
STR_VAR_2      = FD 03
STR_VAR_3      = FD 04
KUN            = FD 05
RIVAL          = FD 06
@ version-dependent strings
VERSION        = FD 07 @ "RUBY"    / "SAPPHIRE"
EVIL_TEAM      = FD 08 @ "MAGMA"   / "AQUA"
GOOD_TEAM      = FD 09 @ "AQUA"    / "MAGMA"
EVIL_LEADER    = FD 0A @ "MAXIE"   / "ARCHIE"
GOOD_LEADER    = FD 0B @ "ARCHIE"  / "MAXIE"
EVIL_LEGENDARY = FD 0C @ "GROUDON" / "KYOGRE"
GOOD_LEGENDARY = FD 0D @ "KYOGRE"  / "GROUDON"

Looking at the C source for all of the strings in src/strings.c, we can see the strings are all wrapped in some unique syntax.

const u8 gText_Controls[] = _("CONTROLS");
ALIGNED(4) const u8 gText_PickOk[] = _("{DPAD_UPDOWN}えらぶ {A_BUTTON}けっどい");
ALIGNED(4) const u8 gText_ABUTTONNext[] = _("{A_BUTTON}NEXT");
ALIGNED(4) const u8 gText_ABUTTONNext_BBUTTONBack[] = _("{A_BUTTON}NEXT {B_BUTTON}BACK");
ALIGNED(4) const u8 gText_UPDOWNPick_ABUTTONNext_BBUTTONBack[] = _("{DPAD_UPDOWN}PICK {A_BUTTON}NEXT {B_BUTTON}CANCEL");
ALIGNED(4) const u8 gText_UPDOWNPick_ABUTTONBBUTTONCancel[] = _("{DPAD_UPDOWN}PICK {A_BUTTON}{B_BUTTON}CANCEL");
ALIGNED(4) const u8 gText_ABUTTONExit[] = _("{A_BUTTON}EXIT");
const u8 gText_Boy[] = _("BOY");
const u8 gText_Girl[] = _("GIRL");
const u8 gText_PokedexTableOfContents[] = _("POKéDEX   TABLE OF CONTENTS");

_() is defined in include/global.h but doesn’t do anything?

// IDE support
#if defined(__APPLE__) || defined(__CYGWIN__) || defined(__INTELLISENSE__)
// We define these when using certain IDEs to fool preproc
#define _(x)        (x)
#define __(x)       (x)

So how are these strings translated into the proper character map? Let’s look at that custom preprocessing step again!

@$(PREPROC) $(C_BUILDDIR)/$*.i charmap.txt | $(CC1) $(CFLAGS) -o $(C_BUILDDIR)/$*.s

tools/preproc/preproc is a C++ program that modifies C and assembly source. It will translate an ASCII (or UTF-8?) string to a byte array literal containing the mapped chars. The Makefile line above applies charmap.txt to the C source after the standard preprocessor has expanded it. The result is then fed into the C compiler. Here’s an example of what the transformation looks like:

// example C preproc application

// before
const u8 exText_Zig[] = _("Zig");

// after
const u8 exText_Zig[] = { 0xD4, 0xDD, 0xDB, 0xFF };

0xFF is the sentinel value the games use to mark the end of a string. It appears to be appended automatically to every string after remapping.
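The bytes in that example can be checked by hand against the charmap. This is a rough sketch, assuming the letters are laid out contiguously: 'A' = BB comes from the charmap excerpt, while 'a' = D5 is my inference from the example bytes (0xDD for 'i' minus 8). mapLetter and zig_bytes are names I made up for the check.

```zig
const std = @import("std");

// Assumption: letters are contiguous in charmap.txt, starting at
// 'A' = 0xBB (from the excerpt) and 'a' = 0xD5 (inferred from the example).
fn mapLetter(ch: u8) u8 {
    return switch (ch) {
        'A'...'Z' => 0xBB + (ch - 'A'),
        'a'...'z' => 0xD5 + (ch - 'a'),
        else => ch, // everything else is left untouched in this sketch
    };
}

// "Zig" remapped by hand, with the 0xFF terminator added at the end.
const zig_bytes = [_]u8{ mapLetter('Z'), mapLetter('i'), mapLetter('g'), 0xFF };
```

'Z' is the 26th letter, so 0xBB + 25 = 0xD4, which matches the first byte of the preprocessed array.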

With Zig, we don’t need to do any preprocessor shenanigans. We can build a comptime function to manipulate strings so the final byte arrays present in our object file use the proper mappings. There are a few interesting Zig features to note in my implementation of this mapping.

pub export const testZigString = mapStr("Zig");

/// Map a string to the character set used by FireRed.
/// This will also append a sentinel value.
fn mapStr(comptime str: []const u8) [str.len + 1]c_char {
    comptime var buffer = [_]c_char{0xFF} ** (str.len + 1);
    for (str, 0..) |ch, i| {
        buffer[i] = mapChar(ch);
    }
    return buffer;
}

This function takes a compile-time-known string slice ([]const u8) and returns an array ([..]c_char). Let’s break down some of the Zig-isms in here.

fn mapStr(comptime str: []const u8) [ ... ]c_char {

The return value uses c_char to match what the C code consuming these values expects. The original strings use u8 in the C source, but note that this is actually a typedef for unsigned char. Primitive types in Zig are not guaranteed to be compatible with the C ABI, nor is char guaranteed to be 8 bits! c_char tells the Zig compiler that the type should match what the target’s C ABI uses for char. For our target this won’t matter because a char is 8 bits, but it’s nice to know Zig is capable of expressing the differences here without tying its own primitive types to the C ABI. We could compile this code to interface with old, non-POSIX C (because POSIX mandates char == 8 bits) and Zig would handle the type/ABI differences.
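A quick way to see what c_char resolves to is to ask the compiler. This is a small sketch using only builtins and std.math; the constant names are mine. The limits are computed rather than hard-coded because the signedness of char depends on the target.

```zig
const std = @import("std");

// Properties of C `char` on whatever target this file is compiled for.
const char_bits = @bitSizeOf(c_char);
const char_min = std.math.minInt(c_char);
const char_max = std.math.maxInt(c_char);

// Fails the build if char is not 8 bits wide on the chosen target.
comptime {
    std.debug.assert(char_bits == 8);
}
```

For the ARM EABI target used here I would expect an unsigned char (0 to 255), while an x86-64 Linux host should report -128 to 127. That difference is why a 0xFF fill value fits in c_char for the GBA build but would not on every host.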

[str.len + 1]c_char

The return type has a length specified between the [] brackets. This length is a normal Zig expression that can contain literals, references, field access, operators, function calls, and more. Here we set it to str.len + 1 to account for the sentinel value we add. str is available to use here because its value is known at compile time. This is an important property we can express in Zig! The call site knows at comptime that the result of mapStr("Zig") will be [4]c_char.
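Here is a stripped-down sketch of that idea which can be checked on its own. withTerminator is a hypothetical helper, not the mapStr from this post, and it uses u8 instead of c_char so it also builds on hosts where char is signed.

```zig
const std = @import("std");

/// Hypothetical helper: copies `str` and appends one 0xFF terminator byte.
/// The array length in the return type is computed from the comptime parameter.
fn withTerminator(comptime str: []const u8) [str.len + 1]u8 {
    var out: [str.len + 1]u8 = undefined;
    for (str, 0..) |ch, i| {
        out[i] = ch;
    }
    out[str.len] = 0xFF;
    return out;
}

// The result type is known to the compiler without running the function body.
comptime {
    std.debug.assert(@TypeOf(withTerminator("Zig")) == [4]u8);
    std.debug.assert(@TypeOf(withTerminator("")) == [1]u8);
}
```

Each distinct string length produces a distinct array type, which is what lets the exported constant have an exact size in the object file.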

comptime var buffer = [_]c_char{0xFF} ** (str.len + 1);

The next line declares a buffer filled with 0xFF. [_]foo{D} is a one-element array literal, and ** N repeats it N times at compile time, producing an array of length N with every element set to D. The length of this buffer matches the array type we’re going to return, str.len + 1. comptime var indicates this variable exists at compile time and can be mutated by other compile-time code.
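A few standalone uses of ** may make the syntax easier to read. This is only an illustrative sketch; filled, pattern, and sentinelBuffer are names I picked, and u8 stands in for c_char.

```zig
const std = @import("std");

// One element repeated four times: FF FF FF FF.
const filled = [_]u8{0xFF} ** 4;

// `**` repeats the whole array, so this is AA BB AA BB.
const pattern = [_]u8{ 0xAA, 0xBB } ** 2;

/// Hypothetical helper: the repeat count can be any comptime-known value.
fn sentinelBuffer(comptime n: usize) [n]u8 {
    return [_]u8{0xFF} ** n;
}
```

Pre-filling with the sentinel means the loop in mapStr only has to overwrite the first str.len entries; the last slot is already 0xFF.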

mapChar is what implements the actual mapping. This doesn’t account for the multibyte sequences the original mapper substitutes, but it’s a start.
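As a rough idea of where the multibyte handling could start, here is a lookup for a few of the placeholder names from charmap.txt. placeholderBytes is a hypothetical function of mine, it covers only three entries, and it does not parse {NAME} sequences out of a string; that part is left open.

```zig
const std = @import("std");

/// Hypothetical lookup for a handful of charmap.txt placeholders.
/// Returns the multibyte sequence for `name`, or null if it is not in this sketch.
fn placeholderBytes(name: []const u8) ?[]const u8 {
    if (std.mem.eql(u8, name, "PLAYER")) return &[_]u8{ 0xFD, 0x01 };
    if (std.mem.eql(u8, name, "STR_VAR_1")) return &[_]u8{ 0xFD, 0x02 };
    if (std.mem.eql(u8, name, "RIVAL")) return &[_]u8{ 0xFD, 0x06 };
    return null;
}
```

Because a placeholder expands to more than one byte, a full mapper would need to compute the output length from the parsed string rather than from str.len + 1.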

fn mapChar(ch: u8) u8 {
    return switch (ch) {
        '0' => 0xA1,
        '1' => 0xA2,
        '2' => 0xA3,
        '3' => 0xA4,
        '4' => 0xA5,
        '5' => 0xA6,
        '6' => 0xA7,
        '7' => 0xA8,
        ...
        'ä' => 0xF4,
        'ö' => 0xF5,
        'ü' => 0xF6,
        '$' => 0xFF,
        else => ch,
    };
}

With these changes the string will use the proper encoding in the final object file Zig produces. In game we can see the top bar renders our string correctly!

[Screenshot: the injected Zig string renders correctly.]

Overall Zig ended up being very simple to integrate. The next step is to start building out features in Zig that utilize the game’s C headers. This effort may be better spent on the RH Hideout’s pokeemerald-expansion project, which includes a number of improvements to the game engine already. For now I’ll be focusing on some other projects I have in mind, but Zig is looking like a promising new tool in my belt.