Linking Zig to Pokemon Decomps
Tested with Zig 0.13.0-dev.35+e8f28cda9. Source code on GitHub.
The pret group has a number of decompilations of Pokemon games:
- Pokemon FireRed/LeafGreen for the GBA - readable C
- Pokemon Ruby/Sapphire/Emerald for the GBA - readable C
- Pokemon Diamond/Pearl for the DS - mostly assembly, many unknown symbols
- Pokemon Platinum for the DS - mostly C, many unknown symbols
- Pokemon HeartGold/SoulSilver for the DS - mostly assembly, many unknown symbols
- and many more!
These repositories hold C and assembly code targeting the CPUs of the GBA and DS, the ARM7TDMI and ARM946E-S respectively.
Dozens of contributors have put a lot of work into translating the original machine code into readable C.
Zig is a new programming language that offers an amazing toolchain and powerful metaprogramming capabilities while boasting great compatibility with C codebases. Today's project will focus on adding Zig support to the FireRed decompilation with a small display of comptime functionality.
Adding Zig to the Build System
To start, we need to choose between two separate C compilers to build the project.
- agbcc (named after "AGB", the codename of the CPU inside the GBA) is a fork of GCC with patches that ensure the compiled C code matches 1-to-1 with the original machine code. The match is so exact that the sha1 hashes of the original ROM and the recompiled ROM are the same!
- devkitARM is a toolkit which includes the arm-none-eabi-gcc compiler, another fork of GCC. This compiler doesn't emit machine code that matches up with the original ROM, and is referred to as the "modern" compiler/build.
I'll be using devkitARM.
Zig supports cross-compilation out of the box, and we can compile Zig source to our arm7tdmi target.
zig build-obj input.zig -femit-bin=output.o -target thumb-freestanding-eabi -mcpu arm7tdmi
We want to plug this into pret's build system for pokefirered. The sources are split into a few different directories.
- src/ contains the game engine and is in C
- asm/ contains assembly macros
- data/ contains all kinds of data for the game and scripts in assembly & include files
These are compiled and combined using a Makefile which calls a bunch of specialized tools in the tools/ dir.
There are a few patterns these directories follow that we can copy for a new zig/ directory.
+ ZIG := zig
...
C_BUILDDIR = $(OBJ_DIR)/$(C_SUBDIR)
+ ZIG_BUILDDIR = $(OBJ_DIR)/$(ZIG_SUBDIR)
ASM_BUILDDIR = $(OBJ_DIR)/$(ASM_SUBDIR)
DATA_ASM_BUILDDIR = $(OBJ_DIR)/$(DATA_ASM_SUBDIR)
SONG_BUILDDIR = $(OBJ_DIR)/$(SONG_SUBDIR)
...
C_SRCS := $(wildcard $(C_SUBDIR)/*.c)
C_OBJS := $(patsubst $(C_SUBDIR)/%.c,$(C_BUILDDIR)/%.o,$(C_SRCS))
+ ZIG_SRCS := $(wildcard $(ZIG_SUBDIR)/*.zig)
+ ZIG_OBJS := $(patsubst $(ZIG_SUBDIR)/%.zig,$(ZIG_BUILDDIR)/%.o,$(ZIG_SRCS))
...
- OBJS := $(C_OBJS) $(C_ASM_OBJS) $(ASM_OBJS) $(DATA_ASM_OBJS) $(SONG_OBJS) $(MID_OBJS)
+ OBJS := $(C_OBJS) $(C_ASM_OBJS) $(ASM_OBJS) $(DATA_ASM_OBJS) $(SONG_OBJS) $(MID_OBJS) $(ZIG_OBJS)
The C code is a little abnormal: it's run through the standard C preprocessor, then a custom preprocessor, and is then compiled to assembly. The assembly has some lines appended to it and is assembled to an object file.
Our Zig code will not be doing any of that!
$(C_BUILDDIR)/%.o : $(C_SUBDIR)/%.c $$(c_dep)
	@$(CPP) $(CPPFLAGS) $< -o $(C_BUILDDIR)/$*.i
	@$(PREPROC) $(C_BUILDDIR)/$*.i charmap.txt | $(CC1) $(CFLAGS) -o $(C_BUILDDIR)/$*.s
	@echo -e ".text\n\t.align\t2, 0 @ Don't pad with nop\n" >> $(C_BUILDDIR)/$*.s
	$(AS) $(ASFLAGS) -o $@ $(C_BUILDDIR)/$*.s

$(ZIG_BUILDDIR)/%.o : $(ZIG_SUBDIR)/%.zig
	$(ZIG) build-obj $< -femit-bin=$@ -target thumb-freestanding-eabi -mcpu arm7tdmi
That custom preprocessing step will show up later!
To let C code call Zig, we need to create some header files that declare the Zig symbols present. Zig currently doesn't have a way to generate these, so for every Zig file we will need to manually write the corresponding header file.
To start I'll add a custom string that we can see at the start of the game, during Professor Oak's speech about the controls. We can put this code in zig/.
// zig/oak.zig (broken)
pub const testZigString = "Zig";
To make this visible from C, we need to declare the extern string.
// include/zig/oak.h
extern const unsigned char testZigString[];
This is then accessible from the C source.
+ #include "zig/oak.h"
...
- TopBarWindowPrintString(gText_ABUTTONNext_BBUTTONBack, 0, TRUE);
+ TopBarWindowPrintString(testZigString, 0, TRUE);
This won't work because the resulting Zig object file won't have our testZigString in it! Unused top level declarations aren't present in the final object. This results in a linking error. To ensure it is usable from C, we can use the export keyword.
pub export const testZigString = "Zig";
export is syntactic sugar for calling @export to export a symbol at comptime.
// export syntax
export fn publicName() void {}
// is the same as:
comptime {
    @export(internalName, .{ .name = "publicName", .linkage = .strong });
}
fn internalName() callconv(.C) void {}
This implies the .C calling convention, which matches the C ABI for the target.
There are 17 other calling conventions you can specify on functions, which seem to match up to the calling conventions LLVM offers (though some are missing from this list, like AMD GPU / NVPTX / SPIR-V .Kernel and vulkan-only .Fragment / .Vertex, but those might be covered by numbered conventions?).
Pokemon Character Mapping
Now everything compiles but we get some garbage data:
This is because the Pokemon string renderer doesn't use ASCII! It has its own character map! This is defined in charmap.txt and contains all kinds of chars and multibyte sequences for placeholders.
' ' = 00
'À' = 01
'Á' = 02
'Â' = 03
'Ç' = 04
'È' = 05
'É' = 06
'Ê' = 07
'Ë' = 08
...
'\'' = B4
'♂' = B5
'♀' = B6
'„' = B7
',' = B8
'×' = B9
'/' = BA
'A' = BB
'B' = BC
'C' = BD
'D' = BE
'E' = BF
'F' = C0
...
'ä' = F4
'ö' = F5
'ü' = F6
@ Arrows at F7-FA are duplicates of 79-7C. Unused?
TALL_PLUS = FC 0C FB
'$' = FF
@ Hiragana
'あ' = 01
'い' = 02
'う' = 03
'え' = 04
'お' = 05
...
STRING = FD
@ string placeholders
PLAYER = FD 01
STR_VAR_1 = FD 02
STR_VAR_2 = FD 03
STR_VAR_3 = FD 04
KUN = FD 05
RIVAL = FD 06
@ version-dependent strings
VERSION = FD 07 @ "RUBY" / "SAPPHIRE"
EVIL_TEAM = FD 08 @ "MAGMA" / "AQUA"
GOOD_TEAM = FD 09 @ "AQUA" / "MAGMA"
EVIL_LEADER = FD 0A @ "MAXIE" / "ARCHIE"
GOOD_LEADER = FD 0B @ "ARCHIE" / "MAXIE"
EVIL_LEGENDARY = FD 0C @ "GROUDON" / "KYOGRE"
GOOD_LEGENDARY = FD 0D @ "KYOGRE" / "GROUDON"
Looking at the C source for all of the strings in src/strings.c, we can see the strings are all wrapped in some unique syntax.
const u8 gText_Controls[] = _("CONTROLS");
ALIGNED(4) const u8 gText_PickOk[] = _("{DPAD_UPDOWN}えらぶ {A_BUTTON}けってい");
ALIGNED(4) const u8 gText_ABUTTONNext[] = _("{A_BUTTON}NEXT");
ALIGNED(4) const u8 gText_ABUTTONNext_BBUTTONBack[] = _("{A_BUTTON}NEXT {B_BUTTON}BACK");
ALIGNED(4) const u8 gText_UPDOWNPick_ABUTTONNext_BBUTTONBack[] = _("{DPAD_UPDOWN}PICK {A_BUTTON}NEXT {B_BUTTON}CANCEL");
ALIGNED(4) const u8 gText_UPDOWNPick_ABUTTONBBUTTONCancel[] = _("{DPAD_UPDOWN}PICK {A_BUTTON}{B_BUTTON}CANCEL");
ALIGNED(4) const u8 gText_ABUTTONExit[] = _("{A_BUTTON}EXIT");
const u8 gText_Boy[] = _("BOY");
const u8 gText_Girl[] = _("GIRL");
const u8 gText_PokedexTableOfContents[] = _("POKéDEX TABLE OF CONTENTS");
_() is defined in include/global.h ... but doesn't do anything?
// IDE support
#if defined(__APPLE__) || defined(__CYGWIN__) || defined(__INTELLISENSE__)
// We define these when using certain IDEs to fool preproc
#define _(x) (x)
#define __(x) (x)
So how are these strings translated into the proper character map? Let's look at that custom preprocessing step again!
@$(PREPROC) $(C_BUILDDIR)/$*.i charmap.txt | $(CC1) $(CFLAGS) -o $(C_BUILDDIR)/$*.s
tools/preproc/preproc is a C++ program that modifies C and assembly source.
It will translate an ASCII (or UTF-8?) string to a byte array literal containing the mapped chars. This line applies charmap.txt to the preprocessor-expanded C source files. The result is then fed into the C compiler. Here's an example of what the transformation looks like:
// example C preproc application
// before
const u8 exText_Zig[] = _("Zig");
// after
const u8 exText_Zig[] = { 0xD4, 0xDD, 0xDB, 0xFF };
0xFF is the sentinel value the games use to designate the end of a string. It appears to be appended automatically to every string after remapping.
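To make the transformation concrete, here is a minimal C sketch of the single-byte part of that mapping. It is not how preproc is implemented (preproc reads charmap.txt at build time). The names map_char and map_str are hypothetical. The offsets for space, digits, and uppercase letters are taken from values shown in this article, and the lowercase offset 0xD5 is inferred from the "Zig" example above ('i' = 0xDD, 'g' = 0xDB). The sketch also assumes an ASCII host, so that letters and digits are contiguous.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GAME_EOS 0xFF /* end-of-string sentinel used by the game */

/* Map one ASCII character to the game's encoding.
 * Only space, digits, and unaccented letters are handled; any other
 * character is passed through unchanged. */
static uint8_t map_char(char ch)
{
    if (ch == ' ')              return 0x00;
    if (ch >= '0' && ch <= '9') return (uint8_t)(0xA1 + (ch - '0'));
    if (ch >= 'A' && ch <= 'Z') return (uint8_t)(0xBB + (ch - 'A'));
    if (ch >= 'a' && ch <= 'z') return (uint8_t)(0xD5 + (ch - 'a')); /* inferred offset */
    return (uint8_t)ch;
}

/* Encode the NUL-terminated string src into out and append the sentinel.
 * out_cap must be at least strlen(src) + 1.
 * Returns the number of bytes written, sentinel included. */
static size_t map_str(const char *src, uint8_t *out, size_t out_cap)
{
    size_t n = 0;
    while (src[n] != '\0') {
        assert(n + 1 < out_cap); /* leave room for the sentinel */
        out[n] = map_char(src[n]);
        n++;
    }
    assert(n < out_cap);
    out[n++] = GAME_EOS;
    return n;
}
```

Calling map_str("Zig", buf, sizeof buf) with an 8-byte buf writes the four bytes 0xD4, 0xDD, 0xDB, 0xFF, the same values as the preprocessed array above.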
With Zig, we don't need to do any preprocessor shenanigans. We can build a comptime function to manipulate strings so the final byte arrays present in our object file use the proper mappings. There are a few interesting Zig features to note in my implementation of this mapping.
pub export const testZigString = mapStr("Zig");
/// Map a string to the character set used by FireRed.
/// This will also append a sentinel value.
fn mapStr(comptime str: []const u8) [str.len + 1]c_char {
    comptime var buffer = [_]c_char{0xFF} ** (str.len + 1);
    for (str, 0..) |ch, i| {
        buffer[i] = mapChar(ch);
    }
    return buffer;
}
This function takes in a compile-time known string slice ([]const u8) and returns the mapped string as an array ([..]c_char). Let's break down some of the Zig-isms in here.
fn mapStr(comptime str: []const u8) [ ... ]c_char {
The return value uses c_char to match what the C code consuming these values is expecting. The original string values use u8 in the C source, but note this is actually a typedef for unsigned char. Primitive types in Zig are not guaranteed to be compatible with the C ABI, nor is char guaranteed to be 8 bits! c_char indicates to the Zig compiler that the data type size should match what the target's C ABI uses. For our target this won't matter because a char is 8 bits, but it's nice to know Zig is capable of expressing the differences here without tying its own primitive types to the C ABI. We could compile this code to interface with old, non-POSIX C (because POSIX mandates char == 8 bits) and Zig would handle the type/ABI differences.
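The C side of this point can be sketched in a few lines. The snippet below is only an illustration with hypothetical helper names (plain_char_is_signed, char_to_byte); it is not code from the decompilation. It shows that the width of char is exposed through CHAR_BIT, and that the signedness of plain char is a separate, implementation-defined property. As far as I know plain char is unsigned under the ARM EABI, which would explain why storing 0xFF into a c_char works for this target, but treat that as an assumption to verify.

```c
#include <assert.h>
#include <limits.h>

/* The article notes that the decomp's u8 is a typedef for unsigned char. */
typedef unsigned char u8;

/* The C standard only guarantees CHAR_BIT >= 8; POSIX requires exactly 8.
 * This refuses to compile on a platform with wider chars. */
static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit chars");

/* Whether plain char is signed is implementation-defined.
 * CHAR_MIN is 0 when it is unsigned and negative when it is signed. */
static int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Convert a plain char to a byte value in 0..255, independent of
 * whether plain char is signed on the target. */
static u8 char_to_byte(char c)
{
    return (u8)c;
}
```

Building this with both the host compiler and arm-none-eabi-gcc and comparing the result of plain_char_is_signed() is a quick way to see the kind of difference c_char is meant to absorb.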
[str.len + 1]c_char
The return type has a length specified between the [] symbols. This length is a normal Zig expression that can contain literals, references, field access, operators, function calls, and more. Here we set it to str.len + 1 to account for the sentinel value we add. str is available to use here because its value is known at compile-time. This is an important property we can express in Zig! The call site knows at comptime that the result of mapStr("Zig") will be [4]c_char.
comptime var buffer = [_]c_char{0xFF} ** (str.len + 1);
The next line declares a buffer filled with 0xFF. The [_]foo{D} ** N syntax declares an array of length N filled with the default value D. The length of this buffer is the same as the type we're going to return, str.len + 1. comptime var indicates this variable exists at compile-time and is mutable by other compile-time code.
mapChar is what implements the actual mapping. This doesn't account for the multibyte sequences the original mapper substitutes, but it's a start.
fn mapChar(ch: u8) u8 {
    return switch (ch) {
        '0' => 0xA1,
        '1' => 0xA2,
        '2' => 0xA3,
        '3' => 0xA4,
        '4' => 0xA5,
        '5' => 0xA6,
        '6' => 0xA7,
        '7' => 0xA8,
        ...
        'ä' => 0xF4,
        'ö' => 0xF5,
        'ü' => 0xF6,
        '$' => 0xFF,
        else => ch,
    };
}
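The multibyte sequences that mapChar skips could be handled with a small lookup step. The following is a rough C sketch of that idea, not a port of what preproc does. It assumes placeholders are written in braces like the {A_BUTTON} markers in src/strings.c, and it uses only the two-byte entries visible in the charmap.txt excerpt. The names placeholder, find_placeholder, map_simple, and encode_text are hypothetical, and the single-byte mapping is cut down to space and A-Z to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table type; entries are copied from the charmap.txt excerpt. */
struct placeholder {
    const char *name;
    uint8_t bytes[2];
};

static const struct placeholder placeholders[] = {
    { "PLAYER",    { 0xFD, 0x01 } },
    { "STR_VAR_1", { 0xFD, 0x02 } },
    { "STR_VAR_2", { 0xFD, 0x03 } },
    { "STR_VAR_3", { 0xFD, 0x04 } },
    { "RIVAL",     { 0xFD, 0x06 } },
};

/* Find the byte sequence for a name of length len (braces excluded).
 * Returns NULL when the name is not in the table. */
static const uint8_t *find_placeholder(const char *name, size_t len)
{
    size_t count = sizeof placeholders / sizeof placeholders[0];
    for (size_t i = 0; i < count; i++) {
        if (strlen(placeholders[i].name) == len &&
            memcmp(placeholders[i].name, name, len) == 0)
            return placeholders[i].bytes;
    }
    return NULL;
}

/* Single-byte mapping, reduced to space and A-Z for brevity. */
static uint8_t map_simple(char ch)
{
    if (ch == ' ')              return 0x00;
    if (ch >= 'A' && ch <= 'Z') return (uint8_t)(0xBB + (ch - 'A'));
    return (uint8_t)ch;
}

/* Encode src into out, expanding {NAME} and appending the 0xFF sentinel.
 * Returns the number of bytes written, or 0 on error (unknown or
 * unterminated placeholder, or out too small). */
static size_t encode_text(const char *src, uint8_t *out, size_t cap)
{
    assert(src != NULL && out != NULL);
    size_t n = 0;
    while (*src != '\0') {
        if (*src == '{') {
            const char *end = strchr(src, '}');
            if (end == NULL) return 0;
            const uint8_t *seq = find_placeholder(src + 1, (size_t)(end - src - 1));
            if (seq == NULL || cap - n < 2) return 0;
            out[n++] = seq[0];
            out[n++] = seq[1];
            src = end + 1;
        } else {
            if (cap - n < 1) return 0;
            out[n++] = map_simple(*src++);
        }
    }
    if (cap - n < 1) return 0;
    out[n++] = 0xFF;
    return n;
}
```

With a 16-byte buffer, encode_text("HI {PLAYER}", buf, sizeof buf) writes C2 C3 00 FD 01 FF. A comptime Zig version would have the extra wrinkle that the output length is no longer str.len + 1, so the result size would need to be computed in a first pass.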
With these changes the string will use the proper encoding in the final object file Zig produces. In game we can see the top bar renders our string correctly!
Overall Zig ended up being very simple to integrate. The next step is to start building out features in Zig that utilize the game's C headers. This effort may be better spent on the RH Hideout's pokeemerald-expansion project, which includes a number of improvements to the game engine already. For now I'll be focusing on some other projects I have in mind, but Zig is looking like a promising new tool in my belt.