Alox: Designing an Actor Model on the GPU

2020·04·03 · language design

Intro


Last November, I drew out these plans for Alox in my Discrete Math class. Alox is a language I am continuously “creating” and “putting off until I have more time.” Now that the world has closed its doors for a while, I have found myself with the time needed to work on side projects.

Alox’s main design goal is this: execution can be much faster if we distribute across many machines. This goal was inspired from languages like Erlang and Pony where distributing computation is natural. Personally, I find these languages difficult to work in. I am a stickler for a good type system, and Erlang’s dynamic typing doesn’t appeal to me. Pony recognizes this and goes way off the deep end, bringing a very complex type system to the table. Its capabilities are surely amazing, but in the little time I have spent in the language, I have found it unintuative to write large aplications.

In Alox, I want to make it intuative to create and maintain large applications. Things like proper tooling, excellent package management, and helpful error messages will help make it intuative. Tooling, of course, has to come after the language, right? How’s the language coming along?

Implementation Woes

Well, it’s not done yet. Not even close. I have had many failed attempts at compilers, but my most recent, alox-lang/alox, has had a number of issues. Namely, memory management sucks.

Rust is getting on my nerves

Rust is a great language. Having written almost 50,000 lines of Rust in the last couple years, I can say Rust’s borrow checker is a pain in the ass. It is absolutely doing its job correctly, but there have been so many times where I want to tell Rust, “I know what I’m doing”. Oddly enough, I can’t decide on what memory management strategy works best for me. The JVM’s GC works wonders on large applications, but for smaller ones I don’t want to bundle such a large runtime. I think many people are coming to the realization that we don’t need large runtimes for small apps. The flip side of this is memory safety. I don’t want to manage memory myself, so I’ll let my huge runtime do it for me. This proved quite effective and efficient, but speed was hindered. My huge application didn’t take long to write, but it’s running slower than a snail!

Rust brings the best of both worlds: I can write my application swiftly, it’s memory safe, and it’s still blazingly fast! …right?

Well, with my experience, writing applications in Rust takes time. A common complaint from beginners is that the borrow checker is difficult to deal with. You end up wrestling with it forever. But once you understand the ins and outs of the borrow checker and the rules become ingrained in your memory, you can write memory safe applications really fast! Except, this isn’t the reality for a lot of developers. I can write Rust pretty fast, but after years I am still wrestling with it. There are explicit rules that can’t be broken whatsoever, even if you know you can break them.

The current implementation of Alox is written in Rust. It has a lexer and parser that converts the source code into an AST which is then converted to an IR. This IR is further optimized and the only backend I have started is the CraneLift backend. Overall, there has been a lot of success with this iteration. The main pain point has been the IR. I want the IR to be operated on by multiple threads, meaning some threads are optimizing, others are still parsing, and others are already at codegen. This proved difficult in Rust. Very difficult.

The core structure that is used in many spots in the IR is the concept of a reference to a declaration that doesn’t exist yet. For example, here’s what the Function structure would look like in Kotlin:

class Function(
    val name: String,
    val arguments: List<Pair<String, DeclarationContainer>>,
    val returnType: DeclarationContainer,
    val blocks: List<Block>,
)

DeclarationContainer in this context is the “declaration hole” as I call it. This hole is filled once the declaration is parsed and becomes available. In Kotlin, this type may be as simple as a class holding a nullable reference.

class DeclarationContainer(
    val declaration: Declaration?
)

Notice when we use this class, it is essentially a pointer to a nullable pointer. How do we describe this in Rust?

Well, first let’s translate the basics: &Option<&T> is a reference to an optional reference to a type. Perfect! We succesfully translated it! Ship it! …except this isn’t thread safe. And if something isn’t thread safe in Rust, it can not be used by multiple threads. So, how do we translate this to be thread safe and mutable? The answer to that is far from trivial.

From what I have read, the defacto mutable pointer is an Arc<Mutex<T>>. Arc lets you copy the reference to multiple threads and Mutex lets you mutate the result by requiring any threads with access to wait for a lock. If you want to allow multiple readers at once, you can replace Mutex with RwLock: Arc<RwLock<T>>. Rust boasts fearless concurrency and zero cost abstractions, but you can’t have both: to make concurrency fearless, you need to tack on runtime costs. Arcs use reference counting and Mutex and RwLock make threads halt by require obtaining a lock at runtime.

How do we describe what I want in a declaration container? Here’s what I got:

// Thread safe reference to a mutable option of a thread safe reference to a declaration
// acts as a declaration "hole" that needs to be filled
#[derive(Clone, Debug)]
pub struct DeclarationContainer(pub Arc<Mutex<Option<Arc<RwLock<Declaration>>>>>);

While I haven’t made a large enough codebase in Alox to test this, I imagine these costs will add up quickly. With every declaration having to count a reference when it’s created and ask for a lock every time it’s used, it will be incredbily slow. On top of this cost, it is difficult to use this structure. Something as simple as getting the name of the declaration requires two locks:

pub fn name(&self) -> String {
    let guard = self.0.lock().unwrap();
    if let Some(ref dec) = *guard {
        return dec.read().unwrap().name();
    }
    String::from("notfound")
}

Frankly, this is ugly! I am by far a Rust master, but this seemed to be the best solution I and the other Rust users I consulted could come up with. The tools Rust provides are nice, but they introduce a lot of bloat.

There are a number of other issues I could detail about Rust, but I think I’ve spoken enough about it. All of this has made me rethink my choice of language.

Enter Zig

Zig is a promising new language that brings some brilliant ideas to the table. The biggest pros are that I can manage memory however I please with constructs providing the safety I want, and that I can call C code directly, without having to create or rely on bindings. Our complex declaration type from earlier becomes very simple: *?*T (I’ve never made a big Zig project but this looks correct). It is up to me to manage the thread safety, but this will make iterating designs far quicker. I believe this will make developing the compiler a much better experience.

This will make the compiler’s stability depend on the Zig compiler’s stability, but this won’t be an issue dow the line.

Bindings

I want this language to target many CPUs and many GPUs. Luckily, LLVM can handle the first part very well. It covers all of the modern ISAs and then some. For the GPU, I was planning on targetting SPIR-V, CUDA, and maybe OpenCL. Targetting all of these backends means using bindings for every single one. rspirv for SPIR-V, inkwell or llvm-sys for LLVM, libcuda for CUDA, etc. This adds many more dependencies into the tree that don’t need to be there. inkwell providesa safe abstraction layer on top of the LLVM C API. The main problem with this is that the layer adds bloat, and this is another thing we have to keep updated. inkwell only supports up to LLVM 8, so I can’t use the lovely new matrix intrinsics added in LLVM 10! The same goes for all of these other dependencies.

With Zig, things change dramatically. How do you include LLVM in Zig?

pub const llvm = @cImport({
    @cInclude("llvm-c/Core.h");
    @cInclude("llvm-c/ExecutionEngine.h");
    @cInclude("llvm-c/Target.h");
    @cInclude("llvm-c/TargetMachine.h");
    @cInclude("llvm-c/Analysis.h");
    @cInclude("llvm-c/BitWriter.h");
});

Boom. Bam. Done. Complete. Game Over. Ship it.

As long as the library has a C API, which many libraries do, you can directly use it in Zig.

MLIR

MLIR, which stands for “Multi-Level Intermediate Representation”, is a promising new backend which is now under the LLVM umbrella. It boasts a “reusable and extensible compiler infrastructure” which aims to ease the compilation of different target platforms, namely CPUs and GPUs, which is exactly what I am wanting! MLIR is still in the works, and I still have a lot to learn before I get started with it, but I anticipate it will be a vital part of Alox in the near future. It doesn’t have a C API yet so some glue code might be required.

Design

I have talked a lot about the design for the compiler of Alox, but what about the language itself? Don’t worry, I’ve put quite a lot of thought behind the language too!

Actor Model

Having used a number of concurrency models, the Actor Model looks like the most pleasing. To start, you construct an Actor like you would a Struct or Class.

actor A {}

Then, you add behaviours to it. These behaviours act as concurrent functions that run in reference to the Actor.

actor A {
    behave foo() { }
}

Calling a behaviour is different from calling a normal function. Instead of the body being executed immediately, it is scheduled to run. The runtime manages the scheduling and execution of behaviours. These behaviours have some interesting characteristics:

Messages

Messages contain three parts: the id of the actor it’s for, the name of the behaviour to run, and the serialized argumets.

Let’s look at a more complex example.

actor A {
    behave ping(n: Int32, b: &B) {
        b.pong(n, &this)
    }
}

actor B {
    behave pong(n: Int32, a: &A) {
        a.ping(n, &this)
    }
}

actor Main {
    behave main() {
        let n = 2
        let a = new A()
        let b = new B()
        a.ping(n, &b)
    }
}

Main.main will create two new actors and start off the pinging and ponging. a.ping(n, &b) will queue a mesage to be sent that will send whatever the value of n is and the ID of the b actor. Every actor has an ID that is created by the scheduler. Let’s discuss how the scheduling works…

The Scheduler, Worker Threads, & Mailmen

The Scheduler is the big boy who controls everything is. When you start an Alox instance, you start The Scheduler. Each scheduler is its own universe that communicates with other schedulers. You can think of it as the process you see when you run top. You can run multiple schedulers on one device, but the proper usage would be to run them on different devices. They communicate over some protocol I haven’t designed yet going probably over TCP.

Schedulers spawn two important things: worker threads and mailmen. Mailmen put messages in the mailbox, and workers read the messages and execute the behaviours. Mailboxes act as the middle men. They’re generic multi-consumer multi-producer (MCMP) queues. Mailmen run on their own threads, there will probably be a few per scheduler, and every call to a behaviour will be a direct call to a mailman. Mailmen are also responsible for sending messages across the wire. If the ID in the message refers to an actor running under a different scheduler, the mailman is responsible for passing the message through his own scheduler to a mailman on the correct scheduler.

Worker threads consume a message from the mailbox and run the designated behaviour on the actor it refers to. This will run the AOT compiled code corresponding to the function with a pointer to the actor structure.

For example, a.ping(n, &this) will compile to instructions that tell the Mailman to send a message to the mailbox. Then, the worker thread will read the message and run the generated function A_ping(&a, b, &this).

Actors on the GPU

I’ve talked a bit about the actor structure, but how do we throw these things on the GPU?

First, the actors themselves won’t be on the GPU. The only code running on the GPU will be GPU specific functions, or kernels as they’re usually called. These are called by actors. Let’s run through an example (syntax / semantics not at all final):

actor B {
    behave pong(n: Int32, a: &A) {
        let x = n & 0xF0 >> 4
        let y = n & 0x0F
        let arr = [x, y]
        let newArr = process(&mut arr)
        let z = (newArr[0] << 4) | newArr[1]
        a.ping(z, &this)
    }

    fun(spirv) process(arr: &mut [Int32]) {
        for (x in arr) {
            x *= 17 + (2 * x)
        }
    }
}

Here we separate our n value from before into an array of integers which is passed to the process kernel. process has a special tag on it that says it will compile to SPIR-V instructions. (Again, none of these semantics are final, I’m just trying to get the ideas across. It might make more sense to say fun(gpu) or kernel and leave the implementation details to the compiler. &mut is obviously inspired by Rust and a borrow checker will probably not be used. The memory management model has not been designed yet.) When the worker thread runs ping, it will be instructed to run the compiled kernel and pass the results back to ping. The worker thread will be in charge of moving the data between the CPU and GPU.

Actor Functions and Fields

Functions hold a special place in Alox in that they have less restrictions than behaviours but they are private to the actor they are declared in. In the last example, process can be called at any time in any behaviour but it can not be called by any other actor. This is because functions can be passed references to structures that are in the local process’ stack or heap. arr here would be the local allocation not observable from outside actors.

Functions also hold the ability to modify fields of actors. I haven’t talked much about fields, but they work the same as fields in a struct would.

actor C {
    let x: Int32

    behave main() {
        this.x = 4
        foo()
    }

    fun foo() {
        this.x = 13
    }
}

The scheduler and worker threads guarantee there will only be one behaviour running on an actor at a time, meaning mutating a field of an actor is safe from any behaviour or function. foo in this example can’t be called from another actor because it modifies x. If we want a function to be usable by all actors, we can make it a top level declaration.

fun foo(): Int32 {
    return 13
}

actor C {
    let x: Int32

    behave main() {
        this.x = foo()
    }
}

foo can now be called from within any actor at any time.

Outro

If the ideas I have laid out here peak your interest, I would love to talk to you on the r/ProgrammingLanguages Discord Server! I have a lot of work to do, but I believe Alox can be a useful language that changes your perspective on what makes a programming language “fast”.