jadon

polyglot plt hacker

Home · Blog · GitHub · Twitter · Email


Modules, Packages, & Versions

Published Sunday, June 10, 2018 · 1863 words, 10 minute read.

Package management is a difficult problem to solve for programming languages. We’ve already seen disasters like NPM, where the starter template for a project can be upwards of 30 MB, while a “Hello World” executable in C can be around 4 KB. Size aside, the number of garbage packages on NPM is endless. It’s absolutely insane that a ridiculously simple package like left-pad could be pulled down and break everything downstream. (The Java ecosystem sure as hell doesn’t have this problem.)

What problems should a new package manager set out to solve, and what can we do to solve them? Let’s work through a few.


Versioning

Elm is a great language with a great solution to versioning. Instead of leaving versioning to incompetent humans, the package manager looks at a package’s public API and bumps the version according to the SemVer guidelines.

The rules are:

Many people use version numbers in different ways, making it hard to give reliable version bounds in your own package. With elm-package, versions are determined based on API changes.

  • Versions all have exactly three parts: MAJOR.MINOR.PATCH

  • All packages start with initial version 1.0.0

  • Versions are incremented based on how the API changes:

    • PATCH - the API is the same, no risk of breaking code
    • MINOR - values have been added, existing values are unchanged
    • MAJOR - existing values have been changed or removed
  • elm-package will bump versions for you, automatically enforcing these rules

This completely takes versioning out of the hands of maintainers and helps promote stable APIs.
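Elm’s rule set is mechanical enough to sketch in a few lines. Here’s a hedged Python sketch, modeling a public API as a dict of names to type signatures — an assumption for illustration, not how Elm’s tooling actually represents types:

```python
# Elm-style automatic version bumping: diff the old and new public APIs
# and pick the SemVer bump that the rules above require.

def next_version(version, old_api, new_api):
    major, minor, patch = version
    # MAJOR: an existing value was changed or removed
    removed_or_changed = any(
        name not in new_api or new_api[name] != sig
        for name, sig in old_api.items()
    )
    # MINOR: new values were added
    added = any(name not in old_api for name in new_api)
    if removed_or_changed:
        return (major + 1, 0, 0)
    if added:
        return (major, minor + 1, 0)
    # PATCH: the API is identical
    return (major, minor, patch + 1)

old = {"getX": "A -> Int"}
new = {"getX": "A -> Int", "getY": "A -> Int"}
print(next_version((1, 0, 0), old, new))  # -> (1, 1, 0), a MINOR bump
```

The maintainer never picks a number; the tool derives it from the diff, which is exactly what makes the rules enforceable.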

Personally, I would allow users to create packages before 1.0.0, with looser version bump guidelines until the API stabilizes.

This would allow users to distribute packages that don’t have stable APIs yet. Again, this would only be for packages that haven’t reached 1.0.0. After 1.0.0, package version bumping would follow SemVer.


Distribution

Instead of having one giant package repository or only allowing links from GitHub, the package manager should accept any tarball. Here’s how one would add a dependency:

{
    "group"  : "phase", 
    "name"   : "right-pad", 
    "version": "1.5.6", 
    "hash"   : "561feb4504dc739a59557d1ffa7e14f7"
}

Global namespace package names are a terrible idea, as we know from the NPM Kik situation. (Come on, who the hell thought that was a good idea? I don’t mean to offend anyone, but there are so many awful design decisions in NPM.) You might notice an important addition: the hash. This handles our security problem.

When we download the tarball from the specified provider, we compute its hash and make sure it matches the one declared in the dependency.
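That verify step is tiny. Here’s a minimal Python sketch; the 32-hex-digit hashes above look like MD5, so that’s what I check here, though a real tool would want SHA-256:

```python
# Verify a downloaded tarball against the hash declared in the dependency.
import hashlib

def verify_tarball(data: bytes, expected_hash: str) -> bool:
    return hashlib.md5(data).hexdigest() == expected_hash

# A tampered or truncated download fails the check and is rejected.
data = b"example tarball bytes"
print(verify_tarball(data, hashlib.md5(data).hexdigest()))  # True when intact
```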

Providers

Providers like GitHub or GitLab should be supported by default, but users should also be able to host packages on their own sites. Here’s what that could look like (I’m not advocating for JSON, but it’s quick to type):

{
    "dependencies": [
        {
            "group"   : "phase",
            "name"    : "right-pad",
            "version" : "1.5.6",
            "hash"    : "561feb4504dc739a59557d1ffa7e14f7",
            "provider": "github"
        },
        {
            "group"   : "google",
            "name"    : "guava",
            "version" : "2.4.5",
            "hash"    : "e4da3b7fbbce2345d7772b0674a318d5",
            "provider": "https://opensource.google.com/packages/"
        }
    ]
}

In the second example, the “provider url” would resolve to something like https://opensource.google.com/packages/guava/2.4.5/e4da3b7fbbce2345d7772b0674a318d5.tgz, while the first example would find the tarball through GitHub Releases.
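Resolution could be sketched like this in Python. The URL templates, including the GitHub Releases shape, are assumptions for illustration:

```python
# Resolve a dependency entry to a tarball URL. Known providers get a
# template; anything else is treated as a base URL to append
# name/version/hash path components to.

KNOWN_PROVIDERS = {
    "github": "https://github.com/{group}/{name}"
              "/releases/download/{version}/{hash}.tgz",
}

def tarball_url(dep):
    provider = dep["provider"]
    template = KNOWN_PROVIDERS.get(
        provider,
        provider.rstrip("/") + "/{name}/{version}/{hash}.tgz",
    )
    return template.format(**dep)

print(tarball_url({
    "group": "google", "name": "guava", "version": "2.4.5",
    "hash": "e4da3b7fbbce2345d7772b0674a318d5",
    "provider": "https://opensource.google.com/packages/",
}))
# -> https://opensource.google.com/packages/guava/2.4.5/e4da3b7fbbce2345d7772b0674a318d5.tgz
```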

Providing the hash secures our download, and the automatic version bumping guarantees that our code can keep using the API the same way.


Size

What can we do about size? A lot of the problem has to do with users: having 60 dependencies on insanely small packages will only increase size. Shipping packages as compressed tarballs helps mitigate this, and a cache in the user’s home directory keeps us from downloading the same package multiple times. (Still looking at you, NPM. Seriously, who the hell thought a local node_modules was a good idea? WHO?) The cache would contain one tarball per group, name, and version, named by its hash.

When you download dependencies, the hashes are checked, and for release builds all the hashes can be checked again to make sure your system hasn’t been compromised.
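The whole cache flow — download on a miss plus hash re-checking on every use — could be sketched as follows. The layout and paths here are hypothetical:

```python
# A home-directory cache: one verified tarball per group/name/version,
# named by its hash, fetched from the provider only on a cache miss.
import hashlib
from pathlib import Path

def cached_tarball(group, name, version, expected_hash, fetch,
                   cache_dir=None):
    """Return verified tarball bytes, calling fetch() only on a cache miss."""
    root = Path(cache_dir) if cache_dir else Path.home() / ".lang" / "cache"
    path = root / group / name / version / (expected_hash + ".tgz")
    if path.exists():
        data = path.read_bytes()
    else:
        data = fetch()  # download from the provider
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    # re-verify on every use, so a tampered cache is caught too
    if hashlib.md5(data).hexdigest() != expected_hash:
        raise RuntimeError(f"{group}/{name} {version}: hash mismatch")
    return data
```

Because the cache key includes the hash, two packages with the same name and version but different contents can never collide.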


Modules

Now that we have a solution for distributing code, it’s time to deal with the code itself. My theoretical language (I’ll probably implement this soon) will have module files containing function and type declarations. I’ll use C syntax for now.

// a.lang
struct A {
    int x;
    int y;
}

int getX(A* a) {
    return a->x;
}

// b.lang
struct B {
    int w;
    int z;
}

int getW(B* b) {
    return b->w;
}

Now that we’ve defined an environment, we can think about ways to retrieve these values from a separate file.

One way to retrieve values is to explicitly state where an identifier comes from. Here, I prefix identifiers from other modules with the module name.

// c.lang

// allows access through :: operator
import a;
import b;

int addXAndW(a::A* a, b::B* b) {
    return a->x + b->w;
}

This solution introduces an import keyword and a :: operator. Two characters is rather verbose for an operator that would be used all over the place.

Another solution is to import everything into the global namespace.

// c.lang

// imports everything into the global namespace
import a;
import b;

int addXAndW(A* a, B* b) {
    return a->x + b->w;
}

Personally, I dislike this because it clutters up the global namespace, and importing a big library, like the standard library, would take up a lot of names. Worse, global imports make shadowing possible:

// c.lang

// imports everything into the global namespace
import a;
import b;

// *our* library's version of A that adds specific things
struct A {
    int x;
    int y;
    int newField;
}

A* convert(A* a) {
    // .... wait... what?
    // the name has been shadowed!
}

Instead of having to choose, I propose two ways of importing.

// c.lang

// imports everything from b into the global namespace
// and allows us to access things from a using ::
reference a; // (keyword name temporary)
import b;

// *our* library's version of A that adds specific things
struct A {
    int x;
    int y;
    int newField;
}

a::A* convert(A* a) {
    a::A* otherA = new a::A();
    otherA->x = a->x;
    otherA->y = a->y;
    return otherA;
}

This lets us pull one module into the global namespace while still referencing the other explicitly.

Namespace aliasing would also be a good idea. Let’s say our a module is in std::io::test::a. This is far too much code to type every time we want to reference the module.

// c.lang

reference std::io::test::a as a; // (keyword name temporary)
import b;

// ...

a::A* convert(A* a) {
    // ...
}

As for the syntax, reference and :: are rather verbose. We could get rid of global importing entirely and make module referencing the only way to import, freeing up the import keyword for other uses. As for ::, we could replace it with a single character like # or @. I like a@A because it reads as “the definition of A is located at a.”

// c.lang

import std@io@test@a as a;

int getX(a@A* a) {
    return a->x;
}

It’s a little esoteric, but we’re getting somewhere. The exact symbol doesn’t matter as much as the semantics do. Rust uses use and ::. use feels a little weird to me, but I guess that’s because I’m used to JVM languages. I do like the semantics of use, though.

I hope you enjoyed my thoughts on modules, packages, and versioning. These ideas will probably be implemented in whatever language I decide to make next.
