Modules, Packages, & Versions
Discussion of how we can improve dependency management.
2018·06·10
· language design
Package management is a difficult problem to solve for programming languages. We’ve already seen disasters like NPM, where the starter template for a project can be upwards of 30 MB, while a “Hello World” executable in C can be around 4 KB. Size aside, the number of garbage packages on NPM is endless. It’s absolutely insane that a ridiculously simple package like left-pad could be pulled down and suddenly everything breaks. (The Java ecosystem sure as hell doesn’t have this problem.)
What are some problems that we can solve when creating a new package manager?
- Versioning (Go: lol no versions)
- Distributed (No relying on GitHub URLs for everything)
- Security (Do we know what we’re downloading?)
- Size (JS is a good example of how not to do it)
What are some things we can do to solve these problems?
Versioning
Elm has an elegant solution to versioning. Instead of leaving version numbers to incompetent humans, the package manager looks at a package’s public API and bumps the version according to the SemVer guidelines.
The rules are:

Many people use version numbers in different ways, making it hard to give reliable version bounds in your own package. With elm-package, versions are determined based on API changes.

Versions all have exactly three parts: MAJOR.MINOR.PATCH. All packages start with initial version 1.0.0, and versions are incremented based on how the API changes:

- PATCH: the API is the same, no risk of breaking code
- MINOR: values have been added, existing values are unchanged
- MAJOR: existing values have been changed or removed

elm-package will bump versions for you, automatically enforcing these rules.
This completely takes versioning out of the hands of maintainers and helps promote stable APIs.
Personally, I would allow users to create packages before 1.0.0 with the following version bump guidelines:

- PATCH: the API is the same
- MINOR: the API changes in any way
- MAJOR: always 0 in this scenario

This would allow users to distribute packages that don’t have stable APIs yet. Again, this would only be for packages that haven’t reached 1.0.0. After 1.0.0, package version bumping would follow SemVer.
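To make these rules concrete, here is a rough Python sketch of the bump decision. The next_version function and the “map of exported names to type signatures” view of an API are my own illustration, not how Elm actually models things; Elm diffs the real exposed values and types.

# Hypothetical sketch of the version-bump rules described above.
# An "API" here is just a map from exported names to their type signatures.
def next_version(current, old_api, new_api):
    major, minor, patch = current
    changed_or_removed = any(name not in new_api or new_api[name] != sig
                             for name, sig in old_api.items())
    added = any(name not in old_api for name in new_api)

    if major == 0:
        # Pre-1.0.0: MAJOR is always 0, and any API change bumps MINOR.
        if changed_or_removed or added:
            return (0, minor + 1, 0)
        return (0, minor, patch + 1)

    # Post-1.0.0: standard SemVer, enforced from the API diff.
    if changed_or_removed:
        return (major + 1, 0, 0)
    if added:
        return (major, minor + 1, 0)
    return (major, minor, patch + 1)

# Adding getY to a 1.2.3 package is a MINOR bump: (1, 3, 0)
print(next_version((1, 2, 3),
                   {"getX": "A* -> int"},
                   {"getX": "A* -> int", "getY": "A* -> int"}))

The important property is that the maintainer never types a version number; the tool derives it from the API diff.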
Distribution
Instead of having one giant package repository or only allowing links from GitHub, the package manager should accept any tarball. Here’s how one would add a dependency:
{
    "group"  : "phase",
    "name"   : "right-pad",
    "version": "1.5.6",
    "hash"   : "561feb4504dc739a59557d1ffa7e14f7"
}
A single global namespace for package names is a terrible idea, as we know from the NPM Kik situation. You might notice an important addition: the hash. This handles our problem of security.
Let’s assume our tarball has these files:

- meta.json: provides information about the package, equivalent to the JSON above
- src: the package contents, either compiled objects or source files, depending on the language
We then download the tarball from the specified provider, hash it, and make sure the hash matches the one in our dependency entry.
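A minimal sketch of that check in Python could look like the following. The function name and chunk size are arbitrary, and the hash algorithm is left configurable; the example hashes in this post happen to be 32 hex characters (MD5-length), but a real tool would want something stronger like SHA-256.

import hashlib

# Hash the downloaded tarball and compare it to the hash in the dependency entry.
def verify_tarball(path, expected_hash, algorithm="md5"):
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read in chunks so large tarballs don't have to fit in memory.
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest() == expected_hash

# Refuse to install anything whose hash doesn't match.
if not verify_tarball("right-pad-1.5.6.tgz", "561feb4504dc739a59557d1ffa7e14f7"):
    raise SystemExit("hash mismatch: refusing to install phase/right-pad")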
Providers
Providers like GitHub or GitLab should be supported by default, but users should also be able to host packages on their own sites. Here’s what it could look like (I am not advocating for JSON, but it’s quick to type).
{
    "dependencies": [
        {
            "group"   : "phase",
            "name"    : "right-pad",
            "version" : "1.5.6",
            "hash"    : "561feb4504dc739a59557d1ffa7e14f7",
            "provider": "github"
        },
        {
            "group"   : "google",
            "name"    : "guava",
            "version" : "2.4.5",
            "hash"    : "e4da3b7fbbce2345d7772b0674a318d5",
            "provider": "https://opensource.google.com/packages/"
        }
    ]
}
In the second example, the “provider url” would resolve to something like https://opensource.google.com/packages/guava/2.4.5/e4da3b7fbbce2345d7772b0674a318d5.tgz, while the first example would find the tarball through GitHub releases.
Providing the hash secures our download, and the automatic version bumping guarantees that our code can keep using the API the same way.
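As a sketch of how resolution might work, here is one way a dependency entry could be turned into a download URL. The {provider}/{name}/{version}/{hash}.tgz layout follows the example above; the GitHub releases URL is purely a guess at one possible convention, not anything GitHub prescribes.

# Hypothetical URL resolution for a dependency entry.
def resolve_url(dep):
    if dep["provider"] == "github":
        # Assumed convention: fetch a release asset from github.com/<group>/<name>.
        return ("https://github.com/{group}/{name}/releases/download/"
                "{version}/{name}-{version}.tgz").format(**dep)
    # Self-hosted providers follow the <name>/<version>/<hash>.tgz convention.
    return "{base}/{name}/{version}/{hash}.tgz".format(
        base=dep["provider"].rstrip("/"), **dep)

guava = {"group": "google", "name": "guava", "version": "2.4.5",
         "hash": "e4da3b7fbbce2345d7772b0674a318d5",
         "provider": "https://opensource.google.com/packages/"}
print(resolve_url(guava))
# https://opensource.google.com/packages/guava/2.4.5/e4da3b7fbbce2345d7772b0674a318d5.tgz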
Size
What can we do about size? A lot of the problem has to do with users: having 60 dependencies on insanely small packages will only increase size. Shipping packages as compressed tarballs helps mitigate this, and a cache in the user’s home directory keeps us from downloading the same package multiple times. Here’s an example of what the cache directory would contain:
- ~/.lang/packages/
  - phase/
    - right-pad/
      - phase_right-pad_1.5.6_561feb4504dc739a59557d1ffa7e14f7.tgz
      - phase_right-pad_1.5.7_03c7c0ace395d80182db07ae2c30f034.tgz
      - phase_right-pad_1.6.0_e358efa489f58062f10dd7316b65649e.tgz
  - google/
    - guava/
      - google_guava_2.4.5_e4da3b7fbbce2345d7772b0674a318d5.tgz
    - llvm-bindings/
      - google_llvm-bindings_1.0.0_9a1158154dfa42caddbd0694a4e9bdc8.tgz
When you download dependencies, their hashes will be checked, and for release builds the tool can check all the hashes again to make sure your system hasn’t been compromised.
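Here is a small Python sketch of that cache lookup. The ~/.lang/packages root and the group_name_version_hash.tgz filename scheme come straight from the listing above; the function name and everything else are assumptions.

import os

CACHE_ROOT = os.path.expanduser("~/.lang/packages")

# Build the cache path for a specific package version and hash.
def cached_tarball(group, name, version, digest):
    filename = "{}_{}_{}_{}.tgz".format(group, name, version, digest)
    return os.path.join(CACHE_ROOT, group, name, filename)

path = cached_tarball("phase", "right-pad", "1.5.6",
                      "561feb4504dc739a59557d1ffa7e14f7")
if not os.path.exists(path):
    # Only hit the network when this exact version + hash isn't cached yet.
    print("not cached, would download:", path)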
Modules
Now that we have a solution for distributing code, it’s time to deal with the code itself. My theoretical language will have module files containing function and type declarations. I’ll use C syntax for now.
// a.lang
struct A {
    int x;
    int y;
}

int getX(A* a) {
    return a->x;
}
// b.lang
struct B {
    int w;
    int z;
}

int getW(B* b) {
    return b->w;
}
Now that we’ve defined an environment, let’s think about ways to retrieve these values from a separate file.
One way to retrieve values is to explicitly state where an identifier comes from. Here, I prefix identifiers from other modules with the module name.
// c.lang
// allows access through :: operator
import a;
import b;

int addXAndW(a::A* a, b::B* b) {
    return a->x + b->w;
}
This solution introduces an import keyword and a :: operator. Two characters is rather verbose for an operator that would be used all over the place.
Another solution is to import everything into the global namespace.
// c.lang
// imports everything into the global namespace
import a;
import b;

int addXAndW(A* a, B* b) {
    return a->x + b->w;
}
Personally, I dislike this because it clutters up the global namespace, and importing a big library, like the standard library, would take up a lot of names.
// c.lang
// imports everything into the global namespace
import a;
import b;

// *our* library's version of A that adds specific things
struct A {
    int x;
    int y;
    int newField;
}

A* convert(A* a) {
    // .... wait... what?
    // the name has been shadowed!
}
Instead of having to choose, I propose two ways of importing.
// c.lang
// imports everything from b into the global namespace
// and allows us to access things from a using ::
reference a; // (keyword name temporary)
import b;

// *our* library's version of A that adds specific things
struct A {
    int x;
    int y;
    int newField;
}

a::A* convert(A* a) {
    a::A* otherA = new a::A();
    otherA->x = a->x;
    otherA->y = a->y;
    return otherA;
}
This allows us to pull things into the global namespace while still allowing explicit referencing.
Namespace aliasing would also be a good idea. Let’s say our a module is in std::io::test::a. This is far too much code to type every time we want to reference the module.
// c.lang
reference std::io::test::a as a; // (keyword name temporary)
import b;

// ...

a::A* convert(A* a) {
    // ...
}
As for the syntax, reference and :: are rather verbose. We could completely get rid of global importing and make this module referencing the only way to import, freeing up the import keyword for us to use. As for ::, we could replace it with one character like # or @. I like a@A because it says “the definition of A is located at a”.
// c.lang
import std@io@test@a as a;

int getX(a@A* a) {
    return a->x;
}
It’s a little esoteric, but we’re getting somewhere. The exact symbol doesn’t matter as much as the semantics do. Rust uses use and ::. use is a little weird, but I guess that’s because I’m used to JVM languages. I do like the semantics of use, though.
I hope you enjoyed my thoughts on modules, packages, and versioning. These will probably be implemented in whatever language I decide to make next.