Container orchestrators take Docker containers plus configuration, and run them on servers. If you're OK with being locked to a single cloud provider, most providers have a managed service which handles this for you.
At CodeDay, we need to run our own for a few reasons. In general, the cost is lower for our workloads. I've also found that managed offerings have usually been painful in the long run, because the way they bill for usage doesn't make sense for all workloads. (At the last startup I ran, we overpaid by $5k/mo vs rolling our own, although that's less than SF rent, so as a YC and venture-backed startup it was fine I guess.)
One more reason that was unique to us: as a non-profit, we have a few thousand dollars a year of free credit for both AWS and Azure, plus free colocation in the Westin Building donated by Green House Data, and we wanted to take advantage of it all.
If you look at running your own infrastructure, the usual story you’ll hear about orchestrators is that there was a fight between Docker Swarm and Kubernetes (and a few weird options like Flynn), but at the end of the day, Kubernetes was the winner.
Kubernetes is… not ideal for a small team. I'm sure it's a great platform for companies the size of Google or Slack, but it brings a lot of complexity: etcd, Helm charts, ingress controllers, endpoints and endpoint slices, Istio, four types of coordinators (the scheduler, kube-controller-manager, cloud-controller-manager, and API server), and the list goes on.
Distributions like MicroK8s can help with setup, but our team still needs to understand everything that's going on, or we won't be able to tune it or troubleshoot problems.
When we were evaluating this, I talked with a few of our alums who've run Kubernetes at scale, and the recommendation was universal: do not do this without a full-time compute team, which is… impractical for a nonprofit. I recently heard the same advice from Coinbase and a few HN commenters.
If Kubernetes is at one end of the service-oriented spectrum, Nomad is at the other. Nomad is a single binary with no external dependencies, written in Go.
To use Nomad, you download the binary to a VM and run `nomad server` to set up the orchestrator. In most cases this VM can be very small — we use an Azure B1s for about $6/mo. If you want high availability, you can run three of them. (We do.)
Once your orchestrator is running, you download the same binary and run `nomad agent` on the worker VMs which run containers. There's a little bit of configuration, but our config files are on the order of 20 lines.
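To give a sense of the size, here's a minimal sketch of the two config files — the datacenter name, paths, and server address are illustrative placeholders, not our actual values:

```hcl
# server.hcl — run with `nomad agent -config=server.hcl`
datacenter = "dc1"        # illustrative datacenter name
data_dir   = "/opt/nomad"

server {
  enabled          = true
  bootstrap_expect = 3    # three servers for high availability; 1 for a single node
}
```

```hcl
# client.hcl — run with `nomad agent -config=client.hcl` on each worker VM
datacenter = "dc1"
data_dir   = "/opt/nomad"

client {
  enabled = true
  servers = ["10.0.0.10:4647"]  # a Nomad server's address (placeholder IP); 4647 is the default RPC port
}
```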
Containers in Nomad are part of “jobs,” and grouped in “task groups” (groups of containers which will run on the same machine). A job can ask for more than one copy of a container and you can PATCH them later, so you could write your own autoscaler (although Nomad has one built in).
You configure jobs by submitting JSON files, or anything that can compile down to JSON. A lot of people use HashiCorp's HCL language, although I find it very verbose. You can submit these with the `nomad` CLI tool, or POST them directly to Nomad's API. (We wrote a utility which lets you write jobs in a `docker-stack`-style YAML and submits the proper JSON.)
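For a feel of the job → task group → task hierarchy, here's a minimal job sketch in HCL — the job name, image, and resource numbers are made up for illustration:

```hcl
job "web" {
  datacenters = ["dc1"]

  group "app" {
    count = 2  # ask Nomad for two copies of this task group

    network {
      port "http" {
        to = 80  # container port to expose
      }
    }

    task "server" {
      driver = "docker"

      config {
        image = "nginx:1.25"  # illustrative image
        ports = ["http"]
      }

      resources {
        cpu    = 100 # MHz
        memory = 128 # MB
      }
    }
  }
}
```

Submitting it is one command — `nomad job run web.nomad.hcl` — or an equivalent POST of the JSON form to the HTTP API.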
You can see everything, make some modifications, view logs, and even open a terminal into a container from this fancy web UI:
Honestly, that’s about all there is to know to get started. There are two tutorials which are helpful if you’d like to give it a try:
Hashicorp has two 100% optional services which are just as simple to install and configure, and both integrate nicely with Nomad: Consul, which provides service discovery and a key-value store, and Vault, which manages secrets. For example, you can register a service named `elastic` with Consul, and then use `elastic.service.consul` in connection strings to automatically find an available Elastic host. (This sadly required a bit more setup than I would have liked.)
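As an illustrative sketch (the group, image, and port names here are assumptions, not our actual config), registering a service with Consul from a Nomad job is just a `service` stanza:

```hcl
group "search" {
  network {
    port "http" {
      to = 9200  # Elasticsearch's default HTTP port inside the container
    }
  }

  service {
    name = "elastic"  # becomes resolvable as elastic.service.consul via Consul DNS
    port = "http"
  }

  task "elastic" {
    driver = "docker"

    config {
      image = "elasticsearch:7.17.9"  # illustrative image tag
      ports = ["http"]
    }
  }
}
```

Anything that resolves DNS through Consul can then reach a healthy instance at `elastic.service.consul`.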
We actually run the Consul and Vault services on the same B1s VMs as the Nomad servers, and we’ve never really gotten above 50% CPU or RAM utilization, even at peak times when we’ve had two dozen agents and almost a hundred containers starting at the same time for CodeCup, our series of hour-long high school cybersecurity challenges.
The last service we find useful is not made by Hashicorp: Traefik. It's a load balancer that can look up information about running containers in Consul. It can also talk to Let's Encrypt and do TLS termination, plus a bunch of more complicated things.
We just add a tag to each port in the job description to tell Traefik to route traffic for a specific domain to that port on those containers. Honestly, Traefik became overly complicated once they decided to do an enterprise offering, and I wish there were a simpler alternative.
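As a sketch of what those tags look like (the router name, domain, and cert resolver name are placeholders, and this assumes Traefik's Consul catalog provider is enabled), they live on the `service` stanza:

```hcl
service {
  name = "myapp"  # placeholder service name
  port = "http"

  tags = [
    "traefik.enable=true",
    # route requests for this (placeholder) domain to these containers
    "traefik.http.routers.myapp.rule=Host(`myapp.example.com`)",
    # terminate TLS using a certificate resolver configured in Traefik (name assumed)
    "traefik.http.routers.myapp.tls.certresolver=lets-encrypt",
  ]
}
```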
The thing I love about Nomad is that it was very easy to get started with, but it’s been able to do most of the advanced things we’ve needed over time. It’s honestly a breath of fresh air compared to a lot of modern tooling, and it reminds me of when I was first learning tools like Vim.
Here are a few features you can look into if you need them:
(Good time to mention that this is not a sponsored post, just our experience.)
We have more than 50 services running at this point to support our programs, and everyone has been super happy with the choice. One of our interns even recently deployed this stack on his Raspberry Pis for some projects.
Over time we’ve developed some more tooling around Nomad, such as automated provisioning of new machines using Cloud-Init and Puppet, a utility to send Discord notifications for deployment status updates, and a web UI for managing deployed versions and scaling out: