How and why to use Nomad for orchestration at your startup
Container orchestrators take Docker containers plus configuration, and run them on servers. If you're OK being locked to a single cloud provider, most providers have a managed service which handles this for you.
At CodeDay, we need to run our own for a few reasons. In general, the cost is lower for our workloads. I've also found that managed offerings have usually been painful in the long run because the way they bill for usage doesn't make sense for all workloads. (In the last startup I ran, we overpaid by $5k/mo vs rolling our own, although that's less than SF rent, so as a YC and venture-backed startup it was fine I guess.)
One more reason that was unique to us: as a non-profit, we have a few thousand dollars a year of free credit for both AWS and Azure, plus free colocation in the Westin Building donated by Green House Data, and we wanted to take advantage of it all.
Why Not Kubernetes?
If you look at running your own infrastructure, the usual story you’ll hear about orchestrators is that there was a fight between Docker Swarm and Kubernetes (and a few weird options like Flynn), but at the end of the day, Kubernetes was the winner.
Kubernetes is… not ideal for a small team. I’m sure it’s a great platform for companies the size of Google or Slack, but it brings a lot of complexity: etcd, helm charts, ingress controllers, endpoints and endpoint slices, istio, four types of coordinators (scheduler, kube controller manager, cloud controller manager, and api server), and the list goes on.
Distributions like MicroK8s can help with setup, but our team would still need to understand everything that's going on, or we won't be able to tune it or troubleshoot problems.
When we were evaluating this, I talked with a few of our alums who've done Kubernetes at scale, and the recommendation was universal: do not do this without a full-time compute team, which is… impractical for a nonprofit. I recently heard the same advice from Coinbase and a few HN commenters.
Why Nomad (+Hashicorp’s Stack) is Better
If Kubernetes is at one end of the service-oriented spectrum, Nomad is at the other. Nomad is a single binary with no external dependencies, written in Go.
To use Nomad, you download the binary to a VM and run nomad server to set up the orchestrator. In most cases this VM can be very small; we use an Azure B1s for about $6/mo. If you want high availability, you can run three of them. (We do.)
Once your orchestrator is running, you download the same binary and run nomad agent on the worker VMs which run containers. There's a little bit of configuration, but our config files are on the order of 20 lines.
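To give a sense of how little configuration is involved, a minimal worker config might look something like this (the datacenter name, paths, and server address are made-up placeholders, not our actual setup):

```hcl
# /etc/nomad.d/client.hcl — a minimal, illustrative worker config
datacenter = "azure-west"
data_dir   = "/opt/nomad"

client {
  enabled = true
  # Point the worker at the server(s); this address is hypothetical.
  servers = ["10.0.0.10:4647"]
}
```

The server config is similarly short: swap the client block for a server block with a bootstrap_expect count.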
Containers in Nomad are part of "jobs," and grouped into "task groups" (groups of containers which will run on the same machine). A job can ask for more than one copy of a container and you can PATCH the count later, so you could write your own autoscaler (although Nomad has one built in).
You configure jobs by submitting JSON files, or anything that can compile down to JSON files. A lot of people use Hashicorp's HCL language, although I find it very verbose. You can submit these with the nomad CLI tool, or POST them directly to Nomad's API. (We wrote a utility which lets you write jobs in a docker-stack style YAML and submits the proper JSON.)
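For illustration, a minimal HCL job looks something like this (the job name, datacenter, image, and port are placeholders, not one of our real services):

```hcl
job "hello-web" {
  datacenters = ["azure-west"]

  group "web" {
    count = 2  # ask Nomad for two copies of this task group

    network {
      port "http" { to = 8080 }  # map a dynamic host port to 8080 in the container
    }

    task "server" {
      driver = "docker"
      config {
        image = "hashicorp/http-echo"  # any Docker image works here
        ports = ["http"]
      }
    }
  }
}
```

You'd submit this with nomad job run hello-web.nomad, and Nomad schedules the two copies onto whichever agents have room.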
You can see everything, make some modifications, view logs, and even open a terminal into a container from this fancy web UI:
Honestly, that’s about all there is to know to get started. There are two tutorials which are helpful if you’d like to give it a try:
More Helpful, Simple Services
Hashicorp has two 100% optional services which are also just as simple to install and configure. Both of them also integrate nicely with Nomad:
- Consul is a suite of tools that help services discover other services. If you also install Consul, you can tag ports on your containers with names, and access them from other containers by those names.
(Consul can also do some other things, like status checks, secure service-to-service routing, and a distributed key-value store.)
We use Consul so that we can, e.g., tag our Elastic containers' ports with elastic, and then use elastic.service.consul in connection strings to automatically find an available Elastic host. (This sadly required a bit more setup than I would have liked.)
We also use one of those advanced features — status checks — to wait for services to fully initialize after a deploy before switching over live traffic.
- Vault is a secret management system. At the simplest level, you can use it to store encrypted K/V pairs, which you can then inject into config files or ENV variables in Nomad jobs.
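From the job's side, the Consul integration boils down to a service block inside the task group. A sketch, with an illustrative health-check path (the name and check details here are hypothetical, not our exact config):

```hcl
service {
  name = "elastic"  # becomes elastic.service.consul in Consul DNS
  port = "db"       # references a named port from the group's network block

  # Status check: Nomad/Consul only route traffic here once this passes.
  check {
    type     = "http"
    path     = "/_cluster/health"
    interval = "10s"
    timeout  = "2s"
  }
}
```

The check block is also what lets a deploy wait for a service to report healthy before old containers are drained.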
We actually run the Consul and Vault services on the same B1s VMs as the Nomad servers. We've never really gotten above 50% CPU or RAM utilization, even at peak times when we've had two dozen agents and almost a hundred containers starting at the same time for CodeCup, our series of hour-long high school cybersecurity challenges.
The last service we find useful is not made by Hashicorp: Traefik. It's a load balancer that can look up information about running containers in Consul. It can also talk to Let's Encrypt and do TLS termination, plus a bunch of more complicated things.
We just add a tag to each port in the job description to tell Traefik to route traffic for a specific domain to that port on those containers. Honestly, Traefik became overly complicated once they decided to do an enterprise offering, and I wish there were a simpler alternative.
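Those routing tags live in the same service block Consul uses. A sketch using Traefik v2's label syntax (the service name, domain, and certificate resolver name are assumptions for illustration):

```hcl
service {
  name = "web"
  port = "http"

  tags = [
    "traefik.enable=true",
    # Route requests for this (hypothetical) domain to these containers:
    "traefik.http.routers.web.rule=Host(`app.example.com`)",
    # Have Traefik fetch a TLS cert via a resolver configured for Let's Encrypt:
    "traefik.http.routers.web.tls.certresolver=letsencrypt",
  ]
}
```

Traefik watches the Consul catalog, so new containers with these tags start receiving traffic without touching the load balancer's own config.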
Growing Into Advanced Features
The thing I love about Nomad is that it was very easy to get started with, but it’s been able to do most of the advanced things we’ve needed over time. It’s honestly a breath of fresh air compared to a lot of modern tooling, and it reminds me of when I was first learning tools like Vim.
Here are a few features you can look into if you need them:
- You can start containers on a crontab-style schedule (useful for backups), and run "system" jobs which run on all nodes.
- You can use Nomad to run native processes, QEMU VMs, or… JARs (?)
- You can tag Nomad agents, and use those tags to specify placement affinities, e.g. to colocate network-sensitive services, add datacenter redundancy, or deploy services at POPs.
- “CSI” allows Nomad to automatically talk to your cloud provider to mount/unmount/move volumes on hosts, for things like databases requiring high-performance persistent storage. (It’s compatible with all Kubernetes CSI providers.)
- You can configure autoscaling policies for your containers using Prometheus metrics. (And, of course, Nomad can publish metrics so you can scale your servers up and down.)
- By default anyone with access to port 4646 on your Nomad servers can modify jobs, but you can configure detailed ACL tokens to control access.
- Consul Connect allows you to define specific service-to-service connections.
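As one example of growing into these features, the crontab-style scheduling above is a single periodic stanza on a batch job. A sketch (the job name, schedule, and image are illustrative, not one of our real backup jobs):

```hcl
job "nightly-backup" {
  datacenters = ["azure-west"]
  type        = "batch"  # run to completion rather than stay up

  periodic {
    cron             = "0 3 * * *"  # every night at 3am
    prohibit_overlap = true         # skip a run if the previous one is still going
  }

  group "backup" {
    task "dump" {
      driver = "docker"
      config {
        image   = "postgres:15"
        command = "pg_dump"
        # args omitted; a real job would pull credentials from Vault
      }
    }
  }
}
```

Switching type to "system" instead would run the task group on every eligible node, which is the same mechanism behind the "system" jobs mentioned above.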
(Good time to mention that this is not a sponsored post, just our experience.)
We have more than 50 services running at this point to support our programs, and everyone has been super happy with the choice. One of our interns even recently deployed this stack on his Raspberry Pis for some projects.
Over time we’ve developed some more tooling around Nomad, such as automated provisioning of new machines using Cloud-Init and Puppet, a utility to send Discord notifications for deployment status updates, and a web UI for managing deployed versions and scaling out: