Deploying to EC2
This document covers how to run an Arroyo cluster on raw EC2 instances, which requires a good understanding of the Arroyo architecture. For an easier approach to running a production-quality Arroyo cluster, see the docs for running on top of Nomad. Kubernetes support is also coming soon.
Before starting this guide, follow the common setup steps in the deployment overview guide.
We don’t currently distribute binaries for Arroyo, so you will need to build them yourself. Follow the dev setup guide to learn how.
Running the migrations
As covered in the dev setup, you will need to run the database migrations on your prod database before starting the services.
We use refinery to manage migrations. To run the migrations on your database, run these commands from your checkout of arroyo:
$ cargo install refinery_cli
$ refinery setup # follow the directions, configuring for your prod database
$ refinery migrate -p arroyo-api/migrations
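The setup step writes a refinery.toml file describing your database connection. As a rough sketch of what that file contains (the field names follow refinery's config format; all values below are placeholders for your production database):
$ cat refinery.toml
[main]
db_type = "Postgres"
db_host = "prod-db.int"
db_port = "5432"
db_user = "arroyo"
db_pass = "<your password>"
db_name = "arroyo"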
Running the services
There are two options for running a distributed cluster: you can either use Arroyo’s built-in scheduler and nodes, or you can use Nomad. Nomad is currently the recommended option for production use cases. The Arroyo services themselves can be run via Nomad, or separately on VMs.
Arroyo Services
An Arroyo cluster consists of one or more arroyo-api processes and a single arroyo-controller process. These can be run however you like, on a single machine or on multiple machines. To achieve high availability at the API layer, you will need to run multiple instances behind a load balancer (such as an ALB).
The arroyo-api server exposes port 8000 publicly, serving the frontend page as well as the REST API.
If the API and controller are not running on the same machine, the API needs to be configured with the endpoint of the controller’s gRPC API via the CONTROLLER_ADDR environment variable. By default, the controller runs its gRPC API on port 9190. If the controller’s hostname is arroyo-controller.int, then the API would be configured with CONTROLLER_ADDR=http://arroyo-controller.int:9190.
Both arroyo-api and arroyo-controller additionally need to be configured with the database connection information via the following environment variables:
DATABASE_NAME
DATABASE_HOST
DATABASE_USER
DATABASE_PASSWORD
You will also need to configure the controller and compiler service with ARTIFACT_URL and CHECKPOINT_URL, as described in the overview.
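Putting this together, here is a minimal sketch of launching the services by hand, assuming the binaries built in the dev setup are present on each machine; the hostnames, credentials, and S3 URLs below are placeholders for your own:
# on every machine, point the services at the prod database
$ export DATABASE_NAME=arroyo DATABASE_HOST=prod-db.int \
    DATABASE_USER=arroyo DATABASE_PASSWORD=<your password>
# on the controller machine (the compiler service needs the same URLs)
$ export ARTIFACT_URL=s3://my-bucket/artifacts CHECKPOINT_URL=s3://my-bucket/checkpoints
$ ./arroyo-controller
# on each API machine
$ CONTROLLER_ADDR=http://arroyo-controller.int:9190 ./arroyo-api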
Prometheus
Prometheus is required for the Web UI metrics support. All Arroyo services run a Prometheus exporter on their admin HTTP port (8001 for the API, and 9191 for the controller) at /metrics.
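For example, you can verify that an exporter is up by curling the admin port on the relevant machine:
$ curl http://localhost:8001/metrics   # API
$ curl http://localhost:9191/metrics   # controller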
The workers rely on the Prometheus pushgateway to produce metrics. You will need to run a pushgateway instance on the nodes that run the Arroyo workers at the default endpoint of localhost:9091, and you will need to configure your Prometheus instance to scrape the pushgateways.
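As a sketch of that setup (the worker hostnames below are placeholders; honor_labels preserves the labels the workers push rather than overwriting them at scrape time):
# on each worker node; pushgateway listens on :9091 by default
$ ./pushgateway
# in prometheus.yml, scrape each worker node's pushgateway:
scrape_configs:
  - job_name: arroyo-pushgateway
    honor_labels: true
    static_configs:
      - targets: ['worker-1.int:9091', 'worker-2.int:9091']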
Schedulers
arroyo-node
Arroyo comes with a simple option for running clusters via the arroyo-node binary. This can be run however you choose (for example on raw EC2 instances or within Docker). The processes need to be configured with the controller’s gRPC endpoint via CONTROLLER_ADDR, for example CONTROLLER_ADDR=http://arroyo-controller.int:9190.
As mentioned above, you will also need to run a Prometheus pushgateway on the nodes that run the workers.
To use the arroyo-node scheduler, the controller should be run with SCHEDULER=node.
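A minimal sketch, assuming the arroyo-node binary is on each worker machine and the controller’s hostname is arroyo-controller.int:
# on the controller
$ SCHEDULER=node ./arroyo-controller
# on each worker node (with a local pushgateway on :9091)
$ CONTROLLER_ADDR=http://arroyo-controller.int:9190 ./arroyo-node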
Nomad
Nomad is well supported as a scheduler and provides a number of benefits, including a robust foundation, a web UI, and resource isolation via cgroups. It is the recommended option for production use.
To use Nomad as the scheduler, start the controller with SCHEDULER=nomad. The controller will also need to be configured with the Nomad API endpoint via the NOMAD_ENDPOINT environment variable. This can point to either a single Nomad server, or a load balancer that points to multiple servers for high availability.
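For example (the endpoint below is a placeholder for your Nomad servers or their load balancer; 4646 is Nomad’s default HTTP API port):
$ SCHEDULER=nomad NOMAD_ENDPOINT=http://nomad.int:4646 ./arroyo-controller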