VMs and bare-metal
Setting up an Arroyo cluster on VMs or bare-metal servers
This document covers how to run an Arroyo cluster on Linux VMs (such as EC2 instances) or bare-metal servers. This requires a good understanding of the Arroyo architecture. For an easier approach to running a production-quality Arroyo cluster, see the docs for running on top of Kubernetes.
Before starting this guide, follow the common setup steps in the deployment overview guide.
There are two options for running Arroyo on VMs: you can run the arroyo binary directly, or use Docker.
Binaries for release versions are available for Linux (x86 and ARM) and macOS (M1) on the GitHub Releases page, or you can build your own binaries by following the dev guide.
If building your own binary, make sure to use the --release flag when calling cargo build. You may also want to build on a machine with the same CPU as you plan to deploy with and use the env var RUSTFLAGS='-C target-cpu=native' to get the best performance at the cost of portability to CPUs with different micro-architectures.
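Putting the two build settings together, a build might look like the following sketch. It assumes a Rust toolchain is installed and that the command is run from the root of an Arroyo source checkout; the guard only exists so the snippet does nothing harmful when run elsewhere.

```shell
# Tune code generation for the build machine's CPU.
# The resulting binary may not run on CPUs with a different micro-architecture.
export RUSTFLAGS='-C target-cpu=native'

# Assumption: the current directory is an Arroyo source checkout.
if [ -f Cargo.toml ] && command -v cargo >/dev/null 2>&1; then
  cargo build --release
else
  echo "skipping build: run this from an Arroyo checkout with cargo installed" >&2
fi
```

Cargo writes release artifacts under target/release, so that is where to look for the resulting binary.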
Alternatively, you can run the arroyo Docker image, ghcr.io/arroyosystems/arroyo:latest.
Running the migrations
As covered in the dev setup, you will need to run the migrations on your database before starting the cluster. Currently, only Postgres is supported.
By default, Arroyo expects a database called arroyo, reachable at localhost:5432 by a user account arroyo with password arroyo. These can be configured via the following environment variables:
DATABASE_NAME
DATABASE_HOST
DATABASE_PORT
DATABASE_USER
DATABASE_PASSWORD
$ arroyo-bin migrate
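As an illustration, the following sketch overrides the defaults for a database running on another host before running the migrations. The hostname and password are placeholders, not values from any real deployment, and the guard simply skips the migration when arroyo-bin is not on the PATH.

```shell
# Placeholder connection details; replace with those of your Postgres instance.
export DATABASE_NAME=arroyo
export DATABASE_HOST=db.internal.example.com
export DATABASE_PORT=5432
export DATABASE_USER=arroyo
export DATABASE_PASSWORD=change-me

# Run the migrations against the database configured above.
if command -v arroyo-bin >/dev/null 2>&1; then
  arroyo-bin migrate
else
  echo "arroyo-bin not found on PATH; skipping migrations" >&2
fi
```

The same variables need to be present in the environment of the cluster process later, so it can be convenient to keep them in one file that both steps load.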
Running the cluster
The Arroyo cluster can run in two modes on VMs: either as a single-node cluster using the process scheduler, or as a distributed cluster using the node scheduler. The former is simpler, but cannot scale horizontally.
Additionally, you may decide to run all Arroyo services together in a single process, or as separate processes for high availability. This guide will only cover the former; for guidance on more complex deployments please reach out to the dev team on Discord or at support@arroyo.systems.
Configuration
In addition to the database configs described in the migration section, there are several other configuration options that you may wish to set via environment variables, including CHECKPOINT_URL and ARTIFACT_URL. Note that for a distributed cluster, those must be set to remote storage that is accessible by all nodes in the cluster.
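For a distributed cluster, the storage settings might look like the sketch below. The bucket name is hypothetical, and the use of s3:// URLs is an assumption here; check the configuration reference for the URL schemes your Arroyo version accepts.

```shell
# Hypothetical bucket; both locations must be readable and writable
# from every node in the cluster, not just the control plane host.
export CHECKPOINT_URL='s3://my-arroyo-bucket/checkpoints'
export ARTIFACT_URL='s3://my-arroyo-bucket/artifacts'

echo "checkpoints: $CHECKPOINT_URL"
echo "artifacts:   $ARTIFACT_URL"
```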
Arroyo services
The entire Arroyo control plane can be run as a single process with the cluster subcommand:
$ arroyo-bin cluster
For high availability, this should be managed by a process manager like systemd.
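A minimal sketch of a systemd unit for the control plane is shown below. The binary path, the unit name, and the environment file location are all assumptions chosen for illustration; the script writes the unit to a scratch directory so it can be reviewed before being installed.

```shell
# Write the unit to a scratch directory; once the paths match your hosts,
# copy it to /etc/systemd/system/ as root.
unit_dir="$(mktemp -d)"
unit_file="$unit_dir/arroyo-cluster.service"

cat > "$unit_file" <<'EOF'
[Unit]
Description=Arroyo control plane
After=network-online.target
Wants=network-online.target

[Service]
# Assumed install path for the binary.
ExecStart=/usr/local/bin/arroyo-bin cluster
# Hypothetical file holding DATABASE_*, CHECKPOINT_URL, and similar settings.
# The leading "-" tells systemd to ignore the file if it is missing.
EnvironmentFile=-/etc/arroyo/cluster.env
# Restart the control plane whenever it exits.
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

echo "wrote $unit_file"
```

After installing the file, systemctl daemon-reload followed by systemctl enable --now arroyo-cluster would start the service and keep it running across reboots.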
Schedulers
Arroyo ships with two schedulers that can be used for VM deployments: process and node.
Process scheduler
The process scheduler is the default. It runs pipelines by spawning new processes on the same host as the control plane. This is great for simple, single-node deployments as no other infrastructure is required.
To use the process scheduler, run the control plane with SCHEDULER=process or with no SCHEDULER configuration.
Node scheduler
The node scheduler supports running a distributed Arroyo cluster without requiring Kubernetes or another complex distributed runtime. An Arroyo node cluster is made up of some number of hosts running the node process, which are able to schedule work, and a control plane running with SCHEDULER=node.
A node can be run via the arroyo-bin binary or Docker image:
$ CONTROLLER_ADDR=http://localhost:9190 arroyo-bin node
Replace the CONTROLLER_ADDR configuration with the host and port that the controller is running on.
Note that the node should always be run within a process manager that restarts it, like systemd; nodes are designed to restart when they lose connection to the controller.
Nodes can be configured with a given number of slots via the SLOTS_PER_NODE environment variable. This controls how many parallel subtasks can run on that node; typically, you would want to set this to the number of CPUs, but the best value can be somewhat hardware- and workload-dependent.
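As a starting point, a node could be launched with one slot per CPU, as in the sketch below. The controller hostname is a placeholder; port 9190 is taken from the example above. The nproc utility is part of GNU coreutils and reports the number of available processing units.

```shell
# Placeholder controller address; use the host and port of your controller.
export CONTROLLER_ADDR='http://controller.internal.example.com:9190'

# One slot per CPU as a first guess; tune this for your hardware and workload.
SLOTS_PER_NODE="$(nproc)"
export SLOTS_PER_NODE

echo "starting node with $SLOTS_PER_NODE slots, controller at $CONTROLLER_ADDR"
if command -v arroyo-bin >/dev/null 2>&1; then
  arroyo-bin node
else
  echo "arroyo-bin not found on PATH; not starting node" >&2
fi
```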
Prometheus
Prometheus is required for the Web UI metrics support. All Arroyo services run a Prometheus exporter on their admin HTTP port (8001 for the API, and 9191 for the controller) at /metrics.
The workers rely on the Prometheus pushgateway to produce metrics. You will need to run a pushgateway instance on the nodes that run the Arroyo workers at the default endpoint of localhost:9091, and you will need to configure your Prometheus instance to scrape the pushgateways.
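The scrape targets described above could be expressed as a Prometheus configuration fragment along the lines of the sketch below. The job names and the output file name are illustrative assumptions, and localhost stands in for the real hostnames of your control plane and worker nodes. Setting honor_labels: true is the usual recommendation when scraping a pushgateway, so that labels pushed by the workers are preserved.

```shell
# Write the fragment to a scratch directory so it can be reviewed and
# merged into your existing prometheus.yml.
conf_dir="$(mktemp -d)"
conf_file="$conf_dir/prometheus-arroyo.yml"

cat > "$conf_file" <<'EOF'
scrape_configs:
  # Control plane admin ports; /metrics is the default metrics path.
  - job_name: arroyo-api
    static_configs:
      - targets: ['localhost:8001']
  - job_name: arroyo-controller
    static_configs:
      - targets: ['localhost:9191']
  # Add one target per host that runs Arroyo workers and a pushgateway.
  - job_name: arroyo-pushgateway
    honor_labels: true
    static_configs:
      - targets: ['localhost:9091']
EOF

echo "wrote $conf_file"
```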