While the single-node Arroyo cluster is useful for testing and development, it is not suitable for production. This page describes how to run a production-ready distributed Arroyo cluster using Arroyo’s built-in scheduler or Kubernetes.
Before attempting to run a cluster, you should familiarize yourself with the Arroyo architecture. We are also happy to support users rolling out their own clusters, so please reach out to us at email@example.com or on Discord with any questions.
Arroyo relies on a Postgres database to store configuration data and metadata. You will need to create a database
(by default called arroyo, but this can be configured).
You will need a place to store pipeline artifacts (binaries) and checkpoint data. This needs to be accessible to
all nodes in your cluster, including Arroyo services (controller, compiler service) and pipeline workers. Arroyo
supports several storage backends, including S3, GCS, and local filesystem. For local testing, a filesystem that’s
mounted on all nodes is sufficient, but for production you will likely want to use an object store like S3 or GCS.
We also support S3-compatible object stores like MinIO and Localstack; endpoints can be set via the
AWS_ENDPOINT_URL environment variable.
The storage backend is configured via two environment variables:
ARTIFACT_URL controls where pipeline artifacts (i.e., binaries) are stored; this needs to be set on the compiler service if you are running one, or on the controller if not
CHECKPOINT_URL controls where checkpoint data is stored; this needs to be set on the controller
The values for these variables are URLs that specify the storage location. Several URL forms are supported.
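As a rough illustration of how these settings fit together, the following Python sketch builds the environment for a deployment and checks the URL schemes before anything is launched. The scheme set (`s3`, `gs`, `file`), the helper name `storage_env`, and the example bucket and endpoint values are assumptions made for this sketch, not an authoritative list of what Arroyo accepts.

```python
from urllib.parse import urlparse

# Schemes assumed for illustration, based on the backends named above
# (S3, GCS, local filesystem). Arroyo may accept other URL forms as well.
ASSUMED_SCHEMES = {"s3", "gs", "file"}


def storage_env(artifact_url, checkpoint_url, endpoint_url=None):
    """Return a dict of storage-related environment variables.

    Hypothetical helper: it only validates the URL scheme and assembles
    the variables described in this section.
    """
    for name, url in (("ARTIFACT_URL", artifact_url),
                      ("CHECKPOINT_URL", checkpoint_url)):
        scheme = urlparse(url).scheme
        if scheme not in ASSUMED_SCHEMES:
            raise ValueError(f"{name}: unexpected scheme {scheme!r} in {url!r}")

    env = {"ARTIFACT_URL": artifact_url, "CHECKPOINT_URL": checkpoint_url}
    # Only needed for S3-compatible stores such as MinIO or Localstack.
    if endpoint_url is not None:
        env["AWS_ENDPOINT_URL"] = endpoint_url
    return env


if __name__ == "__main__":
    # Placeholder bucket and endpoint, e.g. a local MinIO instance.
    env = storage_env(
        "s3://example-bucket/artifacts",
        "s3://example-bucket/checkpoints",
        endpoint_url="http://localhost:9000",
    )
    for key, value in env.items():
        print(f"{key}={value}")
```

The resulting dictionary can be passed to whatever launches the controller and compiler service (for example as the `env` argument of `subprocess.run`), so that both see the same storage locations.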
The Arroyo Web UI can show job metrics to help monitor job progress. To enable this, you will need to set up a Prometheus server. See the Prometheus documentation for more details.