Running a distributed Arroyo cluster
While the single-node Arroyo cluster is useful for testing and development, it is not suitable for production. This page describes how to run a production-ready distributed Arroyo cluster using Arroyo’s built-in scheduler or Kubernetes.
Before attempting to run a cluster, you should familiarize yourself with the Arroyo architecture. We are also happy to support users rolling out their own clusters, so please reach out to us at support@arroyo.systems or on Discord with any questions.
Common Setup
Database
The Arroyo control plane relies on a database to store its configuration and metadata (like the set of existing tables, the pipelines that are meant to be running, etc.) and to power the API and Web UI. As of 0.11, two databases are supported: Sqlite and Postgres.
Sqlite is recommended for local use and single-node deployments, while Postgres should be used for scaled-out production deployments on Kubernetes.
See the database configuration options.
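As an illustration, a Postgres-backed control plane might be configured with a snippet along the following lines. This is a hedged sketch: the exact key names and structure (a database section with a type and Postgres connection settings) are assumptions here, so consult the database configuration options for the authoritative schema.

```yaml
# Hypothetical sketch of selecting Postgres instead of the default Sqlite.
# Key names are illustrative; see the database configuration reference.
database:
  type: postgres
  postgres:
    host: postgres.internal   # hostname of your Postgres instance
    port: 5432
    database-name: arroyo
    user: arroyo
    password: arroyo
```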
Storage
You will need a place to store pipeline artifacts (binaries) and checkpoint data. This storage needs to be accessible to all nodes in your cluster, including the Arroyo control plane and pipeline workers. Arroyo supports several storage backends, including S3, GCS, ABS, and the local filesystem. For local testing, a filesystem that's mounted on all nodes is sufficient, but for production you will likely want to use an object store like S3 or GCS. We also support S3-compatible object stores like MinIO and Localstack; custom endpoints can be set via the s3:: prefix or the AWS_ENDPOINT_URL environment variable.
The storage backend is configured by the following config properties:
checkpoint-url (env var: ARROYO__CHECKPOINT_URL) configures where checkpoints are written; for high availability this should be an object store, but it may be a local directory for testing and development
compiler.artifact-url (env var: ARROYO__COMPILER__ARTIFACT_URL) controls where compiled UDF libraries are stored
The values for these variables are URLs that specify the storage location. We support a number of ways of specifying these, for example:
s3://my-bucket/key/path
s3::https://my-custom-s3:1234/my-bucket/key/path
https://s3.us-east-1.amazonaws.com/my-bucket
file:///my/local/filesystem
/my/local/filesystem
gs://my-gcs-bucket
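Putting these options together, a storage configuration might look something like the sketch below. The bucket names and paths are placeholders, and the nesting of compiler.artifact-url under a compiler section is an assumption based on the dotted property name; the equivalent environment variables are the ones listed above.

```yaml
# Illustrative storage settings; bucket names and paths are placeholders.
# Equivalent env vars: ARROYO__CHECKPOINT_URL and ARROYO__COMPILER__ARTIFACT_URL.
checkpoint-url: s3://my-bucket/checkpoints
# For an S3-compatible store such as MinIO, an explicit endpoint can be
# embedded with the s3:: prefix (or supplied via AWS_ENDPOINT_URL):
# checkpoint-url: s3::http://minio.internal:9000/my-bucket/checkpoints

compiler:
  artifact-url: s3://my-bucket/artifacts   # where compiled UDF libraries are stored
```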