While the single-node Arroyo cluster is useful for testing and development, it is not suitable for production. This page describes how to run a production-ready distributed Arroyo cluster using Arroyo’s built-in scheduler or Kubernetes.

Before attempting to run a cluster, you should familiarize yourself with the Arroyo architecture. We are also happy to support users rolling out their own clusters, so please reach out to us at support@arroyo.systems or on Discord with any questions.

Common Setup

Database

The Arroyo control plane relies on a database to store its configuration and metadata (like the set of existing tables, the pipelines that are meant to be running, etc.) and to power the API and Web UI. As of 0.11, two databases are supported: SQLite and Postgres.

SQLite is recommended for local use and single-node deployments, while Postgres should be used for scaled-out production deployments on Kubernetes.

See the database configuration options.
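
As a sketch, a Postgres-backed control plane could be configured roughly like this in the config file. The exact key names under `database` here are assumptions for illustration; consult the database configuration reference for the authoritative list.

```toml
# Example arroyo.toml database section (key names under [database] are
# illustrative; see the database configuration reference for your version).

[database]
type = "postgres"            # "sqlite" (the default) or "postgres"

[database.postgres]
host = "postgres.internal"   # hostname of your Postgres instance
port = 5432
database-name = "arroyo"
user = "arroyo"
password = "arroyo"
```

Each property can also be supplied as an environment variable using the `ARROYO__` prefix with `__` as the separator (for example, `ARROYO__DATABASE__TYPE=postgres`), following the same pattern as the storage variables described below.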

Storage

You will need a place to store pipeline artifacts (binaries) and checkpoint data. This needs to be accessible to all nodes in your cluster, including the Arroyo control plane and pipeline workers. Arroyo supports several storage backends, including S3, GCS, ABS, and the local filesystem. For local testing, a filesystem that's mounted on all nodes is sufficient, but for production you will likely want to use an object store like S3 or GCS. We also support S3-compatible object stores like MinIO and Localstack; endpoints can be set via the s3:: prefix or the AWS_ENDPOINT_URL environment variable.

The storage backend is configured by the following config properties:

  • checkpoint-url (env var: ARROYO__CHECKPOINT_URL) configures where checkpoints are written; for high availability this should be an object store, but it may be a local directory for testing and development
  • compiler.artifact-url (env var: ARROYO__COMPILER__ARTIFACT_URL) controls where compiled UDF libraries are stored

The values for these properties are URLs that specify the storage location, and may be written in several forms, for example:

  • s3://my-bucket/key/path
  • s3::https://my-custom-s3:1234/my-bucket/key/path
  • https://s3.us-east-1.amazonaws.com/my-bucket
  • file:///my/local/filesystem
  • /my/local/filesystem
  • gs://my-gcs-bucket
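
Putting this together, a production cluster that keeps both checkpoints and compiled artifacts in S3 might set these properties in its config file as follows. This is a minimal sketch: the bucket name and MinIO endpoint are placeholders.

```toml
# Example storage configuration (bucket name and endpoint are placeholders).

# Where checkpoints are written; use an object store for high availability.
checkpoint-url = "s3://my-arroyo-bucket/checkpoints"

# For an S3-compatible store such as MinIO, the endpoint can be embedded
# using the s3:: form shown above:
# checkpoint-url = "s3::https://minio.internal:9000/my-arroyo-bucket/checkpoints"

# Where compiled UDF libraries are stored.
[compiler]
artifact-url = "s3://my-arroyo-bucket/artifacts"
```

The same values can instead be provided via the ARROYO__CHECKPOINT_URL and ARROYO__COMPILER__ARTIFACT_URL environment variables listed above.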