Arroyo supports Kubernetes both as a scheduler (for running Arroyo pipeline tasks) and as a deploy target for the Arroyo control plane. This is the easiest way to get a production-quality Arroyo cluster running.

This guide assumes a working Kubernetes cluster. This may be a local installation (like minikube) or a cloud provider (like Amazon EKS or Google Kubernetes Engine). All stable Kubernetes versions (currently >=1.25) are supported, and older versions will likely work as well.

A complete working Arroyo cluster involves a number of components (as described in more detail in the architecture guide):

  • arroyo-compiler
  • arroyo-controller
  • arroyo-api
  • prometheus
  • postgres

We provide a Helm chart to configure all of these components. You may choose to use an existing Postgres database and Prometheus instance, or have the Helm chart deploy cluster-specific instances.

Set up the Helm repository

You will first need to set up Helm locally. Follow the instructions here to get a working Helm installation.

Next you will need to add the Arroyo Helm repository to your local Helm installation:

$ helm repo add arroyo https://arroyosystems.github.io/helm-repo
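
If you have added the repository before, you can refresh its chart index to pick up new releases:

$ helm repo update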

Once this is installed, you should be able to see the Arroyo Helm chart:

$ helm search repo arroyo
NAME         	CHART VERSION	APP VERSION	DESCRIPTION
arroyo/arroyo	0.6.0        	0.6.0      	Helm chart for the Arroyo stream processing engine

Configure the Helm chart

The Helm chart provides a number of options, which can be inspected by running

$ helm show values arroyo/arroyo
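
A common workflow is to write the chart's default values out to a file and edit that file (the filename here is just a convention):

$ helm show values arroyo/arroyo > values.yaml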

The most important options are:

  • postgresql.deploy: Whether to deploy a new Postgres instance. If set to false, the chart will expect a Postgres instance to be available with the connection settings determined by the postgresql.externalDatabase configurations (by default: postgres://arroyo:arroyo@localhost:5432/arroyo).
  • prometheus.deploy: Whether to deploy a new Prometheus instance. If set to false, the chart will expect a Prometheus instance to be available at the URL determined by prometheus.endpoint (by default: http://localhost:9090). Prometheus is not required to operate the system, but it is needed for the Arroyo UI metrics to function. By default, Arroyo expects a scrape interval of 5s; if your scrape interval is longer (for example, the Prometheus default of 1m), you will need to set the prometheus.queryRate configuration to at least 4x the scrape interval (see the sketch after this list).
  • artifactUrl and checkpointUrl: Configure where pipeline artifacts and checkpoints are stored. See the overview for more details on how these are configured. If these are set to a local directory (when running a local k8s cluster), you will need to configure volumes and volumeMounts to make that directory available on all of the pods.
  • existingConfigMap: Allows you to set environment variables on the Arroyo pods.
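
For example, here is a minimal sketch of the Prometheus values for pointing the chart at an existing instance that scrapes at the default 1m interval (the endpoint URL is a placeholder, and the duration format is assumed to match the chart's defaults):

prometheus:
  deploy: false
  endpoint: "http://prometheus-server.monitoring.svc.cluster.local:9090"
  queryRate: 4m  # at least 4x the 1m scrape interval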

The Helm chart can be configured either via a values.yaml file or via command-line arguments. See the Helm documentation for more details.
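
For example, options like those above could equivalently be passed on the command line with --set (the values here are illustrative):

$ helm install arroyo arroyo/arroyo \
    --set postgresql.deploy=false \
    --set artifactUrl=s3://arroyo-artifacts \
    --set checkpointUrl=s3://arroyo-checkpoints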

Example local configuration

To run on a local Kubernetes cluster without S3, you can use the following configuration:

artifactUrl: "/tmp/arroyo-test"
checkpointUrl: "/tmp/arroyo-test"

volumes:
  - name: checkpoints
    hostPath:
      path: /tmp/arroyo-test
      type: DirectoryOrCreate

volumeMounts:
  - name: checkpoints
    mountPath: /tmp/arroyo-test
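
Note that the hostPath refers to the filesystem of the Kubernetes node, not your workstation. If your local cluster runs inside a VM (as with some minikube drivers), one way to expose a host directory at that path is minikube mount; a sketch:

$ minikube mount /tmp/arroyo-test:/tmp/arroyo-test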

Example EKS configuration

For a production deployment on EKS, you may want to use an external Postgres instance and S3 buckets. Assuming you have an existing RDS installation at arroyo.cnkkgnj5egvb.us-east-1.rds.amazonaws.com with a database named arroyo, and S3 buckets named arroyo-artifacts and arroyo-checkpoints in the us-east-1 region, you can use the following configuration:

postgresql:
  deploy: false
  externalDatabase:
    host: arroyo.cnkkgnj5egvb.us-east-1.rds.amazonaws.com
    name: arroyo
    user: arroyodb
    password: arroyodb

artifactUrl: "s3://arroyo-artifacts"
checkpointUrl: "s3://arroyo-checkpoints"

If you are using an external Postgres instance (for example, one hosted in RDS) you will need to ensure that the pod template for your EKS cluster has a security group that allows access to your RDS cluster. If not, you may see the Arroyo service pods hang on startup as they try to connect.
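
One way to verify connectivity from inside the cluster is to run a throwaway Postgres client pod (the pod name is arbitrary, and the host and credentials are the illustrative values from above):

$ kubectl run pg-test --rm -it --image=postgres:15 -- \
    psql -h arroyo.cnkkgnj5egvb.us-east-1.rds.amazonaws.com -U arroyodb -d arroyo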

Example GKE configuration

For a production deployment on GKE, you may want to use an external Postgres instance and GCS buckets. You will need to give the pods access to the GCS buckets by creating a service account with the roles/storage.objectAdmin role and specifying its name in the Helm chart configuration. See this guide for details on how to set up the permissions. The service account you create can then be configured in the Helm chart with the serviceAccount value.

artifactUrl: "gs://arroyo-artifacts"
checkpointUrl: "gs://arroyo-checkpoints"

postgresql:
  deploy: false
  externalDatabase:
    host: db.prod.iad.arroyo.cluster
    name: arroyo_test
    user: arroyodb
    password: arroyodb

serviceAccount:
  name: gke-access-gcs
  create: false
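
As a rough sketch, the permission setup with Workload Identity might look like the following (the project ID, namespace, and account names are all placeholders, and this assumes Workload Identity is enabled on your cluster):

$ gcloud iam service-accounts create arroyo-gcs
$ gcloud projects add-iam-policy-binding my-project \
    --member "serviceAccount:arroyo-gcs@my-project.iam.gserviceaccount.com" \
    --role "roles/storage.objectAdmin"
$ kubectl create serviceaccount gke-access-gcs
$ gcloud iam service-accounts add-iam-policy-binding \
    arroyo-gcs@my-project.iam.gserviceaccount.com \
    --member "serviceAccount:my-project.svc.id.goog[default/gke-access-gcs]" \
    --role "roles/iam.workloadIdentityUser"
$ kubectl annotate serviceaccount gke-access-gcs \
    iam.gke.io/gcp-service-account=arroyo-gcs@my-project.iam.gserviceaccount.com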

Installing the Helm chart

Once you’ve created your configuration file values.yaml, you can install the Helm chart:

$ helm install arroyo arroyo/arroyo -f values.yaml

This will install the Helm chart into your Kubernetes cluster. You can check the status of the installation by running helm status arroyo. Once the installation is complete, you should see the following pods running:

$ kubectl get pods
NAME                                        READY   STATUS             RESTARTS      AGE
arroyo-compiler-ccd6b7bdb-752vt             1/1     Running            0             36s
arroyo-controller-75587f886b-k9drg          1/1     Running            1 (18s ago)   36s
arroyo-postgresql-0                         1/1     Running            0             26s
arroyo-api-5dccb89967-zl727                 1/1     Running            2 (17s ago)   36s
arroyo-prometheus-server-5c8d49b85d-xwl2h   2/2     Running            0             36s

(Note that if you’re deploying Postgres, it may take a couple of minutes for all of the pods to reach the Running state.)
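
You can watch the pods come up with:

$ kubectl get pods --watch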

Accessing the Arroyo UI

Once everything is running, you should be able to access the Arroyo UI. If you’re running locally on Linux, you can connect directly to the service with

$ open "http://$(kubectl get service/arroyo -o jsonpath='{.spec.clusterIP}')"

If you’re running on macOS or in EKS, you can proxy the service to your local machine with

$ kubectl port-forward service/arroyo 8000:80

Then you can access the UI at http://localhost:8000.