Arroyo supports Kubernetes both as a scheduler (for running Arroyo pipeline tasks) and as a deployment target for the
Arroyo control plane. This is the easiest way to get a production-quality Arroyo cluster running.
This guide assumes a working Kubernetes cluster. This may be a local
installation (like minikube) or a cloud
provider (like Amazon EKS or
Google Kubernetes Engine). All
currently stable Kubernetes versions (>=1.25) are supported, and older versions
will likely work as well.
A complete working Arroyo cluster involves a number of components (as described in more detail in the
architecture guide), including a controller pod and a postgres database.
We provide a Helm chart to configure all of these components. You may choose to
use an existing Postgres database, or have the Helm chart deploy a
cluster-specific instance.
Set up the Helm repository
You will first need to set up Helm locally. Follow the instructions here to get
a working Helm installation.
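For example, on Linux or macOS, one common way to install Helm is the official install script (the URL below is the one documented by the Helm project; see the Helm docs for other installation methods):

$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
$ chmod 700 get_helm.sh
$ ./get_helm.sh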
Next you will need to add the Arroyo Helm repository to your local Helm installation:
$ helm repo add arroyo https://arroyosystems.github.io/helm-repo
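If you have added the repository previously, run a repo update so that the latest chart versions are available:

$ helm repo update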
Once this is installed, you should be able to see the Arroyo Helm chart:
$ helm search repo arroyo
NAME CHART VERSION APP VERSION DESCRIPTION
arroyo/arroyo 0.6.0 0.6.0 Helm chart for the Arroyo stream processing engine
The Helm chart provides a number of options, which can be inspected by running
$ helm show values arroyo/arroyo
The most important options are:
postgresql.deploy: Whether to deploy a new Postgres instance. If set to false, the chart will expect a Postgres
instance to be available with the connection settings determined by the postgresql.externalDatabase configuration
(by default: postgres://arroyo:arroyo@localhost:5432/arroyo).
artifactUrl and checkpointUrl: Configures where pipeline artifacts and
checkpoints are stored. See the overview for
more details on how these are configured. If this is set to a local directory
(when running a local k8s cluster), you will need to configure volumes and
volumeMounts to make this directory available on all of the pods.
existingConfigMap allows you to supply additional configuration to the Arroyo pods from a ConfigMap you manage (see Configuring Arroyo below).
The helm chart can be configured either via a values.yaml file or via command line arguments. See the
Helm documentation for more details.
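For example, a minimal values.yaml that deploys a bundled Postgres instance and stores artifacts and checkpoints in S3 might look like the sketch below; the bucket name is a placeholder, and only options described in this guide are used:

# values.yaml -- minimal sketch; my-arroyo-bucket is a placeholder
postgresql:
  deploy: true
artifactUrl: "s3://my-arroyo-bucket/artifacts"
checkpointUrl: "s3://my-arroyo-bucket/checkpoints"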
Configuring Arroyo
Arroyo has a rich configuration system, which is controlled through environment variables
and config files. In Kubernetes, either option is available to control the behavior of the system.
To set an environment variable, override the env field in the helm chart:
env:
  - name: ARROYO__DEFAULT_CHECKPOINT_INTERVAL
    value: 5s
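The same override can also be passed on the command line using Helm's list index syntax, for example:

$ helm upgrade --install arroyo arroyo/arroyo \
  --set "env[0].name=ARROYO__DEFAULT_CHECKPOINT_INTERVAL" \
  --set "env[0].value=5s"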
Configuring Arroyo via a config file is also possible. Create a config map like this:
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-arroyo-config
data:
  config.yaml: |
    default-checkpoint-interval: 5s
    pipeline:
      source-batch-size: 256
      healthy-duration: 1m
Deploy it to the cluster:
$ kubectl create -f configmap.yaml
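Alternatively, if you have the configuration in a local config.yaml file, you can create the ConfigMap directly from it without writing a manifest:

$ kubectl create configmap my-arroyo-config --from-file=config.yaml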
Then configure the Helm chart to use it:
existingConfigMap: my-arroyo-config
Local Kubernetes
To test out an Arroyo cluster or do development, it can be useful to run it on a local Kubernetes cluster
(for example, minikube, k3s, Docker Desktop, etc.). In this mode, we can use a local filesystem mount
for checkpoint storage, as we are running only on one node. (For distributed clusters, checkpoints must
be stored in a location that is accessible to all nodes, like S3).
An example local configuration looks like this:
artifactUrl: "/tmp/arroyo-test"
checkpointUrl: "/tmp/arroyo-test"
volumes:
  - name: checkpoints
    hostPath:
      path: /tmp/arroyo-test
      type: DirectoryOrCreate
volumeMounts:
  - name: checkpoints
    mountPath: /tmp/arroyo-test
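Note that the hostPath must exist on the Kubernetes node itself. If you are using minikube, for example, one way to make a host directory available inside the minikube VM is the mount command (it runs in the foreground, so leave it open in a separate terminal):

$ minikube mount /tmp/arroyo-test:/tmp/arroyo-test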
Amazon EKS
For a production deployment on EKS, you may want to use an external Postgres instance and S3 bucket. Assuming you have
an existing RDS installation at arroyo.cnkkgnj5egvb.us-east-1.rds.amazonaws.com with a database named arroyo, and
an S3 bucket named arroyo-artifacts in the us-east-1 region, you can use the following configuration:
postgresql:
  deploy: false
  externalDatabase:
    host: arroyo.cnkkgnj5egvb.us-east-1.rds.amazonaws.com
    name: arroyo
    user: arroyodb
    password: arroyodb
artifactUrl: "s3://arroyo-artifacts/artifacts"
checkpointUrl: "s3://arroyo-artifacts/checkpoints"
If you are using an external Postgres instance (for example, one hosted in RDS), you will need to ensure that the
pod template for your EKS cluster has a security group that allows access to your RDS cluster. If not, you may
see the Arroyo service pods hang on startup as they try to connect.
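As a sketch, assuming the RDS instance uses security group sg-0aaaaaaaaaaaaaaaa and the EKS nodes use sg-0bbbbbbbbbbbbbbbb (both hypothetical IDs), you could allow Postgres traffic from the nodes with:

# sg-0aaaaaaaaaaaaaaaa: RDS security group; sg-0bbbbbbbbbbbbbbbb: EKS node security group (both hypothetical)
$ aws ec2 authorize-security-group-ingress \
  --group-id sg-0aaaaaaaaaaaaaaaa \
  --protocol tcp \
  --port 5432 \
  --source-group sg-0bbbbbbbbbbbbbbbb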
S3 Authentication
Arroyo pods need to be able to authenticate against your S3 bucket to write and restore
checkpoints, as well as against any other AWS resources you would like to access (like
Kinesis streams or IAM-secured Kafka clusters). This section covers several options.
Static credentials
The easiest (but least secure) way to authenticate against S3 or other AWS
services is to embed static credentials (an access key id/secret pair) as
environment variables. Arroyo reads standard AWS environment
variables.
To embed credentials directly in the config, you can add this to your Helm configuration:
env:
  - name: AWS_ACCESS_KEY_ID
    value: Z6EZ25YSK97EU5ZQ96MT2
  - name: AWS_SECRET_ACCESS_KEY
    value: lIEY87tBT3D7bGKxnM4IaxDAtmiORKDvqS3VMCwFO
However, this is quite insecure, as the credentials will be visible in plaintext to anyone
who has access to the Kubernetes cluster. A better approach is to store the credentials
as a Kubernetes Secret.
We can create the secret like this:
$ kubectl create secret generic aws-creds \
--from-literal=aws-access-key-id='Z6EZ25YSK97EU5ZQ96MT2' \
--from-literal=aws-secret-access-key='lIEY87tBT3D7bGKxnM4IaxDAtmiORKDvqS3VMCwFO'
Then inject it into the configuration:
env:
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: aws-creds
        key: aws-access-key-id
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: aws-creds
        key: aws-secret-access-key
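You can verify that the secret exists (without printing its contents) with:

$ kubectl get secret aws-creds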
Node IAM Role
Another option is to configure the EKS Node IAM role to have access to the desired resource (like
S3). This is also straightforward and is more secure than using static credentials; however, it
grants that access to all workloads running on the EKS cluster.
First, we create a policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::my-s3-bucket",
        "arn:aws:s3:::my-s3-bucket/*"
      ]
    }
  ]
}
(change my-s3-bucket to the name of your S3 bucket)
$ aws iam create-policy \
  --policy-name ArroyoNodePolicy \
  --policy-document file://policy.json
Next, we must find the ARN of the node group role:
$ aws eks describe-nodegroup \
--cluster-name <EKS_CLUSTER_NAME> \
--nodegroup-name <NODEGROUP_NAME> \
--query "nodegroup.nodeRole"
arn:aws:iam::773345601144:role/eks-node-role
Then we can attach the policy to the role:
$ aws iam attach-role-policy \
--role-name <NODE_INSTANCE_ROLE> \
--policy-arn arn:aws:iam::<ACCOUNT_ID>:policy/ArroyoNodePolicy
Replace <NODE_INSTANCE_ROLE> with the name of the role from the ARN returned in the previous step (here, eks-node-role), and <ACCOUNT_ID> with your AWS account ID.
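You can confirm that the policy is attached with:

$ aws iam list-attached-role-policies --role-name <NODE_INSTANCE_ROLE>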
IAM Roles for Service Accounts (IRSA)
The most secure way to configure this is to use EKS’s
IRSA feature.
This bridges IAM roles (AWS’s way of configuring permissions) with Kubernetes Service Accounts,
and ensures that only the pods that need access to a particular resource are granted it.
Full documentation for how to set up IRSA is beyond the scope of this guide; see the AWS docs
for full details.
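As a sketch, one common way to create such a service account is with eksctl, assuming your cluster has an IAM OIDC provider associated with it and reusing the ArroyoNodePolicy created above (the cluster name and account ID are placeholders):

$ eksctl create iamserviceaccount \
  --name arroyo-pod-sa \
  --namespace default \
  --cluster <EKS_CLUSTER_NAME> \
  --attach-policy-arn arn:aws:iam::<ACCOUNT_ID>:policy/ArroyoNodePolicy \
  --approve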
Once you’ve created a policy and attached it to a service account (for example, one called arroyo-pod-sa),
you can configure Arroyo to use it with this Helm configuration:
serviceAccount:
  create: false
  name: arroyo-pod-sa
role:
  create: false
Google GKE
For a production deployment on GKE, you may want to use an external Postgres
instance and GCS bucket. You will need to give the pods access to the GCS bucket
by creating a service account with the storage.objects.admin role and
specifying the name of the service account in the helm chart configuration. See
this guide
for details on how to set up the permissions. The service account you create can
then be configured in the helm chart with the serviceAccount value.
artifactUrl: "gs://arroyo-artifacts"
checkpointUrl: "gs://arroyo-checkpoints"
postgresql:
  deploy: false
  externalDatabase:
    host: db.prod.iad.arroyo.cluster
    name: arroyo_test
    user: arroyodb
    password: arroyodb
serviceAccount:
  name: gke-access-gcs
  create: false
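If your GKE cluster uses Workload Identity, the Kubernetes service account also needs to be bound to the Google service account that carries the storage.objects.admin role mentioned above. A sketch, assuming a Google service account named gcs-access@my-project.iam.gserviceaccount.com (hypothetical):

# Bind the Kubernetes service account to the Google service account (names are hypothetical)
$ kubectl annotate serviceaccount gke-access-gcs \
  iam.gke.io/gcp-service-account=gcs-access@my-project.iam.gserviceaccount.com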
Installing the helm chart
Once you’ve created your configuration file values.yaml, you can install the helm chart:
$ helm install arroyo arroyo/arroyo -f values.yaml
This will install the helm chart into your Kubernetes cluster. You can check the status of the installation by running
helm status arroyo. Once the installation is complete, you should see the following pods running:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
arroyo-compiler-ccd6b7bdb-752vt 1/1 Running 0 36s
arroyo-controller-75587f886b-k9drg 1/1 Running 1 (18s ago) 36s
arroyo-postgresql-0 1/1 Running 0 26s
arroyo-api-5dccb89967-zl727 1/1 Running 2 (17s ago) 36s
(Note that if you’re deploying Postgres, it may take a couple of minutes for all of the pods to reach the Running state.)
Accessing the Arroyo UI
Once everything is running, you should be able to access the Arroyo UI.
If you’re running locally on Linux, you can connect directly to the service with
$ open "http://$(kubectl get service/arroyo -o jsonpath='{.spec.clusterIP}')"
If you’re running on macOS or in EKS, you can proxy the service to your local machine with
$ kubectl port-forward service/arroyo 5115:80
Then you can access the UI at http://localhost:5115.