Set up the Helm repository
You will first need to set up Helm locally. Follow the instructions here to get a working Helm installation. Next, you will need to add the Arroyo Helm repository to your local Helm installation:
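As a sketch, adding the repository and refreshing the local index looks like the following; the repository name and URL here are assumptions, so substitute the URL given in the Arroyo documentation:

```sh
# Add the Arroyo chart repository (URL shown is illustrative) and refresh the index
helm repo add arroyo https://arroyosystems.github.io/helm-repo
helm repo update
```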
Configure the Helm chart
The Helm chart provides a number of options, which can be inspected by running helm show values against the chart (see the example after this list). The most important options are:
- postgresql.deploy: Whether to deploy a new Postgres instance. If set to false, the chart will expect a Postgres instance to be available with the connection settings determined by the postgresql.externalDatabase configuration (by default: postgres://arroyo:arroyo@localhost:5432/arroyo).
- artifactUrl and checkpointUrl: Configure where pipeline artifacts and checkpoints are stored. See the overview for more details on how these are configured. If these are set to a local directory (when running a local k8s cluster), you will need to configure volumes and volumeMounts to make that directory available on all of the pods.
- existingConfigMap: Allows you to set environment variables on the Arroyo pods.
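To see every available option and its default, inspect the chart's values; the chart reference arroyo/arroyo is an assumption based on the repository added above:

```sh
# List all configurable values for the Arroyo chart
helm show values arroyo/arroyo
```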
These options can be set in a values.yaml file or via command line arguments. See the Helm documentation for more details.
Configuring Arroyo
Arroyo has a rich configuration system, which is controlled through environment variables and config files. In Kubernetes, either option is available to control the behavior of the system. To set an environment variable, override the env field in the helm chart:
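A minimal sketch of that override in values.yaml, assuming the chart accepts a standard Kubernetes-style list of name/value pairs; the variable name and value are illustrative, not documented Arroyo settings:

```yaml
# values.yaml -- set environment variables on the Arroyo pods (sketch)
env:
  - name: ARROYO__EXAMPLE_SETTING   # illustrative; use a real Arroyo config variable
    value: "example-value"
```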
Alternatively, you can define the environment variables in a ConfigMap and point the chart at it with the existingConfigMap option.
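A sketch of that approach, assuming existingConfigMap simply names a ConfigMap whose entries are exposed to the pods; the ConfigMap name and keys are illustrative:

```yaml
# configmap.yaml -- illustrative ConfigMap holding Arroyo environment variables
apiVersion: v1
kind: ConfigMap
metadata:
  name: arroyo-env                          # hypothetical name
data:
  ARROYO__EXAMPLE_SETTING: "example-value"  # illustrative variable
```

```yaml
# values.yaml -- point the chart at the ConfigMap above
existingConfigMap: arroyo-env
```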
Local Kubernetes
To test out an Arroyo cluster or do development, it can be useful to run it on a local Kubernetes cluster (for example, minikube, k3s, Docker Desktop, etc.). In this mode, we can use a local filesystem mount for checkpoint storage, as we are running only on one node. (For distributed clusters, checkpoints must be stored in a location that is accessible to all nodes, like S3.) An example local configuration looks like this:
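A sketch of what such a values.yaml might look like, assuming a hostPath volume on the single node; the paths, volume name, and mount point are illustrative:

```yaml
# values.yaml -- local development sketch (single-node cluster)
postgresql:
  deploy: true                  # let the chart run its own Postgres

# store artifacts and checkpoints on the local filesystem (illustrative paths)
artifactUrl: "/arroyo/artifacts"
checkpointUrl: "/arroyo/checkpoints"

# make that directory available to all pods via a hostPath volume
volumes:
  - name: arroyo-data
    hostPath:
      path: /tmp/arroyo
      type: DirectoryOrCreate
volumeMounts:
  - name: arroyo-data
    mountPath: /arroyo
```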
Amazon EKS
For a production deployment on EKS, you may want to use an external Postgres instance and S3 bucket. Assuming you have an existing RDS installation at arroyo.cnkkgnj5egvb.us-east-1.rds.amazonaws.com with a database named arroyo, and an S3 bucket named arroyo-artifacts in the us-east-1 region, you can use the following configuration:
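A sketch of such a values.yaml; the keys under postgresql.externalDatabase and the credentials are assumptions, so verify them against helm show values:

```yaml
# values.yaml -- EKS sketch with external Postgres (RDS) and S3 storage
postgresql:
  deploy: false                # use the existing RDS instance instead
  externalDatabase:            # key names are assumptions; verify against the chart
    host: arroyo.cnkkgnj5egvb.us-east-1.rds.amazonaws.com
    port: 5432
    name: arroyo
    user: arroyo               # illustrative credentials
    password: arroyo

# store artifacts and checkpoints in the S3 bucket
artifactUrl: "s3://arroyo-artifacts/artifacts"
checkpointUrl: "s3://arroyo-artifacts/checkpoints"
```

Depending on your setup, you may also need to provide the region, for example via an AWS_DEFAULT_REGION environment variable.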
If you are using an external Postgres instance (for example, one hosted in RDS), you will need to ensure that the pod template for your EKS cluster has a security group that allows access to your RDS cluster. If not, you may see the Arroyo service pods hang on startup as they try to connect.
S3 Authentication
Arroyo pods need to be able to authenticate against your S3 bucket to write and restore checkpoints, as well as against any other AWS resources you would like to access (like Kinesis streams or IAM-secured Kafka clusters). This section covers several options.
Static credentials
The easiest (but least secure) way to authenticate against S3 or other AWS services is to embed static credentials (an access key id/secret pair) as environment variables. Arroyo reads standard AWS environment variables. To embed credentials directly in the config, you can add this to your Helm configuration:
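A sketch of that addition, assuming the chart's env field accepts standard Kubernetes name/value pairs; the credential values are placeholders, and in practice you would more likely reference a Kubernetes Secret than inline the secret key:

```yaml
# values.yaml -- static AWS credentials as environment variables (least secure option)
env:
  - name: AWS_ACCESS_KEY_ID
    value: "<YOUR_ACCESS_KEY_ID>"
  - name: AWS_SECRET_ACCESS_KEY
    value: "<YOUR_SECRET_ACCESS_KEY>"
```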
Pod IAM Role
Another option is to configure the EKS Node IAM role to have access to the desired resource (like S3). This is also straightforward and is more secure than using static credentials; however, it does require giving all workloads on the EKS cluster access. First, we create a policy (changing my-s3-bucket to the name of your S3 bucket):
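A sketch of what policy.json could look like: an IAM policy granting read/write access to the bucket. The set of actions is an assumption about what checkpointing needs, and my-s3-bucket is a placeholder:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-s3-bucket",
        "arn:aws:s3:::my-s3-bucket/*"
      ]
    }
  ]
}
```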
Then create the policy and attach it to your cluster's node instance role, replacing <NODE_INSTANCE_ROLE> with the ARN returned in the previous step.
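As a hedged sketch of that sequence using the AWS CLI; the policy name, cluster name, and node group name are illustrative:

```sh
# Create the policy; note the policy ARN printed in the output
aws iam create-policy \
  --policy-name arroyo-s3-access \
  --policy-document file://policy.json

# Look up the IAM role used by your node group (illustrative cluster/nodegroup names)
aws eks describe-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --query 'nodegroup.nodeRole' --output text

# Attach the policy to that role, substituting the role and policy ARN from above
aws iam attach-role-policy \
  --role-name <NODE_INSTANCE_ROLE> \
  --policy-arn <POLICY_ARN>
```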
IAM Roles for Service Accounts (IRSA)
The most secure way to configure this is to use EKS’s IRSA feature. This is a way to bridge IAM roles (AWS’s way of configuring permissions) with Service Accounts (Kubernetes’s equivalent), and it ensures that only the pods that need access to a particular resource are granted it. Full documentation for how to set up IRSA is beyond the scope of this guide; see the AWS docs for full details. Once you’ve created a policy and attached it to a service account (for example, one called arroyo-pod-sa), we can configure Arroyo to use it with this helm configuration:
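A sketch of that values.yaml, assuming the chart exposes a serviceAccount value that can reference an existing service account; the exact key layout may differ, so check helm show values:

```yaml
# values.yaml -- run the Arroyo pods under the IRSA-enabled service account
serviceAccount:
  create: false        # the account already exists
  name: arroyo-pod-sa
```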
Google GKE
For a production deployment on GKE, you may want to use an external Postgres instance and GCS bucket. You will need to give the pods access to the GCS bucket by creating a service account with the storage.objects.admin role and specifying the name of the service account in the helm chart configuration. See this guide for details on how to set up the permissions. The service account you create can then be configured in the helm chart with the serviceAccount value.
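A sketch of a GKE values.yaml under those assumptions; the bucket name, database host, and the serviceAccount and externalDatabase key layouts are illustrative:

```yaml
# values.yaml -- GKE sketch with external Postgres and GCS storage
postgresql:
  deploy: false
  externalDatabase:          # key names are assumptions; verify against the chart
    host: 10.0.0.5           # illustrative Postgres host
    port: 5432
    name: arroyo
    user: arroyo
    password: arroyo

artifactUrl: "gs://my-arroyo-bucket/artifacts"      # illustrative bucket
checkpointUrl: "gs://my-arroyo-bucket/checkpoints"

serviceAccount:
  create: false
  name: arroyo-gke-sa        # service account with access to the bucket (illustrative)
```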
Installing the helm chart
Once you’ve created your configuration file values.yaml, you can install the helm chart:
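A sketch of the install command, assuming the chart is published as arroyo/arroyo in the repository added earlier and that the release is named arroyo:

```sh
# Install the Arroyo release with your configuration
helm install arroyo arroyo/arroyo -f values.yaml
```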
You can check the status of the deployment by running helm status arroyo. Once the installation is complete, you should see the following pods running:
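Which pods appear depends on your configuration (for example, whether the chart deploys its own Postgres); you can list them with kubectl:

```sh
# List the pods in the namespace where the chart was installed
kubectl get pods
```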