Overview

Arroyo has a flexible and powerful configuration system that allows options to be set via files (in TOML or YAML format) and environment variables.

The system will look for configuration in the following places, from highest to lowest priority:

  1. ARROYO__* environment variables
  2. Config file specified via the --config option
  3. Any *.toml or *.yaml files in the --config-dir directory
  4. arroyo.toml in the current directory
  5. $(user conf dir)/arroyo/config.{toml,yaml} — (this is ~/.config/arroyo on Linux and ~/Library/Application Support/arroyo on macOS)
  6. Default configuration
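For example, because ARROYO__* environment variables sit at the top of this list, a value exported in the shell wins over the same option set in any config file (the bucket name below is purely illustrative):

```shell
# Hypothetical override: this env var takes priority over any
# checkpoint-url value set in arroyo.toml or another config file.
export ARROYO__CHECKPOINT_URL='s3://example-bucket/checkpoints'
```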

Config files

In TOML or YAML, nested configurations are specified as tables under the given key name, for example:

checkpoint-url = 's3://my-bucket/checkpoints'

[controller]
scheduler = 'node'

[database]
type = "postgres"
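The same configuration expressed as YAML (in a *.yaml file loaded from one of the locations above) would, as a sketch, look like:

```yaml
checkpoint-url: 's3://my-bucket/checkpoints'

controller:
  scheduler: 'node'

database:
  type: postgres
```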

Environment variables

All configuration options can be set as environment variables as well. To convert a config name into an environment variable, the following rules are applied:

  1. Start with ARROYO__
  2. Replace all dots (i.e., layers of nesting) with __ (double underscore)
  3. Replace all - with _ (single underscore)
  4. Uppercase all letters

Some examples:

  • checkpoint-url => ARROYO__CHECKPOINT_URL
  • pipeline.compaction.enabled => ARROYO__PIPELINE__COMPACTION__ENABLED
  • api.bind-address => ARROYO__API__BIND_ADDRESS

Reasonable type conversions are applied to values specified as environment variables; for example, numbers and booleans will be parsed into the correct type.
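As an illustration, numeric and boolean options can be passed as plain strings and will be coerced to the right type (the particular values here are made up):

```shell
# Values are given as strings; Arroyo parses them into the
# appropriate types (an integer and a boolean here).
export ARROYO__PIPELINE__SOURCE_BATCH_SIZE=256
export ARROYO__PIPELINE__COMPACTION__ENABLED=true
```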

Options

Here we list all of the available configuration options by the key they are nested under. For example, the option in the Pipeline section listed as source-batch-size can be specified in the config file as pipeline.source-batch-size, or as a table:

[pipeline]
source-batch-size = 128

Top-level options:

Name | Description | Default Value
checkpoint-url | URL of an object store or filesystem for storing checkpoints; in a distributed cluster this must be a location available to all nodes | /tmp/arroyo/checkpoints
default-checkpoint-interval | Default checkpointing interval | 10s
api-endpoint | Endpoint of the API, used by other services to connect to it | inferred
controller-endpoint | Endpoint of the controller, used by other services to connect to it | inferred
compiler-endpoint | Endpoint of the compiler, used by other services to connect to it | inferred
disable-telemetry | Disable open-source telemetry | false
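As a sketch, a config file overriding a few of these top-level options might look like the following (the bucket name is hypothetical):

```toml
checkpoint-url = 's3://example-bucket/checkpoints'
default-checkpoint-interval = '30s'
disable-telemetry = true
```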

Pipeline

Configuration that applies to individual pipelines.

Key: pipeline

Name | Description | Default Value
source-batch-size | Max size of source batches | 512
source-batch-linger | Batch linger time (how long to wait before flushing) | 100ms
update-aggregate-flush-interval | How often to flush aggregates | 1s
allowed-restarts | How many restarts to allow before moving to failed (-1 for infinite) | 20
worker-heartbeat-timeout | Number of seconds to wait for a worker heartbeat before considering it dead | 30s
healthy-duration | After this amount of time, we consider the job to be healthy and reset the restarts counter | 2m
worker-startup-time | Amount of time to wait for workers to start up before considering them failed | 10m
task-startup-time | Amount of time to wait for tasks to start up before considering them failed | 2m
compaction.enabled | Whether to enable compaction for checkpoints | false
compaction.checkpoints-to-compact | The number of outstanding checkpoints that will trigger compaction | 4
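For example, to enable checkpoint compaction and trigger it after fewer outstanding checkpoints, a config file might contain (the threshold here is chosen purely for illustration):

```toml
[pipeline.compaction]
enabled = true
checkpoints-to-compact = 2
```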

Run (pipeline clusters)

Configuration for pipeline clusters

Key: run

Name | Description | Default Value
query | The query to run for this pipeline cluster (equivalent to the query command-line parameter) | none
state-dir | Sets the directory that state will be written to and read from | none
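A minimal sketch of a pipeline-cluster config might look like this (the query and state directory are hypothetical):

```toml
[run]
query = 'SELECT * FROM my_source;'
state-dir = '/var/lib/arroyo/state'
```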

API

Configuration for the API service

Key: api

Name | Description | Default Value
bind-address | The host the API service should bind to | 0.0.0.0
http-port | The HTTP port for the API service | 5115
run-http-port | The HTTP port for the API service in run mode; defaults to a random port | 0

Controller

Configuration for the controller service

Key: controller

Name | Description | Default Value
bind-address | The host the controller should bind to | 0.0.0.0
rpc-port | The RPC port for the controller | 5116
scheduler | The scheduler to use; one of process, kubernetes, node, or embedded | process
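As an illustration, running the controller with the Kubernetes scheduler on a non-default bind address could look like:

```toml
[controller]
bind-address = '127.0.0.1'
scheduler = 'kubernetes'
```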

Compiler

Configuration for the UDF compiler service.

Key: compiler

Name | Description | Default Value
bind-address | Bind address for the compiler | 0.0.0.0
rpc-port | Port for the compiler RPC service | 5117
install-clang | Whether the compiler should attempt to install clang if it's not already installed | true
install-rustc | Whether the compiler should attempt to install rustc if it's not already installed | true
artifact-url | Where to store compilation artifacts; in a distributed cluster this must be a location available to all nodes | /tmp/arroyo/artifacts
build-dir | Directory for build files | /tmp/arroyo/build-dir
use-local-udf-crate | Whether to use a local version of the UDF library or the published crate (only enable in development environments) | false

Admin

Configuration for the Admin service

Key: admin

Name | Description | Default Value
bind-address | Address to bind the Admin service | 0.0.0.0
http-port | Port for the Admin HTTP service | 5114

Node

Configuration for the Node service

Key: node

Name | Description | Default Value
bind-address | Address to bind the Node service | 0.0.0.0
rpc-port | Port for the Node RPC service | 5118
task-slots | Number of task slots for the Node | 16

Worker

Configuration for pipeline workers

Key: worker

Name | Description | Default Value
bind-address | Address to bind the Worker service | 0.0.0.0
rpc-port | RPC port for the worker to listen on; set to 0 to use a random available port | 0
data-port | Data port for the worker to listen on; set to 0 to use a random available port | 0
task-slots | Number of task slots for the Worker | 16
queue-size | Size of the queues between nodes in the dataflow graph | 8192

Schedulers

Configuration for the various schedulers

Process Scheduler

Key: process-scheduler

Name | Description | Default Value
slots-per-process | Number of slots per process in the scheduler | 16

Kubernetes Scheduler

Key: kubernetes-scheduler

Some values for the Kubernetes scheduler are complete Kubernetes objects; for example, worker.resources can be specified as a Kubernetes resource object.

When specifying these via environment variables, they should be encoded as YAML.

See the Kubernetes deployment docs for more details.

There are two modes for allocating resources on Kubernetes, specified via the kubernetes-scheduler.resource-mode option:

  • In per-slot mode, tasks are packed onto workers up to the task-slots config, and each slot is provided the amount of resources specified in resources. This can be much more efficient for diversely-sized pipelines.
  • In per-pod mode, every pod has exactly task-slots slots and exactly the resources in resources, even if it is scheduled for fewer slots. This was the behavior before 0.11.
Name | Description | Default Value
namespace | Kubernetes namespace for the scheduler | default
resource-mode | Resource allocation mode; per-slot or per-pod | per-slot
worker.name-prefix | Prefix for worker names | arroyo
worker.image | Docker image for workers | ghcr.io/arroyosystems/arroyo:latest
worker.image-pull-policy | Image pull policy for worker containers | IfNotPresent
worker.service-account-name | Service account name for worker containers | default
worker.resources.requests | Kubernetes resource object representing the requests for the worker pods | {cpu: "900m", memory: "500Mi"}
worker.resources.limits | Kubernetes resource object representing the limits for the worker pods | none
worker.task-slots | Number of task slots per worker | 16
worker.command | Command to start worker containers | /app/arroyo worker
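Following the rule above that complex objects are encoded as YAML when set through environment variables, the worker resource requests could be overridden like this (the sizes are illustrative):

```shell
# Complex Kubernetes objects are passed as YAML-encoded strings
# when set via environment variables.
export ARROYO__KUBERNETES_SCHEDULER__WORKER__RESOURCES__REQUESTS='{cpu: "1", memory: "1Gi"}'
```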

Database

Key: database

Name | Description | Default Value
type | Type of the database (either sqlite or postgres) | sqlite
sqlite.path | Path to the database file | $(user config dir)/arroyo/config.sqlite
postgres.database-name | Name of the Postgres database | arroyo
postgres.host | Host of the Postgres database | localhost
postgres.port | Port of the Postgres database | 5432
postgres.user | User for the Postgres database | arroyo
postgres.password | Password for the Postgres database | arroyo
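For example, switching from the default SQLite database to Postgres might look like this (the host and credentials are placeholders):

```toml
[database]
type = 'postgres'
postgres.host = 'db.example.com'
postgres.database-name = 'arroyo'
postgres.user = 'arroyo'
postgres.password = 'arroyo'
```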

Logging

Key: logging

Name | Description | Default Value
format | Set the log format (one of json, logfmt, or plaintext) | plaintext
nonblocking | Whether to use nonblocking logging; this uses more memory but ensures processing is not blocked by a high rate of logging | false
buffered-lines-limit | Number of lines to buffer before dropping logs or exerting backpressure on senders; only valid when nonblocking is set to true | 4096
enable-file-line | Whether to record the source file line in the log | false
enable-file-name | Whether to record the source file name in the log | false
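As an illustration, JSON-formatted logs with source locations enabled could be configured as:

```toml
[logging]
format = 'json'
enable-file-line = true
enable-file-name = true
```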