Customizing the system behavior
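Arroyo reads its configuration from the following sources:
- ARROYO__* environment variables
- The config file passed with the --config option
- Config files in the --config-dir directory
- arroyo.toml in the current directory
- $(user conf dir)/arroyo/config.{toml,yaml} (this is ~/.config/arroyo on Linux and ~/Library/Application Support/arroyo on macOS)

To set an option through an environment variable, prefix the upper-cased option path with ARROYO__, separate nested sections with __ (double underscore), and replace - with _ (single underscore). For example:
- checkpoint-url => ARROYO__CHECKPOINT_URL
- pipeline.compaction.enabled => ARROYO__PIPELINE__COMPACTION__ENABLED
- api.bind-address => ARROYO__API__BIND_ADDRESS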
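
In a config file, options that belong to a section are nested under that section's name. For example, the source-batch-size option of the pipeline section would be specified in the config file either as the dotted key pipeline.source-batch-size or as an entry in a pipeline table.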
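
The two config-file forms described above can be sketched in TOML as follows. This is an illustrative fragment rather than a complete arroyo.toml; the value 512 is the documented default for the option.

```toml
# Form 1: dotted key.
pipeline.source-batch-size = 512

# Form 2: the same option written inside a [pipeline] table.
# Use one form or the other; defining the same key twice is a TOML error.
# [pipeline]
# source-batch-size = 512

# Environment-variable equivalent, following the naming rule above:
# ARROYO__PIPELINE__SOURCE_BATCH_SIZE=512
```

The options are grouped into sections below. The first table lists the top-level options.
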
Name | Description | Default Value |
---|---|---|
checkpoint-url | URL of an object store or filesystem for storing checkpoints; in a distributed cluster this must be a location available to all nodes | /tmp/arroyo/checkpoints |
default-checkpoint-interval | Default checkpointing interval | 10s |
api-endpoint | Endpoint of the API, used by other services to connect to it | inferred |
controller-endpoint | Endpoint of the controller, used by other services to connect to it | inferred |
compiler-endpoint | Endpoint of the compiler, used by other services to connect to it | inferred |
disable-telemetry | Disable open-source telemetry | false |
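
As a rough sketch, top-level options might look like this in a TOML config file. The bucket name is hypothetical, and writing durations as quoted strings such as "10s" is an assumption based on how the defaults are shown in the table.

```toml
# Top-level keys must come before any [table] header in a TOML file.
checkpoint-url = "s3://example-bucket/arroyo/checkpoints"  # hypothetical bucket
default-checkpoint-interval = "10s"                         # assumed string form
disable-telemetry = true
```
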
pipeline
Name | Description | Default Value |
---|---|---|
source-batch-size | Max size of source batches | 512 |
source-batch-linger | Batch linger time (how long to wait before flushing) | 100ms |
update-aggregate-flush-interval | How often to flush aggregates | 1s |
allowed-restarts | How many restarts to allow before moving to failed (-1 for infinite) | 20 |
worker-heartbeat-timeout | How long to wait for a worker heartbeat before considering the worker dead | 30s |
healthy-duration | After this amount of time, we consider the job to be healthy and reset the restarts counter | 2m |
worker-startup-time | Amount of time to wait for workers to start up before considering them failed | 10m |
task-startup-time | Amount of time to wait for tasks to start up before considering them failed | 2m |
compaction.enabled | Whether to enable compaction for checkpoints | false |
compaction.checkpoints-to-compact | The number of outstanding checkpoints that will trigger compaction | 4 |
chaining.enabled | Whether to enable operator chaining, which reduces the number of operators in the pipeline | false |
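
A sketch of how the pipeline options above, including the nested compaction and chaining settings, might be laid out in TOML. Values are the documented defaults except compaction.enabled, which is switched on for illustration; quoting durations as strings is an assumption.

```toml
[pipeline]
source-batch-size = 512
source-batch-linger = "100ms"   # assumed string form for durations
allowed-restarts = 20           # -1 allows unlimited restarts
healthy-duration = "2m"

# compaction.* options nest one level deeper
[pipeline.compaction]
enabled = true
checkpoints-to-compact = 4

[pipeline.chaining]
enabled = false
```

Following the naming rule for environment variables, the compaction switch corresponds to ARROYO__PIPELINE__COMPACTION__ENABLED.
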
run
Name | Description | Default Value |
---|---|---|
query | The query to run for this pipeline cluster (equivalent to the query command-line parameter) | none |
state-dir | Sets the directory that state will be written to and read from | none |
api
Name | Description | Default Value |
---|---|---|
bind-address | The host the API service should bind to | 0.0.0.0 |
http-port | The HTTP port for the API service | 5115 |
run-http-port | The HTTP port for the API service in run mode; defaults to a random port | 0 |
controller
Name | Description | Default Value |
---|---|---|
bind-address | The host the controller should bind to | 0.0.0.0 |
rpc-port | The RPC port for the controller | 5116 |
scheduler | The scheduler to use; one of process, kubernetes, node, or embedded | process |
compiler
Name | Description | Default Value |
---|---|---|
bind-address | Bind address for the compiler | 0.0.0.0 |
rpc-port | Port for the Compiler RPC service | 5117 |
install-clang | Whether the compiler should attempt to install clang if it’s not already installed | true |
install-rustc | Whether the compiler should attempt to install rustc if it’s not already installed | true |
artifact-url | Where to store compilation artifacts; in a distributed cluster this must be a location available to all nodes | /tmp/arroyo/artifacts |
build-dir | Directory for build files | /tmp/arroyo/build-dir |
use-local-udf-crate | Whether to use a local version of the UDF library or the published crate (only enable in development environments) | false |
admin
Name | Description | Default Value |
---|---|---|
bind-address | Address to bind the Admin service | 0.0.0.0 |
http-port | Port for the Admin HTTP service | 5114 |
node
Name | Description | Default Value |
---|---|---|
bind-address | Address to bind the Node service | 0.0.0.0 |
rpc-port | Port for the Node RPC service | 5118 |
task-slots | Number of task slots for the Node | 16 |
worker
Name | Description | Default Value |
---|---|---|
bind-address | Address to bind the Worker service | 0.0.0.0 |
rpc-port | RPC port for the worker to listen on; set to 0 to use a random available port | 0 |
data-port | Data port for the worker to listen on; set to 0 to use a random available port | 0 |
task-slots | Number of task slots for the Worker | 16 |
queue-size | Size of the queues between nodes in the dataflow graph | 8192 |
process-scheduler
Name | Description | Default Value |
---|---|---|
slots-per-process | Number of slots per process in the scheduler | 16 |
kubernetes-scheduler
Some values for the Kubernetes scheduler are complete Kubernetes objects; for example, worker.resources can be specified as a Kubernetes resource object. When specifying these via environment variables, encode them as YAML. See the Kubernetes deployment docs for more details.
There are two modes for allocating resources for Kubernetes, selected with the kubernetes-scheduler.resource-mode option:
- In per-slot mode, tasks are packed onto workers up to the task-slots config, and for each slot the amount of resources specified in resources is provided. This can be much more efficient for diversely-sized pipelines.
- In per-pod mode, every pod has exactly task-slots slots and exactly the resources in resources, even if it is scheduled for fewer slots. This is the behavior from before 0.11.

Name | Description | Default Value |
---|---|---|
namespace | Kubernetes namespace for the scheduler | default |
resource-mode | Resource allocation mode; per-slot or per-pod | per-slot |
worker.name-prefix | Prefix for worker names | arroyo |
worker.image | Docker image for workers | ghcr.io/arroyosystems/arroyo:latest |
worker.image-pull-policy | Image pull policy for worker containers | IfNotPresent |
worker.service-account-name | Service account name for worker containers | default |
worker.resources.requests | Kubernetes resource object representing the requests for the worker pods | {cpu: "900m", memory: "500Mi"} |
worker.resources.limits | Kubernetes resource object representing the limits for the worker pods | none |
worker.task-slots | Number of task slots per worker | 16 |
worker.command | Command to start worker containers | /app/arroyo worker |
worker.env | List of environment variables for worker containers, each a k8s-style map with name and value keys | none |
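
To illustrate the resource modes, here is a sketch of a Kubernetes scheduler section in TOML. It uses the documented defaults; the exact nesting of the worker.resources keys is inferred from the option names in the table and should be checked against the Kubernetes deployment docs.

```toml
[controller]
scheduler = "kubernetes"

[kubernetes-scheduler]
namespace = "default"
resource-mode = "per-slot"   # or "per-pod"

[kubernetes-scheduler.worker]
task-slots = 16

# Kubernetes resource object for the worker pods
[kubernetes-scheduler.worker.resources.requests]
cpu = "900m"
memory = "500Mi"
```

Reading the mode descriptions literally, a worker that is scheduled with 4 slots would request 4 × 900m = 3600m of CPU and 4 × 500Mi = 2000Mi of memory in per-slot mode, while in per-pod mode it would request 900m and 500Mi regardless of how many of its 16 slots are used.
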
database
Name | Description | Default Value |
---|---|---|
type | Type of the database (either sqlite or postgres) | sqlite |
sqlite.path | Path to the database file | $(user config dir)/arroyo/config.sqlite |
postgres.database-name | Name of the Postgres database | arroyo |
postgres.host | Host of the Postgres database | localhost |
postgres.port | Port of the Postgres database | 5432 |
postgres.user | User for the Postgres database | arroyo |
postgres.password | Password for the Postgres database | arroyo |
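
As an example, switching from the default SQLite database to Postgres might look like this in TOML. The connection values shown are the documented defaults; a real deployment would substitute its own.

```toml
[database]
type = "postgres"

[database.postgres]
database-name = "arroyo"
host = "localhost"
port = 5432
user = "arroyo"
password = "arroyo"
```

By the environment-variable naming rule, the host would correspond to ARROYO__DATABASE__POSTGRES__HOST.
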
logging
Name | Description | Default Value |
---|---|---|
format | Set the log format (one of json, logfmt, or plaintext) | plaintext |
nonblocking | Whether to use nonblocking logging; this uses more memory but ensures processing is not blocked by a high rate of logging | false |
buffered-lines-limit | Number of lines to buffer before dropping logs or exerting backpressure on senders; only valid when nonblocking is set to true | 4096 |
enable-file-line | Whether to record the source file line in the log | false |
enable-file-name | Whether to record the source file name in the log | false |
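
A sketch of a logging section that emits JSON logs without blocking processing, assuming the options map directly onto a [logging] table:

```toml
[logging]
format = "json"
nonblocking = true
# Only takes effect because nonblocking is true
buffered-lines-limit = 4096
enable-file-name = true
enable-file-line = true
```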