Arroyo is a distributed stream processing engine written in Rust, designed to efficiently perform stateful computations on streams of data. Unlike traditional batch processors, streaming engines can operate on both bounded and unbounded sources, emitting results as soon as they are available.
In short: Arroyo lets you ask complex questions of high-volume real-time data with subsecond results.
Arroyo can be self-hosted, or used via the Arroyo Cloud service managed by Arroyo Systems.
- Pipelines defined in SQL, with support for complex analytical queries
- Scales up to millions of events per second
- Stateful operations like windows and joins
- State checkpointing for fault-tolerance and pipeline recovery
- Event time processing with watermark support
Some example use cases include:
- Detecting fraud and security incidents
- Real-time product and business analytics
- Real-time ingestion into your data warehouse or data lake
- Real-time ML feature generation
- Serverless operations: Arroyo pipelines are designed to run in modern cloud environments, supporting seamless scaling, recovery, and rescheduling
- High performance SQL: SQL is a first-class concern, with consistently excellent performance
- Designed for non-experts: Arroyo cleanly separates the pipeline APIs from its internal implementation. You don’t need to be a streaming expert to build real-time data pipelines.
The easiest way to get started running Arroyo is with the single-node Docker image. This contains all of the services you need to get a test Arroyo system up and running.
$ docker run -p 8000:8000 ghcr.io/arroyosystems/arroyo-single:latest
This will run all of the core Arroyo services and a temporary Postgres database. The Web UI will be exposed at http://localhost:8000.
Note that this mode should not be used for production, as it does not persist configuration state and does not support any distributed runtimes.
Arroyo can easily be run locally as well. Currently Linux and MacOS are well supported as development environments.
To run Arroyo locally follow the dev setup instructions to get set up your environment.
Arroyo supports several deployment targets for production use, including full support for Kubernetes. See the deployment docs for more information.
The metrics exposed in the UI rely on Prometheus and the Prometheus pushgateway. You will need to run these services and configure prometheus to read from the push gateway.
Download prometheus and pushgateway for your platform from https://prometheus.io/download/. Then, create a prometheus.yml file that looks like this:
global: scrape_interval: 5s scrape_configs: - job_name: pushgateway static_configs: - targets: ['localhost:9091']
Finally, start both prometheus and the pushgateway:
$ ./pushgateway $ ./prometheus --config.file=prometheus.yml
Now you’re ready to start building your first Arroyo pipeline! Check out the tutorial to get started.
By default, Arroyo collects limited and anonymous usage data to help us understand how the system is being used and to help prioritize future development.
You can opt out of telemetry by setting
DISABLE_TELEMETRY=true when running