Creating a demo source
Arroyo includes several generated data sources we can use to try out the engine. To create one, go to the Connections tab on the left menu and click “Create Connection.” From there, you can choose what type of connection you would like to create. For this tutorial, we’re going to create a Nexmark source. Nexmark is a standard benchmark for streaming systems that involves various query patterns over a data set from a simulated auction website. Find Nexmark in the list of connectors and click “Create.” Then set the desired event rate; for this tutorial, 100 messages/sec should suffice. Leave the other fields blank. Click next, give the source the name nexmark, click “Test Connection,” and then “Create” to finish creating the source.
Creating a pipeline
Now that we’ve created a source to query, we can start to build pipelines by writing SQL. Navigate to the Pipelines page on the left sidebar, then click “Create Pipeline.” This opens up the SQL editor. On the left is a list of tables that we can query (our catalog). The main UI consists of a text editor where we can write SQL, and actions we can take with that query, like previewing it on our data set.
First query
Let’s start with a simple query. Type this into the editor:
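For example, a simple filtering query over the Nexmark bid events might look like the following (a sketch, assuming the source is named nexmark and exposes a nullable bid field, as in the standard Nexmark schema):

```sql
-- Select all bid events from the generated Nexmark source
-- (assumes a nullable bid column on the nexmark table)
SELECT bid
FROM nexmark
WHERE bid IS NOT NULL;
```

Previewing a query like this should show a stream of bid events flowing from the generated source.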
Windows
Many streaming pipelines involve working with time in some way. Arroyo supports a few different ways of expressing computations over the time characteristic of our data. Let’s add a sliding window (called hop in SQL) to perform a time-oriented aggregation:
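One way to write such a query is to count bids per auction over a sliding window (a sketch, assuming a hop(slide, width) function and the nested bid.auction field from the Nexmark schema):

```sql
-- Count bids per auction over a 60-second window that slides every 2 seconds
-- (assumes hop(slide, width) and the nested bid.auction field)
SELECT bid.auction AS auction,
       hop(INTERVAL '2 seconds', INTERVAL '60 seconds') AS window,
       count(*) AS count
FROM nexmark
WHERE bid IS NOT NULL
GROUP BY 1, 2;
```

A query along these lines could be extended (for example, with a ROW_NUMBER() window function) to keep only the top auctions per window, which is the kind of result suggested by the pipeline name used in the next section.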
Running the pipeline
Once we’re happy with it, we can run the pipeline for real. Click the “Launch” button and give it a name (for example, “top_auctions”). Because we didn’t specify a sink in the query (via an INSERT INTO statement), Arroyo will automatically add a Web sink so that we can view the results in the Web UI. Click “Start” to create the pipeline.
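For comparison, writing results to an explicit sink looks roughly like the following (a sketch; my_sink is a hypothetical sink table that would need to be created as a connection first):

```sql
-- Hypothetical example: my_sink stands in for a sink connection defined separately
INSERT INTO my_sink
SELECT bid
FROM nexmark
WHERE bid IS NOT NULL;
```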
This will start the pipeline. Once it’s running, we can click nodes on the pipeline dataflow graph and see metrics for that operator. Clicking into the Outputs tab, we can tail the results for the pipeline. The Checkpoints tab shows metadata for the pipeline’s checkpoints. Arroyo regularly takes consistent checkpoints of the state of the pipeline (using a variation of the asynchronous barrier snapshotting algorithm described in this paper) so that we can recover from failure.
We can also control execution of the pipeline, stopping and starting it. Before stopping the pipeline, Arroyo takes a final checkpoint so that we can restart it without any data loss.