Distributed dataflow
The dataflow can be described as a directed acylic graph (DAG). Each node performs some (possibly stateful) computation, while data flows between them along the edges. Because this is a distributed dataflow, each node in the graph can be horizontally subdivided into multiple parallel subtasks. Each subtask is responsible for a subset of the key space. There are two types of edges between nodes:- forward edges pass data to a single downstream subtask
- shuffle edges pass data to all downstream subtasks
source [0]
and source [1]
represents two subtasks of the source task (and similarly for the join).
Each subtask may be scheduled independently. If two communicating subtasks are scheduled on the same worker, the
dataflow is performed via in-memory queues; otherwise the dataflow is performed via Arroyo’s network stack which forms
a logical MxN connection between every communicating subtask.