Joins
Joining data in Arroyo SQL
Arroyo SQL supports various types of joins over streaming and batch data sources.
Window joins
Window joins operate over bounded time-oriented windows (one of TUMBLE, HOP, or SESSION). Each subquery being joined must be computed over the exact same window, and all matching elements within that window (according to the type of join) will be returned.
For example, a join over TUMBLE(interval '1 minute')
will be computed once a minute, and return
all matching elements within that minute. Window completions are triggered by the arrival of watermarks,
as described here.
Window joins default to INNER joins, but you may also specify LEFT
, RIGHT
, or FULL
before the JOIN
keyword.
A full example of a window join looks like this:
Note that it is not currently possible to reaggregate the results of a windowed join.
Updating joins
Unlike window joins, updating joins do not require inputs to be time-windowed. These
behave like standard SQL joins, but are computed incrementally and produce updating output
(a stream of delete, append, and update results), which must be sent to an update-compatible
sink (for example, using the debezium_json
format).
An example updating join looks like this:
As updating joins are not time-bounded, there is nothing that bounds the state
size; in principle they must remember all data the pipeline has seen. This is
not usually practical, so Arroyo by default uses a TTL of 1 day after which join
state is discarded. This may be configured via SET updating_ttl
. See the
updating docs for more
details.
Lookup joins
Lookup joins allow you to enrich streams by referencing external or static data
stored in external systems (e.g., Redis, relational databases). The right side
of a lookup join is a special kind of table, called a TEMPORARY TABLE
, which
is unmaterialized and backed by the external system. Each record that comes into
the join from the left (stream) side causes a query to the underlying system.
For example, you may have a set of detailed customer records stored in an RDBMS that you would like to use to enrich an analytics stream.
Currently, Redis is supported as a lookup connector.
An example lookup join looks like this:
The lookup.cache.max_bytes
and lookup.cache.ttl
are optional arguments that
control the behavior of the built-in cache, which avoids the need to query the
same keys over and over again.
Lookup joins can be either INNER (the default) or LEFT.