Semantics for updating tables in Arroyo SQL
With windowed aggregations like `HOP` and `TUMBLE`, we compute aggregates for a window once the
watermark passes. This is a powerful model, as it allows the streaming system to signal completeness
to its consumers, so they don't need to reason about when they can trust the results. However,
the requirement that all computations be windowed can be limiting.
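For reference, a windowed query in this model looks roughly like the following sketch (the `events` source and the one-minute width are hypothetical):

```sql
-- count events per one-minute tumbling window; each window's result
-- is emitted once the watermark passes the end of that window
SELECT count(*)
FROM events
GROUP BY tumble(interval '1 minute');
```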
Updating semantics, on the other hand, allow for more flexibility in the types of queries that can be
expressed, including most of normal batch SQL. They work by treating the input stream as a table that
is constantly being updated, and the output as updates to a materialized view of the query.
When writing to a sink, the output is a stream of updates (in
Debezium format), representing additions, updates, and deletes
to the materialized view.
Updating sources are created by setting the format to `'debezium_json'`, which tells Arroyo
to read Debezium-formatted messages.
Updating sources need at least one primary key, which tells Arroyo which
rows are logically the same. The primary key is specified in the DDL; here is a minimal sketch,
assuming a Kafka connector and hypothetical topic and field names:
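```sql
-- a sketch of an updating source; the connector options, topic, and
-- fields are illustrative
CREATE TABLE products (
  id INT PRIMARY KEY, -- identifies which rows are logically the same
  name TEXT,
  price FLOAT
) WITH (
  connector = 'kafka',
  bootstrap_servers = 'localhost:9092',
  topic = 'products',
  type = 'source',
  format = 'debezium_json'
);
```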
On the output side, updating queries can write to sinks using the `debezium_json`
format. This output can then be consumed by the Debezium sink connector
to write to an RDBMS like MySQL or Postgres.
For a complete example of this, see this tutorial.
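Continuing the sketch above, a Debezium-formatted sink could be declared and fed like this (the names and connector options are again hypothetical):

```sql
CREATE TABLE products_sink (
  id INT,
  name TEXT,
  price FLOAT
) WITH (
  connector = 'kafka',
  bootstrap_servers = 'localhost:9092',
  topic = 'products_out',
  type = 'sink',
  format = 'debezium_json'
);

-- an updating (non-windowed) query; its inserts, updates, and deletes
-- are written to the sink as Debezium records
INSERT INTO products_sink
SELECT id, name, price
FROM products
WHERE price > 10;
```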
To bound how long Arroyo retains state for updating queries, set a time-to-live with the
`SET updating_ttl` command, which
takes a SQL interval. For example, to set the TTL to 1 hour:
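```sql
-- a sketch; assumes standard SQL interval syntax
SET updating_ttl = INTERVAL '1 hour';
```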
The rate at which updating aggregates flush their results downstream is controlled by the
`pipeline.update-aggregate-flush-interval` config option.
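As a sketch, assuming a TOML-style config file (the key placement and the `1s` value here are illustrative):

```toml
[pipeline]
update-aggregate-flush-interval = "1s"
```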
For instance, consider the following query (a sketch, assuming a hypothetical `events` source):
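```sql
SELECT count(*) AS rows FROM events;
```

As events arrive and the count grows, it produces a sequence of updates like this: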
| before          | after           | op  |
|-----------------|-----------------|-----|
| null            | { "rows": 100 } | "c" |
| { "rows": 100 } | { "rows": 200 } | "u" |
| { "rows": 200 } | { "rows": 300 } | "u" |
| { "rows": 300 } | { "rows": 400 } | "u" |
| { "rows": 400 } | null            | "d" |