Formats determine how Arroyo deserializes messages read from external systems and serializes the records it writes back out. The format for a connection table is configured with the `format` option in SQL.
JSON
JSON is a common format for data interchange. Arroyo supports two flavors of JSON:
- `json` — JSON data in any format
- `debezium_json` — JSON data in the format produced by Debezium, for reading from and writing to relational databases like Postgres

The following options are supported:
| Option | Description | Default |
|---|---|---|
| `format` | The format of the data. Must be one of `json` or `debezium_json`. | `json` |
| `json.confluent_schema_registry` | Set to `true` if the data was produced by (or will be consumed via) a Confluent Schema Registry-connected source | `false` |
| `json.include_schema` | Set to `true` to include the schema in the output, allowing it to be used with Kafka Connect connectors that require a schema | `false` |
| `json.unstructured` | Set to `true` to treat the data as unstructured JSON, which will be parsed as a single column of type `TEXT` | `false` |
| `json.timestamp_format` | The format of timestamps in the data. May be one of `rfc3339` or `unix_millis`. | `rfc3339` |
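To illustrate how these options fit together, here is a minimal sketch of a Kafka source table using the `json` format; the broker address, topic, and columns are hypothetical placeholders:

```sql
-- Kafka source that deserializes each message as structured JSON
CREATE TABLE orders (
    order_id TEXT,
    amount DOUBLE,
    placed_at TIMESTAMP
) WITH (
    connector = 'kafka',
    bootstrap_servers = 'localhost:9092',  -- placeholder broker
    topic = 'orders',                      -- placeholder topic
    type = 'source',
    format = 'json',
    'json.timestamp_format' = 'rfc3339'    -- the default, shown for clarity
);
```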
Unstructured JSON tables have a single column named `value` with type `TEXT`, containing the JSON-serialized data. This column can be operated on using SQL JSON functions or UDFs.
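As a sketch, an unstructured JSON source and a query over it might look like the following; the connection details are placeholders, and `extract_json_string` is assumed here as the available JSON-extraction function:

```sql
-- Source whose messages are kept as raw JSON text in a single `value` column
CREATE TABLE raw_events (
    value TEXT
) WITH (
    connector = 'kafka',
    bootstrap_servers = 'localhost:9092',
    topic = 'events',
    type = 'source',
    format = 'json',
    'json.unstructured' = 'true'
);

-- Pull a field out of the JSON (extract_json_string is an assumed function)
SELECT extract_json_string(value, '$.user_id') AS user_id
FROM raw_events;
```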
Avro
Avro is a binary data format that is commonly used for data applications. Avro is a schema-based format, and requires readers and writers to have access to the schema in order to read and write data. Avro is well-supported by the Kafka ecosystem, where the Confluent Schema Registry is a popular choice for storing and serving Avro schemas. Arroyo is able to read schemas from and register schemas with the Schema Registry.

The following options are supported:

| Option | Description | Default |
|---|---|---|
| `format` | The format of the data. Must be `avro`. | `avro` |
| `avro.confluent_schema_registry` | Set to `true` if the data was produced by (or will be consumed via) a Confluent Schema Registry-connected source | `false` |
| `avro.raw_datums` | Set to `true` to serialize and deserialize as raw Avro datums instead of complete Avro records | `false` |
| `avro.into_unstructured_json` | Convert the Avro record to JSON | `false` |

Avro data may be represented in two ways:
- As a raw Avro datum, which is just the data itself
- As a complete Avro document, which includes the schema and metadata
The `avro.raw_datums` option controls which representation Arroyo uses. The `avro.into_unstructured_json` option, if set, will cause the Avro data to be deserialized and re-serialized to JSON, which can then be operated on using SQL JSON functions or UDFs. This can be useful if the Avro schema for the data may change, and offers flexibility in how the data is processed.
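Here is a hedged sketch of an Avro source that fetches its schema from the Schema Registry and exposes each record as unstructured JSON. It assumes a Kafka connection with a Schema Registry already configured, and all names are placeholders:

```sql
-- Avro records are deserialized via the registry schema, then re-serialized
-- to JSON text in the single `value` column
CREATE TABLE avro_events (
    value TEXT
) WITH (
    connector = 'kafka',
    bootstrap_servers = 'localhost:9092',
    topic = 'avro-events',
    type = 'source',
    format = 'avro',
    'avro.confluent_schema_registry' = 'true',
    'avro.into_unstructured_json' = 'true'
);
```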
Schema Registry Integration
For Kafka sources configured via the Web UI, Arroyo is able to automatically fetch the schema from the Schema Registry and use it to determine the schema for the table. When using Avro with a Kafka sink, Arroyo will automatically register the schema with the Schema Registry so long as the `avro.confluent_schema_registry` option is set to `true`. This allows the schema to be used by other applications that read from the same topic.
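A sketch of an Avro sink that registers its schema on write, again assuming a Kafka connection with a Schema Registry configured (all names are placeholders):

```sql
-- Sink that writes Avro and registers the generated schema with the registry
CREATE TABLE orders_avro_sink (
    order_id TEXT,
    amount DOUBLE
) WITH (
    connector = 'kafka',
    bootstrap_servers = 'localhost:9092',
    topic = 'orders-avro',
    type = 'sink',
    format = 'avro',
    'avro.confluent_schema_registry' = 'true'
);

INSERT INTO orders_avro_sink
SELECT order_id, amount FROM orders;
```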
Protobuf
Protocol Buffers is a binary data format developed by Google that is commonly used for data interchange. Arroyo supports reading Protobuf data provided that the schema is available, along with support for fetching schemas from Confluent Schema Registry. Protobuf schemas are defined using the standard `.proto` syntax. Sources with Protobuf schemas must currently be created via the Web UI or API, and are not yet supported in SQL.
Schema registry
As with Avro, Protobuf schemas can be fetched automatically from the Confluent Schema Registry when the source is configured via the Web UI.
Raw string
To ingest or emit arbitrary string data (encoded as UTF-8), you can use the `raw_string` format. Raw string tables have a single column named `value` with type `TEXT`.
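For example, a Kafka source using the `raw_string` format might be declared like this (a minimal sketch; the broker and topic are placeholders):

```sql
-- Each Kafka message becomes one row, with its UTF-8 payload in `value`
CREATE TABLE logs (
    value TEXT
) WITH (
    connector = 'kafka',
    bootstrap_servers = 'localhost:9092',
    topic = 'logs',
    type = 'source',
    format = 'raw_string'
);
```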
Raw bytes
The raw bytes format allows users to ingest and emit arbitrary binary data. Together with UDFs, this allows you to implement binary formats that are internal or otherwise not supported natively by Arroyo. Raw bytes tables have a single column named `value` with type `BYTEA`; in Rust UDFs, a `BYTEA` argument is received as `&[u8]`.
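For example, a raw bytes source might be declared as below; `parse_frame` is a hypothetical UDF that decodes the binary payload:

```sql
-- Each message's raw bytes land in the single BYTEA `value` column
CREATE TABLE frames (
    value BYTEA
) WITH (
    connector = 'kafka',
    bootstrap_servers = 'localhost:9092',
    topic = 'frames',
    type = 'source',
    format = 'raw_bytes'
);

-- parse_frame is a hypothetical UDF receiving the payload as &[u8]
SELECT parse_frame(value) AS frame FROM frames;
```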