Write Parquet and JSON to S3 and local filesystems
The output location is set with the `path` variable in the `WITH` clause of the `CREATE TABLE` statement. The value is a URL pointing to the destination directory. The most common examples are shown below.
| Description | Example |
|---|---|
| Local file | `file:///test-data/my-cool-arroyo-pipeline` |
| S3 Path | `s3://awesome-arroyo-bucket/amazing-arroyo-dir` |
| S3 HTTP Endpoint | `https://s3.us-west-2.amazonaws.com/awesome-arroyo-bucket/amazing-arroyo-dir` |
| Local MinIO installation | `s3::http://localhost:9123/local_bucket/sweet-dir` |
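For example, a Parquet sink writing to S3 might be declared like this (a minimal sketch; the table name, columns, bucket, and the upstream `events` table are placeholders):

```sql
-- Declare a filesystem sink that writes Parquet files to S3.
CREATE TABLE s3_sink (
  time TIMESTAMP,
  user_id TEXT,
  clicks BIGINT
) WITH (
  connector = 'filesystem',
  path = 's3://awesome-arroyo-bucket/amazing-arroyo-dir',
  format = 'parquet'
);

-- Write query results into the sink.
INSERT INTO s3_sink
SELECT time, user_id, clicks FROM events;
```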
Credentials and other object-store settings can be configured with the following options:

| Field | Description | Example |
|---|---|---|
| `storage.aws_region` | Manually set the AWS region | us-east-1 |
| `storage.aws_endpoint` | Manually set the AWS endpoint | https://s3-custom-endpoint.com |
| `storage.aws_secret_access_key` | Manually set the AWS secret access key | your-secret-key |
| `storage.aws_access_key_id` | Manually set the AWS access key ID | your-access-key-id |
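These options go in the same `WITH` clause as the rest of the sink configuration. As a sketch, writing to a local MinIO installation might look like this (the endpoint, bucket, and credentials are placeholders; dotted option keys are quoted here):

```sql
CREATE TABLE minio_sink (
  value TEXT
) WITH (
  connector = 'filesystem',
  path = 's3://local_bucket/sweet-dir',
  format = 'parquet',
  -- point the S3 client at MinIO instead of AWS
  'storage.aws_endpoint' = 'http://localhost:9123',
  'storage.aws_region' = 'us-east-1',
  'storage.aws_access_key_id' = 'minio-access-key',
  'storage.aws_secret_access_key' = 'minio-secret-key'
);
```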
The supported formats are `parquet` and `json`.
The following options control how output files are sized and rolled over:

| Field | Description | Default | Example |
|---|---|---|---|
| `target_file_size` | Target number of bytes in a file before it is closed and a new file is opened | None | 100000000 |
| `target_part_size` | The target size in bytes of each part of a multipart upload. Must be at least 5 MB. | 5242880 | 10000000 |
| `max_parts` | Maximum number of multipart uploads | 1000 | 50 |
| `rollover_seconds` | Number of seconds a file will be open before it is closed and a new file is opened | 30 | 3600 |
| `inactivity_rollover_seconds` | Number of seconds a file will be open without any new data before it is closed and a new file is opened | None | 600 |
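As a sketch, a sink that rolls files over after an hour, roughly 100 MB, or 10 minutes of inactivity, whichever comes first (the path and values are illustrative):

```sql
CREATE TABLE rolling_sink (
  value TEXT
) WITH (
  connector = 'filesystem',
  path = 'file:///test-data/my-cool-arroyo-pipeline',
  format = 'parquet',
  target_file_size = '100000000',        -- ~100 MB per file
  rollover_seconds = '3600',             -- close files after an hour
  inactivity_rollover_seconds = '600'    -- or after 10 minutes without new data
);
```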
File names can be customized with the following options:

| Field | Description | Default | Example |
|---|---|---|---|
| `filename.prefix` | Prefix added to the beginning of the file name, followed by a `-` | None | my-prefix |
| `filename.suffix` | Suffix added to the end of the file name, preceded by a `-` | None | my-suffix |
| `filename.strategy` | File naming strategy to use. Supported values: `serial`, `uuid` | serial | uuid |
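For example, to write UUID-named files with a prefix and suffix (a sketch; dotted option keys are quoted here):

```sql
CREATE TABLE named_sink (
  value TEXT
) WITH (
  connector = 'filesystem',
  path = 's3://awesome-arroyo-bucket/amazing-arroyo-dir',
  format = 'json',
  'filename.prefix' = 'my-prefix',    -- file names start with "my-prefix-"
  'filename.suffix' = 'my-suffix',    -- and end with "-my-suffix"
  'filename.strategy' = 'uuid'        -- UUID file names instead of serial numbers
);
```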
When writing Parquet, the following options are also available:

| Field | Description | Default | Example |
|---|---|---|---|
| `parquet_compression` | The compression codec to use for Parquet files. Supported values: `none`, `snappy`, `gzip`, `zstd`, `lz4` | none | zstd |
| `parquet_row_batch_size` | The maximum number of rows to write per record batch | 10000 | 100 |
| `parquet_row_group_size` | The maximum number of rows to write per row group | 1000000 | 100000 |
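For instance, a zstd-compressed Parquet sink with smaller row groups might look like this (a sketch; the table name, path, and values are illustrative):

```sql
CREATE TABLE parquet_sink (
  value TEXT
) WITH (
  connector = 'filesystem',
  path = 's3://awesome-arroyo-bucket/amazing-arroyo-dir',
  format = 'parquet',
  parquet_compression = 'zstd',
  parquet_row_group_size = '100000'
);
```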
Output can be partitioned by time. Set the `time_partition_pattern` key in the sink to define the pattern. Example: `time_partition_pattern = '%Y/%m/%d/%H'`
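Put together, a time-partitioned sink might look like the following sketch. With this pattern, data for the hour 2024-05-01 13:00 would be written under a `2024/05/01/13/` subdirectory of the path (the table name, columns, and bucket are placeholders):

```sql
CREATE TABLE time_partitioned_sink (
  time TIMESTAMP,
  value TEXT
) WITH (
  connector = 'filesystem',
  path = 's3://awesome-arroyo-bucket/amazing-arroyo-dir',
  format = 'parquet',
  -- strftime-style pattern: year/month/day/hour subdirectories
  time_partition_pattern = '%Y/%m/%d/%H'
);
```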
Output can also be partitioned by fields in the data, producing paths like `field_1=X/field_2=Y`. You can set the `partition_fields` key in the sink to define the partition fields. Example: `partition_fields = 'field_1,field_2'`
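As a sketch, a sink partitioned by two fields (the field names here are the placeholders used above):

```sql
CREATE TABLE field_partitioned_sink (
  field_1 TEXT,
  field_2 TEXT,
  value BIGINT
) WITH (
  connector = 'filesystem',
  path = 's3://awesome-arroyo-bucket/amazing-arroyo-dir',
  format = 'parquet',
  -- produces subdirectories like field_1=X/field_2=Y under the path
  partition_fields = 'field_1,field_2'
);
```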
When partitioning is enabled, each sink subtask writes its own files for every partition it sees. For example, if we partition by `event_type` and have 100 distinct types, at parallelism 32 we'd end up with 3,200 files being written for each flush interval. By enabling `shuffle_by_partition`, which routes all events for a given partition to the same subtask, we reduce that to 100.
Note that this may lead to performance problems if your data is highly skewed
across your partition keys; for example, if 90% of your data is in the same
partition, those events will all end up on the same sink subtask which may
not be able to keep up with the volume.
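The exact option key may vary by Arroyo version; as a sketch, assuming it is exposed as `shuffle_by_partition.enabled`, enabling it looks like this:

```sql
CREATE TABLE shuffled_sink (
  event_type TEXT,
  value BIGINT
) WITH (
  connector = 'filesystem',
  path = 's3://awesome-arroyo-bucket/amazing-arroyo-dir',
  format = 'parquet',
  partition_fields = 'event_type',
  -- assumed option name: route each partition to a single sink subtask
  'shuffle_by_partition.enabled' = 'true'
);
```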
The filesystem connector can also read files. The following options apply when it is used as a source:

| Field | Description | Default | Example |
|---|---|---|---|
| `compression_format` | The compression format of the files to read. Supported values: `none`, `zstd`, `gzip`. Only used for JSON input. | none | gzip |
| `source.regex-pattern` | A regex pattern to match files to read. If specified, all files within the path will be evaluated against the pattern. If not specified, only files directly under the path will be read. | None | `.*\.json` |
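As a sketch, a source that reads gzipped JSON files matching a pattern (the columns must match the data; whether a `type = 'source'` option is required may depend on your Arroyo version):

```sql
CREATE TABLE json_files (
  user_id TEXT,
  clicks BIGINT
) WITH (
  connector = 'filesystem',
  type = 'source',
  path = 's3://awesome-arroyo-bucket/input-dir',
  format = 'json',
  compression_format = 'gzip',
  -- consider all files under the path whose names match this regex
  'source.regex-pattern' = '.*\.json'
);
```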