File System

Arroyo can read and write Parquet and JSON files on object stores and local filesystems. The FileSystem connector supports the local filesystem, S3 (including S3-compatible stores like MinIO), and GCS. When used as a sink, Arroyo produces complete files in line with the checkpointing system, so the file system sink writes all data exactly once. Against S3, this is done by tracking multipart uploads within the state store, which allows Arroyo to resume an in-progress upload after a failure. When used as a source, Arroyo reads all files to completion, at which point the job finishes.

Common Configuration

Both the source and sink versions of the connector use Arroyo’s StorageBackend, which is a generalization of an object store. The location within the StorageBackend is configured via the `path` option in the WITH clause of the CREATE TABLE statement. The value is a URL pointing to the directory to read from or write to. The most common forms are shown below.

Description	Example
Local file	`file:///test-data/my-cool-arroyo-pipeline`
S3 Path	`s3://awesome-arroyo-bucket/amazing-arroyo-dir`
S3 HTTP Endpoint	`https://s3.us-west-2.amazonaws.com/awesome-arroyo-bucket/amazing-arroyo-dir`
Local MinIO installation	`s3::http://localhost:9123/local_bucket/sweet-dir`

Additional Backend Configuration

The StorageBackend can be passed additional configuration options. Currently these are only supported for S3-API-based backends (including MinIO), and they are namespaced with a `storage.` prefix. They let you pass in custom endpoints, credentials, and regions. The most common options are listed below.

Field	Description	Example
`storage.aws_region`	Manually set the AWS region	`us-east-1`
`storage.aws_endpoint`	Manually set the AWS endpoint	`https://s3-custom-endpoint.com`
`storage.aws_secret_access_key`	Manually set the AWS secret access key	`your-secret-key`
`storage.aws_access_key_id`	Manually set the AWS access key ID	`your-access-key-id`
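
As a rough illustration of how these options fit together, the sketch below points a sink at an S3-compatible store behind a custom endpoint. The table name, columns, bucket, endpoint, and credentials are placeholders, and quoting the dotted `storage.` keys is an assumption based on how other dotted keys are written on this page.

```sql
-- Hypothetical sink writing JSON files through a custom S3 endpoint.
-- Table name, columns, and all option values are placeholders.
CREATE TABLE custom_endpoint_events (
  id bigint,
  message text
) WITH (
  connector = 'filesystem',
  type = 'sink',
  path = 's3://awesome-arroyo-bucket/amazing-arroyo-dir',
  format = 'json',
  -- Dotted keys are quoted here, as with 'source.regex-pattern' (assumed to apply to storage.* too)
  'storage.aws_region' = 'us-east-1',
  'storage.aws_endpoint' = 'https://s3-custom-endpoint.com',
  'storage.aws_access_key_id' = 'your-access-key-id',
  'storage.aws_secret_access_key' = 'your-secret-key'
);
```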

Format

Both sources and sinks require a `format` option. The supported values are `parquet` and `json`.

Sink Specific Configuration

Output File Options

Field	Description	Default	Example
`target_file_size`	Target number of bytes in a file before it is closed and a new file is opened	`None`	`100000000`
`target_part_size`	The target size in bytes of each part of a multipart upload. Must be at least 5MB.	`5242880`	`10000000`
`max_parts`	Maximum number of multipart uploads	`1000`	`50`
`rollover_seconds`	Number of seconds a file will be open before it is closed and a new file is opened	`30`	`3600`
`inactivity_rollover_seconds`	Number of seconds a file will be open without any new data before it is closed and a new file is opened	`None`	`600`
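
The rollover options can be combined in one sink. The sketch below uses the example values from the table and assumes that a file is closed as soon as any one of the conditions is met. The table name and columns are hypothetical, and the numeric values are written unquoted to match the `rollover_seconds = 60` style used in the sink DDL on this page.

```sql
-- Hypothetical sink that rolls files by size, age, and inactivity.
CREATE TABLE rolled_events (
  id bigint,
  message text
) WITH (
  connector = 'filesystem',
  type = 'sink',
  path = 's3://awesome-arroyo-bucket/amazing-arroyo-dir',
  format = 'json',
  target_file_size = 100000000,      -- target of 100,000,000 bytes per file
  rollover_seconds = 3600,           -- keep a file open for at most one hour
  inactivity_rollover_seconds = 600  -- close a file after 10 minutes without new data
);
```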

File Naming Options

By default Arroyo writes sequential files, tracking a max_file_index as well as including the subtask. However, there are several alternate naming schemes available.

Field	Description	Default	Example
`filename.prefix`	Prefix that will be prepended to the file name, followed by a `-`	`None`	`my-prefix`
`filename.suffix`	Suffix that will be appended to the end of the file name, preceded by a `-`	`None`	`my-suffix`
`filename.strategy`	Filenaming strategy to use. Supported values: `serial`, `uuid`	`serial`	`uuid`
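
A minimal sketch of a sink that switches to UUID-based names with a prefix and suffix is shown below. The table name and columns are hypothetical, and the dotted keys are quoted on the assumption that they follow the same convention as `'source.regex-pattern'`.

```sql
-- Hypothetical sink using the uuid naming strategy with a prefix and suffix.
CREATE TABLE named_events (
  id bigint,
  message text
) WITH (
  connector = 'filesystem',
  type = 'sink',
  path = 'file:///test-data/my-cool-arroyo-pipeline',
  format = 'json',
  'filename.strategy' = 'uuid',     -- replaces the default 'serial' strategy
  'filename.prefix' = 'my-prefix',  -- placed before the generated name, followed by '-'
  'filename.suffix' = 'my-suffix'   -- placed after the generated name, preceded by '-'
);
```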

Parquet Options

Parquet has a few supported options. Please reach out if you’d like to expand the settings allowed.

Field	Description	Default	Example
`parquet_compression`	The compression codec to use for Parquet files. Supported values: `none`, `snappy`, `gzip`, `zstd`, `lz4`.	`none`	`zstd`
`parquet_row_batch_size`	The maximum number of rows to write per record batch	`10000`	`100`
`parquet_row_group_size`	The maximum number of rows to write per row group	`1000000`	`100000`
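
For instance, a Parquet sink that sets all three options might look like the sketch below. The table name and columns are hypothetical, the values are the examples from the table, and the unquoted numbers follow the style of `rollover_seconds = 60` elsewhere on this page.

```sql
-- Hypothetical Parquet sink with explicit compression and sizing options.
CREATE TABLE compressed_events (
  id bigint,
  message text
) WITH (
  connector = 'filesystem',
  type = 'sink',
  path = 's3://awesome-arroyo-bucket/amazing-arroyo-dir',
  format = 'parquet',
  parquet_compression = 'zstd',    -- one of: none, snappy, gzip, zstd, lz4
  parquet_row_batch_size = 100,    -- at most 100 rows per record batch
  parquet_row_group_size = 100000  -- at most 100,000 rows per row group
);
```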

Partitioning Options

Arroyo supports partitioning of outputs. There are two types of partitioning: event time-based and field-based. You can use either or both of these types of partitioning. If both are used, the time-based partitioning is placed prior to the field-based partitioning.

Event Time-based Partitioning

Event time partitioning formats each record’s `event_time` using a strftime-style format string. Set the `time_partition_pattern` key in the sink to define the pattern. Example: `time_partition_pattern = '%Y/%m/%d/%H'`

Field-based Partitioning

Field-based partitioning produces Hive-style partition directories, so partitioning on `field_1` and `field_2` results in folders like `field_1=X/field_2=Y`. Set the `partition_fields` key in the sink to define the partition fields. Example: `partition_fields = 'field_1,field_2'`
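
To make the combined layout concrete, the sketch below sets both kinds of partitioning and notes in a comment the directory that one sample record would land in. The table, columns, and sample record are hypothetical. The directory follows from the ordering rule described above, and the file name inside it is left out because it depends on the naming options.

```sql
-- Hypothetical sink partitioned by event time, then by two fields.
CREATE TABLE partitioned_events (
  id bigint,
  region text,
  account_id text
) WITH (
  connector = 'filesystem',
  type = 'sink',
  path = 's3://awesome-arroyo-bucket/amazing-arroyo-dir',
  format = 'parquet',
  time_partition_pattern = '%Y/%m/%d/%H',
  partition_fields = 'region,account_id'
);

-- A record whose event time formats as 2024/03/05/14, with
-- region = 'us-west' and account_id = '42', would be written under:
--   <path>/2024/03/05/14/region=us-west/account_id=42/
```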

Shuffle by partition

When using field-based partitioning and high parallelism, you may end up with many files; typically each sink subtask will write a file for every partition key. To avoid this, you can configure the dataflow to insert a shuffle step before the sink, which ensures that all records for a particular partition key end up on the same sink subtask:

'shuffle_by_partition.enabled' = true

For example, if our partition key is `event_type` and we have 100 distinct types, at parallelism 32 we’d end up with 100 × 32 = 3,200 files being written for each flush interval. By enabling `shuffle_by_partition`, we reduce that to 100. Note that this may lead to performance problems if your data is highly skewed across your partition keys; for example, if 90% of your data is in the same partition, those events will all end up on the same sink subtask, which may not be able to keep up with the volume.
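
Put together, a sink for the `event_type` scenario above might be declared roughly as follows. The table name and columns are hypothetical; only the `partition_fields` and `shuffle_by_partition.enabled` options come from this page.

```sql
-- Hypothetical sink partitioned by event_type with the shuffle step enabled,
-- so that each partition key is handled by a single sink subtask.
CREATE TABLE events_by_type (
  id bigint,
  event_type text,
  payload text
) WITH (
  connector = 'filesystem',
  type = 'sink',
  path = 's3://awesome-arroyo-bucket/amazing-arroyo-dir',
  format = 'parquet',
  partition_fields = 'event_type',
  'shuffle_by_partition.enabled' = true
);
```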

Source Specific Configuration

When using the file system source, the following options are available:

Field	Description	Default	Example
`compression_format`	The compression format of the files to read. Supported values: `none`, `zstd`, `gzip`. Only used for JSON input.	`none`	`gzip`
`source.regex-pattern`	A regex pattern to match files to read. If specified, all files within the path are evaluated against the pattern. If not specified, only files directly under the path are read.	`None`	`.*\.json`
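
As a sketch of both options used together, the source below reads gzip-compressed JSON files. The table name, columns, and bucket are hypothetical, and the pattern assumes the files are named with a `.json.gz` extension.

```sql
-- Hypothetical source reading gzip-compressed JSON files.
CREATE TABLE raw_events (
  id bigint,
  message text
) WITH (
  connector = 'filesystem',
  type = 'source',
  path = 's3://awesome-arroyo-bucket/amazing-arroyo-dir',
  format = 'json',
  compression_format = 'gzip',              -- only applies to JSON input
  'source.regex-pattern' = '.*\.json\.gz$'  -- assumed file naming convention
);
```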

File System Sink DDL

Here’s an example of how to create a table that writes Parquet to S3 with partitioning:

CREATE TABLE bids (
  auction bigint,
  bidder bigint,
  price bigint,
  datetime timestamp,
  region text,
  account_id text
) WITH (
  connector = 'filesystem',
  type = 'sink',
  path = 'https://s3.us-west-2.amazonaws.com/demo/s3-uri',
  format = 'parquet',
  parquet_compression = 'zstd',
  rollover_seconds = 60,
  time_partition_pattern = '%Y/%m/%d/%H',
  partition_fields = 'region,account_id'
);

File System Source DDL

Here’s an example of how to create a table that reads Parquet from S3:

CREATE TABLE bids (
  auction bigint,
  bidder bigint,
  price bigint,
  datetime timestamp,
  region text,
  account_id text
) WITH (
  connector = 'filesystem',
  type = 'source',
  path = 'https://s3.us-west-2.amazonaws.com/demo/s3-uri',
  format = 'parquet',
  'source.regex-pattern' = '.*\.parquet$'
);
