UDFs
User-defined functions
Arroyo SQL can be extended with user-defined functions. Currently, only Rust UDFs are supported. UDFs can be defined as part of the SQL API call, or via the Web UI.
UDFs are defined as Rust functions. The parameters and return type of the UDF are determined by the definition of the function. All types must be valid SQL data types.
For String and Binary types (TEXT
BYTEA
in SQL), UDFs use the reference type
(&str
and &[u8]
) for arguments and the owned types (String
and Vec<u8>
)
for return values.
Arroyo UDFs are annotated with the #[udf]
in the arroyo_udf_plugin crate.
Here’s an example of a simple UDF that squares an integer:
use arroyo_udf_plugin::udf;
#[udf]
fn my_square(x: i64) -> i64 {
x * x
}
Once registered, this UDF can be used in SQL queries:
SELECT my_square(auction.bid)
FROM nexmark;
For a complete example of how to use UDFs to solve a real-world problem of custom format parsing, see the UDF tutorial.
Creating UDFs
UDFs are developed and managed in the Web UI, within the Pipeline editor. To create a new UDF, navigate to the UDFs tab and click “New.”
The name for the UDF is determined automatically from the function name that is annotated
with #[udf]
. You may include helper functions, structs, and other definitions so
long as only one function has the #[udf]
annotation.
As you’re developing, you may make use of the “Check” button to validate the UDF definition. Any Rust compilation errors will be included in the errors box at the bottom of the UI.
Nullability
Nullability handling is controlled via the type signature of the UDF parameters and return types. If a parameter is
an Option
type (for example Option<i64>
), then it will be called with all inputs, even if they are NULL
. If the
parameter is not an Option
type (for example i64
), then it will only be called with non-NULL
inputs.
Similarly, if the return type is an Option
type, then the output type is nullable, otherwise it is non-nullable.
In table form:
Input | Parameter type | Return type | Called on | Nullability |
---|---|---|---|---|
Nullable | T | T | Non-null values | Nullable |
Nullable | Option<T> | T | All values | Non-null |
Nullable | T | Option<T> | Non-null values | Nullable |
Nullable | Option<T> | Option<T> | All values | Nullable |
Non-null | T | T | All values | Non-null |
Non-null | Option<T> | T | All values | Non-null |
Non-null | T | Option<T> | All values | Nullable |
Non-null | Option<T> | Option<T> | All values | Nullable |
Dependencies
UDFs can depend on external crates. To add dependencies, you can define a special comment in the UDF definition like this:
/*
[dependencies]
regex = "1.10.2"
*/
use arroyo_udf_plugin::udf;
use regex::Regex;
#[udf]
fn my_regex_matches(s: &str) -> bool {
let re = Regex::new(r"test").unwrap();
re.is_match(s)
}
Internally, the contents of the [dependencies]
comment are used to generate a Cargo.toml
file for the UDF.
See the Cargo.toml reference
for more details on the syntax.
Dependencies may also include environment variables, which will be substituted at compile time:
/*
[dependencies]
my-repo = { git = "https://{{ GITHUB_USER }}:{{ GITHUB_TOKEN }}@github.com/{{ GITHUB_ORG }}/my-repo.git" }
*/
Global UDFs
UDFs may be either local—associated with a particular pipeline—or global, in which case they are stored in the database and may be used in any pipeline.
All UDFs begin as local. If you’ve written a UDF that will benefit from being globally available, you can make it global by hovering over the tab containing the UDF, and clicking the “Globalize” button in the drop down.
Global UDFs are available in all pipelines, but will be shadowed by local UDFs with the same name.
Performance considerations
UDFs are expected to run and return quickly (think microseconds, not milliseconds). This means they shouldn’t do any blocking work (like making network requests) or CPU-intensive computations. Within the Arroyo dataflow, UDFs are treated as normal scalar functions. They run serially against a batch of data, and while they run forward progress is blocked.
If you need to use IO, CPU-intensive computations, or other usecases for long-running UDFs, see async UDFs.
Debugging UDFs
If UDFs panic, they will produce log lines in the worker logs. These are not currently forwarded to the Web UI, so you
will need to find the logs where the workers are running. For the default process
scheduler, this will be in the
controller logs. For the kubernetes
scheduler, this will be within the actual worker pods.