Arroyo user-defined functions (UDFs) can be written in Python. UDFs are defined as Python functions, and registered with Arroyo using the @udf decorator.

Python UDFs are available in Arroyo 0.12.0 and later. Currently only scalar UDFs are supported in Python, although support for UDAFs and async UDFs is planned for future releases.

Here’s an example of a simple UDF that squares an integer:

from arroyo_udf import udf

@udf
def format_line(x: int, y: float) -> str:
    return f"{x} {y:.2f}"

For more general details about UDFs, see the UDF overview.

UDF signature

The signature of the UDF (the function name, parameters, and return type) is determined automatically from the definition of the function annotated with @udf. The function must be a valid Python function, and the parameters and return type must be Python types that have a SQL mapping. For the full list of supported types, see the SQL data types.

In order to determine the types of the UDF parameters and return value, Arroyo expects Python type hints.

Note that because Python does not have as many numeric types as SQL, multiple SQL types may map to the same Python type. For example, INT and BIGINT both map to int, and FLOAT and DOUBLE both map to float.

Nullability

SQL values are generally allowed to be null. How null values interact with your UDF is controlled via the type signature of the UDF parameters and return types. If a parameter is an Optional type (for example Optional[int]), then it will be called with all inputs, even if they are NULL. If the parameter is not an Optional type (for example int), then it will only be called with non-NULL inputs.

Similarly, if the return type is an Optional type, then the output type is nullable, otherwise it is non-nullable.

In table form:

InputParameter typeReturn typeCalled onNullability
NullableTTNon-null valuesNullable
NullableOptional[T]TAll valuesNon-null
NullableTOptional[T]Non-null valuesNullable
NullableOptional[T]Optional[T]All valuesNullable
Non-nullTTAll valuesNon-null
Non-nullOptional[T]TAll valuesNon-null
Non-nullTOptional[T]All valuesNullable
Non-nullOptional[T]Optional[T]All valuesNullable

Note that the Optional type is from the Python typing module and must be imported to use, like this:

from typing import Optional

@udf
def supply_default(x: Optional[int]) -> int:
    if x is None:
        return None
    return x * x

Python Environment

Arroyo will use the Python environment that is available on the system where the Arroyo controller and workers are running. This may be system Python or a virtualenv. If using a virtualenv, activate it before starting the Arroyo cluster.

If running Arroyo via the official Docker image, the Python environment is already provided by the image.

Minimum Python version

In order to enable high levels of scalability when running multiple Python UDFs across parallel subtasks, Arroyo relies on the new sub-interpreter feature in Python 3.12. As a result, earlier versions of Python are not supported.

Dependencies

Arroyo currently does not support installing dependencies dynamically for Python UDFs. If your UDF requires additional Python packages, you must install them manually in the Python environment where Arroyo is running. This is easiest done by using a virtualenv.

If you are interested in using libraries with Python, please get in touch with us via email at support@arroyo.systems or on Discord and we can help you with your use case.