bytewax.duckdb#

Bytewax DuckDB Sink implementation.

This module provides a robust sink for writing data to a DuckDB or MotherDuck database using Bytewax’s streaming data processing framework. The sink manages database connections, table creation, and batch writing for high-volume data flows.

Classes: DuckDBSink: A fixed partitioned sink that defines the target DuckDB or MotherDuck database and manages partition setup. DuckDBSinkPartition: A stateful partition that handles the actual data writing to the DuckDB or MotherDuck tables.

Usage: - Use the DuckDBSink class to configure the connection to the target database, specify table details, and initialize the sink for Bytewax dataflows. - The DuckDBSinkPartition class manages the writing of data in batches and executes custom SQL statements to create tables if specified.

When working with this sink in Bytewax, you can use it to process data in batch and write data to a target database or file in a structured way. However, there’s one essential assumption you need to know: the sink expects data in a specific tuple format, structured as:

1("key", List[Dict])

Where

"key": The first element is a string identifier for the batch. Think of this as a “batch ID” that helps to organize and keep track of which group of entries belong together. Every batch you send to the sink has a unique key or identifier.

List[Dict]: The second element is a list of dictionaries. Each dictionary represents an individual data record, with each key-value pair in the dictionary representing fields and their corresponding values.

Together, the tuple tells the sink: “Here is a batch of data, labeled with a specific key, and this batch contains multiple data entries.”

This format is designed to let the sink write data efficiently in batches, rather than handling each entry one-by-one. By grouping data entries together with an identifier, the sink can:

  • Optimize Writing: Batching data reduces the frequency of writes to the database or file, which can dramatically improve performance, especially when processing high volumes of data.

  • Ensure Atomicity: By writing a batch as a single unit, we minimize the risk of partial writes, ensuring either the whole batch is written or none at all. This is especially important for maintaining data integrity.

Warning: This module requires a commercial license for non-prototype use with business data. Set the environment variable BYTEWAX_LICENSE=1 to suppress the warning message in production environments.

Logging: Python’s logging library is used to log essential events, such as connection status, batch operations, and any error messages.

Note: For further examples and usage patterns, refer to the Bytewax DuckDB documentation.

Submodules#

Data#

MOTHERDUCK_SCHEME#

Classes#

class DuckDBSinkPartition(
db_path: str,
table_name: str,
create_table_sql: Optional[str],
resume_state: None,
)#
Bases:

Stateful sink partition for writing data to either local DuckDB or MotherDuck.

Initialization

Initialize the DuckDB or MotherDuck connection, and create tables if needed.

Note: To connect to a MotherDuck instance, ensure to:

  1. Create an account https://app.motherduck.com/?auth_flow=signup

  2. Generate a token https://motherduck.com/docs/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/)

Args: db_path (str): Path to the DuckDB database file or MotherDuck connection string. table_name (str): Name of the table to write data into. create_table_sql (Optional[str]): SQL statement to create the table if the table does not already exist. resume_state (None): Unused, as this sink does not perform recovery.

write_batch(batches: List[V]) None#

Write a batch of items to the DuckDB or MotherDuck table.

Args: batches (List[V]): List of batches of items to write.

snapshot() None#

This sink does not support recovery.

close() None#

Close the DuckDB or MotherDuck connection.

class DuckDBSink(
db_path: str,
table_name: str = 'default_table',
create_table_sql: Optional[str] = None,
)#
Bases:

Fixed partitioned sink for writing data to DuckDB or MotherDuck.

This sink writes to a single output DB, optionally creating it with a create table SQL statement when first invoked.

Initialization

Initialize the DuckDBSink.

Args: db_path (str): DuckDB database file path or MotherDuck connection string. table_name (str): Name of the table to write data into. create_table_sql (Optional[str]): SQL statement to create the table if it does not already exist.

list_parts() List[str]#

Returns a single partition to write to.

Returns: List[str]: List of a single partition key.

build_part(
step_id: str,
for_part: str,
resume_state: None,
) DuckDBSinkPartition#

Build or resume a partition.

Args: step_id (str): The step ID. for_part (str): Partition key. resume_state (None): Resume state.

Returns: DuckDBSinkPartition: The partition instance.

Join our community Slack channel

Need some help? Join our community!

If you have any trouble with the process or have ideas about how to improve this document, come talk to us in the #questions-answered Slack channel!

Join now