bytewax.duckdb#
Bytewax DuckDB Sink implementation.
This module provides a robust sink for writing data to a DuckDB or MotherDuck database using Bytewax’s streaming data processing framework. The sink manages database connections, table creation, and batch writing for high-volume data flows.
Classes: DuckDBSink: A fixed partitioned sink that defines the target DuckDB or MotherDuck database and manages partition setup. DuckDBSinkPartition: A stateful partition that handles the actual data writing to the DuckDB or MotherDuck tables.
Usage:
- Use the DuckDBSink class to configure the connection to the target
database, specify table details, and initialize the sink for Bytewax dataflows.
- The DuckDBSinkPartition class manages the writing of data in batches
and executes custom SQL statements to create tables if specified.
When working with this sink in Bytewax, you can use it to process data in batch and write data to a target database or file in a structured way. However, there’s one essential assumption you need to know: the sink expects data in a specific tuple format, structured as:
1("key", List[Dict])
Where
"key": The first element is a string identifier for the batch.
Think of this as a “batch ID” that helps to organize and keep
track of which group of entries belong together.
Every batch you send to the sink has a unique key or identifier.
List[Dict]: The second element is a list of dictionaries.
Each dictionary represents an individual data record,
with each key-value pair in the dictionary representing
fields and their corresponding values.
Together, the tuple tells the sink: “Here is a batch of data, labeled with a specific key, and this batch contains multiple data entries.”
This format is designed to let the sink write data efficiently in batches, rather than handling each entry one-by-one. By grouping data entries together with an identifier, the sink can:
Optimize Writing: Batching data reduces the frequency of writes to the database or file, which can dramatically improve performance, especially when processing high volumes of data.
Ensure Atomicity: By writing a batch as a single unit, we minimize the risk of partial writes, ensuring either the whole batch is written or none at all. This is especially important for maintaining data integrity.
Warning:
This module requires a commercial license for non-prototype use with
business data. Set the environment variable BYTEWAX_LICENSE=1 to suppress
the warning message in production environments.
Logging: Python’s logging library is used to log essential events, such as connection status, batch operations, and any error messages.
Note: For further examples and usage patterns, refer to the Bytewax DuckDB documentation.
Submodules#
Data#
- MOTHERDUCK_SCHEME#
Classes#
- class DuckDBSinkPartition( )#
- Bases:
Stateful sink partition for writing data to either local DuckDB or MotherDuck.
Initialization
Initialize the DuckDB or MotherDuck connection, and create tables if needed.
Note: To connect to a MotherDuck instance, ensure to:
Create an account https://app.motherduck.com/?auth_flow=signup
Generate a token https://motherduck.com/docs/key-tasks/authenticating-and-connecting-to-motherduck/authenticating-to-motherduck/)
Args: db_path (str): Path to the DuckDB database file or MotherDuck connection string. table_name (str): Name of the table to write data into. create_table_sql (Optional[str]): SQL statement to create the table if the table does not already exist. resume_state (None): Unused, as this sink does not perform recovery.
- class DuckDBSink( )#
- Bases:
Fixed partitioned sink for writing data to DuckDB or MotherDuck.
This sink writes to a single output DB, optionally creating it with a create table SQL statement when first invoked.
Initialization
Initialize the DuckDBSink.
Args: db_path (str): DuckDB database file path or MotherDuck connection string. table_name (str): Name of the table to write data into. create_table_sql (Optional[str]): SQL statement to create the table if it does not already exist.
- list_parts() List[str]#
Returns a single partition to write to.
Returns: List[str]: List of a single partition key.
- build_part( ) DuckDBSinkPartition#
Build or resume a partition.
Args: step_id (str): The step ID. for_part (str): Partition key. resume_state (None): Resume state.
Returns: DuckDBSinkPartition: The partition instance.