One of the powers of Bytewax is the ability to develop dataflows that accumulate state as they process data. By default, this state is just stored in memory and will be lost if the dataflow is restarted or there is a fault. Bytewax provides some tools and techniques to allow you to recover your state and resume the dataflow after a fault.
Not all dataflows are possible to be recovered. There are a few requirements:
Replayable Input - Your data source needs to support re-playing input from a specific point in the past we call the resume state to resume making progress from where the failure happened.
At-least-once Output - Bytewax only provides at-least-once guarantees when sending data into a downstream system when performing a recovery. You should design your architecture to support this via some sort of idempotency.
Metadata around input order for manual input - The
resume_stateis the minimum unit of recovery for inputs using a
ManualInputConfig. You will need to design your input builder to be able to resume from a location of failure. If you're using
KafkaInputConfig, we handle this for you.
Your dataflow needs a few features to enable recovery on Bytewax:
You must design your input builder to be able to resume and replay all data from the
resume_stateonwards. Do not ignore this argument!
Create a recovery config for your workers to know how to connect to the recovery store, the database or file system that will store recovery data. See our API docs for
bytewax.recoveryon our supported options.
Pass the recovery config as the
recovery_configargument to whatever execution entry point (e.g.
cluster_main()) you are using.
Run the dataflow! Bytewax will automatically snapshots of state and the resume epoch to the recovery store.
If a dataflow has failed, follow this high-level game plan to recover and have it resume processing from where it failed:
Terminate the dataflow processes.
Restart your dataflow using the same recovery config. Bytewax will read the input state from the recovery store and resume output from the latest possible location in the stream.
Here are some concrete examples of using the recovery machinery.