Execution#
Bytewax can run a dataflow using multiple processes and/or threads, letting you scale your dataflow to take advantage of multiple cores on a single machine or multiple machines over the network. You can run a dataflow as a Python module, or you can use waxctl, our command line tool, to run a dataflow.
Bytewax does not require a “coordinator” or “manager” process or machine to run a distributed dataflow. All worker processes perform their own coordination.
Workers#
A worker is a thread executing your dataflow. Workers can be grouped into separate processes, but the term refers to the individual threads within them.
Bytewax’s execution model uses identical workers. Workers execute all steps in a dataflow and automatically exchange data to ensure the semantics of the operators.
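For example, a keyed aggregation is only correct if every item with a given key ends up on the same worker; Bytewax performs that exchange automatically. A minimal sketch, assuming the current bytewax.operators API, where count_final counts items per key once a finite flow finishes:
import bytewax.operators as op
from bytewax.dataflow import Dataflow
from bytewax.testing import TestingSource
from bytewax.connectors.stdio import StdOutSink

flow = Dataflow("keyed_count")
inp = op.input("inp", flow, TestingSource(["a", "b", "a"]))
# However many workers read the input, Bytewax routes all items with
# the same key to a single worker before counting.
counts = op.count_final("count", inp, lambda item: item)
op.output("out", counts, StdOutSink())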
Running a dataflow#
The bytewax.run module is executed to run dataflows. To see all of the available runtime options, run the following command:
$ python -m bytewax.run --help
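You can also launch a dataflow from Python code rather than the command line, for example inside a test. A minimal sketch, assuming the run_main helper from bytewax.testing and the simple module defined later in this section:
from bytewax.testing import run_main

from simple import flow

# Runs the flow on a single worker in the current process.
run_main(flow)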
Running with waxctl#
waxctl can be used to run dataflows locally as well as remotely. You will still need a functioning Python installation and Bytewax installed in the environment. You can see the available options with the command:
$ waxctl run --help
Selecting the dataflow#
The first argument passed to the script is a dataflow getter string. The string is in the format <dataflow-module>, where <dataflow-module> points to a Python module containing the dataflow definition.
Let’s work through two examples. Make sure you have installed Bytewax before you begin.
Create a new file named ./simple.py with the following contents:
# In simple.py
import bytewax.operators as op
from bytewax.dataflow import Dataflow
from bytewax.testing import TestingSource
from bytewax.connectors.stdio import StdOutSink

flow = Dataflow("simple")
# Read the numbers 0, 1, 2 from a testing input source.
inp = op.input("inp", flow, TestingSource(range(3)))
# Add one to each item.
plus_one = op.map("plus_one", inp, lambda item: item + 1)
# Write the results to standard output.
op.output("out", plus_one, StdOutSink())
To run this flow, use simple, because creating a file named simple.py results in a module named simple:
$ python -m bytewax.run simple
1
2
3
Running dataflows with waxctl#
The waxctl run command expects only one argument, a path or URI of a Python script:
$ waxctl run simple.py
You can also run a dataflow from code contained in a public GitHub repo with waxctl:
$ waxctl run https://raw.githubusercontent.com/bytewax/bytewax/main/examples/basic.py
Starting a Single Process#
By default, executing bytewax.run will run your dataflow on a single worker in the current Python process. This avoids the overhead of setting up communication between workers/processes, but the dataflow will not gain anything from parallelization.
By changing the -w/--workers-per-process argument, you can spawn multiple workers per process. We can run the previous dataflow with 3 workers using the same file, changing only the command:
$ python -m bytewax.run -w3 simple
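The same multi-worker, single-process run can also be started programmatically. A sketch, assuming bytewax.testing.cluster_main accepts an empty address list to mean a lone process:
from bytewax.testing import cluster_main

from simple import flow

# One process (proc_id 0, no peer addresses) running 3 worker threads.
cluster_main(flow, addresses=[], proc_id=0, worker_count_per_proc=3)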
Adding workers with waxctl#
$ waxctl run simple.py --workers=4
Starting a Cluster of Processes#
If you want to run Bytewax processes on one or more machines on the same network, you can use the -i/--process-id and -a/--addresses parameters.
When you specify the -i and -a flags, you are starting up a single process within a cluster of processes that you are manually coordinating. You will have to run bytewax.run multiple times to start up each process in the cluster individually.
The -a/--addresses parameter is a list of address:port entries for all the processes, each entry separated by a ';'.
When you run single processes separately, you need to assign a unique id to each process. The -i/--process-id should be a number starting from 0 representing the position of its respective address in the list passed to -a.
If, for example, you want to run 2 workers on 2 different machines known on the network via DNS as cluster_one and cluster_two, you should run the first process on cluster_one as follows:
$ python -m bytewax.run simple -i0 -a "cluster_one:2101;cluster_two:2101"
And on the cluster_two machine as:
$ python -m bytewax.run simple -i1 -a "cluster_one:2101;cluster_two:2101"
As before, each process can start multiple workers with the -w flag for increased parallelism. To start the same dataflow with a total of 6 workers:
$ python -m bytewax.run simple -w3 -i0 -a "cluster_one:2101;cluster_two:2101"
And on the cluster_two machine as:
$ python -m bytewax.run simple -w3 -i1 -a "cluster_one:2101;cluster_two:2101"
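The same cluster can be wired up from Python. A sketch, assuming bytewax.testing.cluster_main and the same two hosts; run it on cluster_one with proc_id=0 and on cluster_two with proc_id=1, passing an identical address list on both machines:
from bytewax.testing import cluster_main

from simple import flow

addresses = ["cluster_one:2101", "cluster_two:2101"]
# On cluster_two, change proc_id to 1.
cluster_main(flow, addresses=addresses, proc_id=0, worker_count_per_proc=3)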
Scaling processes with waxctl#
It is simple to start multiple processes with waxctl, as it will configure the addresses for you.
$ waxctl run simple.py --processes=2 --workers=4
You can specify the initial port with the --initial-port flag and add --debug for verbose output.
$ waxctl run simple.py --processes=2 --workers=4 --initial-port=2101 --debug
For more information about deployment options for Bytewax dataflows, please see Deployment Overview.