Using `waxctl` to Run your Dataflow on Kubernetes#

Waxctl helps you run and manage your Bytewax dataflows in Kubernetes. It uses the current kubectl context configuration, so you need kubectl configured to access the desired Kubernetes cluster for your dataflows.

If you don’t have kubectl installed yet, see the kubectl documentation.
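
Because Waxctl picks up the current kubectl context, it can help to confirm which cluster and namespace kubectl points at before deploying anything. The following is a minimal sketch using standard kubectl commands; `show_kube_target` is a hypothetical helper name and not part of Waxctl.

```shell
# Print the kubectl context and namespace that Waxctl would pick up.
# show_kube_target is a hypothetical helper name, not part of Waxctl.
show_kube_target() {
  if ! command -v kubectl >/dev/null 2>&1; then
    echo "kubectl not found in PATH"
    return 1
  fi
  # Fall back to a placeholder when no context is configured.
  context=$(kubectl config current-context 2>/dev/null) || context="(none)"
  # An empty namespace in the kubeconfig means the "default" namespace.
  namespace=$(kubectl config view --minify --output 'jsonpath={..namespace}' 2>/dev/null)
  echo "context:   ${context}"
  echo "namespace: ${namespace:-default}"
}
```

Calling `show_kube_target` before a deploy shows where the dataflow would land if you don't pass an explicit namespace.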

Installation#

Installing Waxctl is simple: download the binary that matches your operating system and architecture.

Dataflow Lifecycle#

Waxctl lets you manage the entire lifecycle of a dataflow program, which includes these phases:

Deployment
Getting Status
Modification
Deletion

The following sections cover each phase.

Available Commands#

For now, Waxctl has only one command, dataflow. It is also aliased as df, so you can use either form.

You can manage the entire lifecycle of a dataflow in Kubernetes. To do so, the dataflow command has the following sub-commands:

delete
deploy
list

Running waxctl dataflow --help shows further details for these sub-commands:

$ waxctl dataflow --help
Manage dataflows in Kubernetes.

Usage:
  waxctl dataflow [command]

Aliases:
  dataflow, df

Available Commands:
  delete      delete a dataflow
  deploy      deploy a dataflow to Kubernetes creating or upgrading it resources
  list        list dataflows deployed

Flags:
  -h, --help   help for dataflow

Global Flags:
      --debug   enable verbose output

Use "waxctl dataflow [command] --help" for more information about a command.

Waxctl and Namespaces#

Waxctl handles namespaces much like other Kubernetes tools such as Helm: if you don’t specify a namespace, it uses the current kubectl namespace.

You can specify an explicit namespace in every Waxctl command using the --namespace flag.

Deploying a Dataflow#

To deploy a dataflow, run dataflow deploy and pass the path of your Python script as an argument. If you don’t set the --name flag, Waxctl uses the default dataflow name, which is bytewax.

In this example we deploy a dataflow called my-dataflow in the current namespace, which is bytewax in our case:

$ waxctl df deploy /var/bytewax/examples/basic.py --name my-dataflow
Dataflow my-dataflow deployed in bytewax namespace.

In the above example, Waxctl used the default values for every flag except --name. The tool lets you configure a wide range of characteristics of your dataflow.

You can see which flags are available in the dataflow deploy help:

$ waxctl df deploy --help
Deploy a dataflow to Kubernetes using the Bytewax helm chart.

The resources are going to be created if the dataflow doesn't exist or
upgraded if the dataflow is already deployed in the Kubernetes cluster.

The deploy command expects only one argument, a path of a file which could be a
python script or a tar file which must contain your script file (and normally one
or more files needed by your script):

Examples:
  # Deploy a dataflow in current namespace running my-script.py file.
  waxctl dataflow deploy my-script.py

  # Deploy a dataflow in my-namespace namespace running my-script.py file.
  waxctl dataflow deploy my-script.py --namespace my-namespace

  # Deploy a dataflow running my-script.py file contained in my-files.tar file.
  waxctl dataflow deploy my-files.tar -f my-script.py

Usage:
  waxctl dataflow deploy [PATH] [flags]

Flags:
      --create-namespace              create namespace if it does not exist (default true)
      --dry-run                       output the manifests generated without installing them to Kubernetes
  -h, --help                          help for deploy
  -s, --image-pull-secret string      kubernetes secret name to pull custom image (default "default-credentials")
  -i, --image-repository string       custom container image repository URI (default "bytewax/bytewax")
  -t, --image-tag string              custom container image tag (default "latest")
  -N, --name string                   name of the dataflow to deploy in Kubernetes (default "bytewax")
  -n, --namespace string              namespace of Kubernetes resources to deploy
  -o, --output-format output-format   output format for --dry-run; can be 'yaml' or 'json' (default yaml)
  -p, --processes int                 number of processes to deploy. (default 1)
  -f, --python-file-name string       python script file to run. Only needed when [PATH] is a tar file
  -w, --workers int                   number of workers to run in each process (default 1)

Global Flags:
      --debug   enable verbose output

Getting Dataflow Information#

You can see which dataflows are deployed in your Kubernetes cluster with the dataflow list sub-command. The output includes details about each dataflow, as in this example:

$ waxctl df ls
[
  {
    "name": "my-dataflow",
    "namespace": "bytewax",
    "containerImage": "bytewax/bytewax:latest",
    "containerImagePullSecret": "default-credentials",
    "pythonScriptFile": "/var/bytewax/basic.py",
    "processes": "1",
    "processesReady": "1",
    "workersPerProcess": "1",
    "creationTimestamp": "2022-04-18T10:51:03-03:00"
  }
]

This is the help text of that command:

$ waxctl df list --help
List dataflows deployed.

Examples:
  # List dataflows in current namespace
  waxctl dataflow list

  # List dataflows in bytewax namespace
  waxctl dataflow ls -n bytewax

  # List dataflows across all namespaces
  waxctl dataflow ls -A

Usage:
  waxctl dataflow list [flags]

Aliases:
  list, ls

Flags:
  -A, --all-namespaces     If present, list the Bytewax dataflows across all namespaces
  -h, --help               help for list
  -n, --namespace string   Namespace of the Bytewax dataflows to list

Global Flags:
      --debug   enable verbose output
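
Since the listing is JSON, it lends itself to simple scripted checks. The sketch below flags dataflows whose processesReady count differs from processes. `not_ready` is a hypothetical helper, and it assumes the line-per-key layout and key order shown in the listing above. The sample input is illustrative, not real output.

```shell
# not_ready is a hypothetical helper, not a Waxctl command.
# It reads the JSON printed by `waxctl df ls` on stdin and prints the name of
# every dataflow whose processesReady differs from processes.
# Assumption: one key per line, with "name" and "processes" appearing before
# "processesReady" in each object, as in the listing format shown above.
not_ready() {
  awk -F'"' '
    $2 == "name"           { name = $4 }
    $2 == "processes"      { want = $4 }
    $2 == "processesReady" { if ($4 != want) print name }
  '
}

# Illustrative sample modeled on the listing format, with one process not ready.
sample='[
  {
    "name": "my-dataflow",
    "namespace": "bytewax",
    "processes": "3",
    "processesReady": "2"
  }
]'
printf '%s\n' "$sample" | not_ready
```

Against a real cluster the same check would be `waxctl df ls | not_ready`. This filter is line-oriented, so a JSON-aware tool would be more robust if the output layout ever changes.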

Updating a Dataflow#

To modify a dataflow’s configuration, run dataflow deploy again with the new flag values.

Continuing with our example, we could set three processes instead of one by running this:

$ waxctl df deploy --name my-dataflow /var/bytewax/examples/basic.py -p 3
Dataflow my-dataflow deployed in bytewax namespace.

If we list the dataflows again, we see the new configuration applied:

$ waxctl df ls
[
  {
    "name": "my-dataflow",
    "namespace": "bytewax",
    "containerImage": "bytewax/bytewax:latest",
    "containerImagePullSecret": "default-credentials",
    "pythonScriptFile": "/var/bytewax/basic.py",
    "processes": "3",
    "processesReady": "3",
    "workersPerProcess": "1",
    "creationTimestamp": "2022-04-18T10:51:03-03:00"
  }
]

You can change any of the flags of your dataflow.

Removing a Dataflow#

To remove the Kubernetes resources of a dataflow, run waxctl dataflow delete and pass the name of the dataflow.

Without the --yes flag, the command only reports the dataflow it found and deletes nothing, which works as a dry run. For our example dataflow:

$ waxctl df rm --name my-dataflow
Dataflow my-dataflow found in bytewax namespace.

Must specify --yes to delete it.

To actually delete the dataflow, add the --yes flag:

$ waxctl df rm --name my-dataflow --yes
Dataflow my-dataflow deleted.

Bytewax Helm Chart#

Waxctl uses a compiled-in Bytewax Helm chart to generate all the manifests except the Namespace and ConfigMap, which it creates by calling the Kubernetes API directly.

More Examples#

This section covers some of the most common scenarios.

Setting Processes and Workers per Process#

Deploying a dataflow running five processes and two workers per process:

$ waxctl dataflow deploy my-script.py \
  --name=cluster \
  --processes=5 \
  --workers=2

Using the available aliases for commands and flags, you can get the same result by running:

$ waxctl df deploy my-script.py -Ncluster -p5 -w2

Note that when you use one-character flags you can omit the = or the space between the flag and the value.

Using a Tar File to Work with a Tree of Directories and Files#

When your Python script needs to read other files, create a tarball containing all the files it needs and pass the path of that tarball as the argument to the deploy command. You also need to add the --python-file-name flag (or its alias -f) with the relative path of your Python script inside the tarball.

$ waxctl df deploy my-tree.tar -f ./my-script.py --name my-dataflow
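
One way to build such a tarball is sketched below. The file names and the temporary directory are illustrative assumptions. The point is to archive relative to the tree root, so that the path passed with -f matches a path inside the archive.

```shell
# Build a small example tree in a temporary directory (hypothetical file names).
workdir=$(mktemp -d)
mkdir -p "$workdir/data"
printf 'print("hello")\n' > "$workdir/my-script.py"
printf 'a,b\n1,2\n' > "$workdir/data/input.csv"

# Archive relative to the tree root so paths inside the tarball start with "./".
tar -cf "$workdir/my-tree.tar" -C "$workdir" ./my-script.py ./data

# List the archive to confirm the relative path to pass with -f.
tar -tf "$workdir/my-tree.tar"

# The tarball could then be deployed as described above:
# waxctl df deploy "$workdir/my-tree.tar" -f ./my-script.py --name my-dataflow
```

Listing the archive before deploying is a cheap way to catch a mismatch between the -f value and the paths stored in the tarball.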

Dry-run Flag#

Run a command with the --dry-run flag to get all the manifests without actually deploying them to the Kubernetes cluster:

$ waxctl df deploy ./basic.py -N dataflow -n new-namespace --dry-run

Setting the -n flag to the name of a non-existent namespace makes Waxctl handle creating it, and because we use the --dry-run flag, the output includes the namespace manifest.

You could also get the output in JSON format:

$ waxctl df deploy ./basic.py -N dataflow -n new-namespace --dry-run -ojson

And if you need to apply the output manifests to Kubernetes you could run the following:

$ waxctl df deploy ./basic.py -N dataflow -n new-namespace --dry-run | kubectl apply -f -
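
If you prefer to look at the manifests before applying them, the pipe above can be split into two steps. The following is a rough sketch: `review_then_apply` is a hypothetical helper, and the summary line assumes the default YAML output carries top-level `kind:` keys, which this page does not confirm.

```shell
# review_then_apply is a hypothetical helper, not a Waxctl command.
# Usage: review_then_apply ./basic.py dataflow new-namespace
review_then_apply() {
  script=$1
  name=$2
  namespace=$3
  manifests=$(mktemp) || return 1
  # Render the manifests to a file without touching the cluster.
  waxctl df deploy "$script" -N "$name" -n "$namespace" --dry-run > "$manifests" || return 1
  # Summarize resource kinds (assumes YAML output with top-level "kind:" keys).
  grep '^kind:' "$manifests" || echo "(no top-level kind: lines found)"
  printf 'Apply these manifests? [y/N] '
  read -r answer
  status=0
  if [ "$answer" = "y" ]; then
    kubectl apply -f "$manifests" || status=$?
  fi
  rm -f "$manifests"
  return "$status"
}
```

Keeping the rendered manifests in a file also makes it possible to inspect them in full before answering the prompt.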

Using a Custom Image from a Private Registry#

Our Including Custom Dependencies in an Image section describes how to create your own Docker image using a Bytewax image as the base.

Let’s assume you created a custom image that you pushed to a private image registry in GitLab. In our example the image URL is:

registry.gitlab.com/someuser/somerepo:bytewax-example

First, we would create a namespace and a Kubernetes docker-registry secret with the registry credentials:

$ export GITLAB_PAT=your-personal-access-token
$ kubectl create ns private
$ kubectl create secret docker-registry gitlab-credentials \
  -n private \
  --docker-server=https://registry.gitlab.com \
  --docker-username=your-email@mail.com \
  --docker-password="$GITLAB_PAT" \
  --docker-email=your-email@mail.com

After that you can deploy your dataflow with these flags:

$ waxctl df deploy ./my-script.py \
  --name df-private \
  --image-repository registry.gitlab.com/someuser/somerepo \
  --image-tag bytewax-example \
  --image-pull-secret gitlab-credentials \
  --namespace private
