Using waxctl to Run your Dataflow on Kubernetes#
Waxctl helps you run and manage your Bytewax dataflows in Kubernetes. It uses the current kubectl context configuration, so you need kubectl configured to access the desired Kubernetes cluster for your dataflows. Check the kubectl documentation if you don't have it installed yet.
Installation#
To install Waxctl, see the installation instructions on the Bytewax website.
Dataflow Lifecycle#
Waxctl allows you to manage the entire dataflow program lifecycle, which includes these phases:
Deployment
Getting Status
Modification
Deletion
The following sections cover each phase.
Available Commands#
For now, Waxctl has only one available command: dataflow. It is also aliased to df and k8s, so any of them can be used.
You can manage the entire lifecycle of a dataflow in Kubernetes. To do so, the dataflow command has the following sub-commands:
delete
deploy
list
Running dataflow --help will show further details for these:
$ waxctl dataflow --help
Waxctl Pro version 0.12.0
Manage dataflows running on Kubernetes.
Waxctl uses the current kubectl context configuration, so you need kubectl configured to access the desired Kubernetes cluster for your dataflows.
Usage:
waxctl dataflow [flags]
waxctl dataflow [command]
Aliases:
dataflow, df, k8s
Available Commands:
delete delete a dataflow
deploy deploy a dataflow to Kubernetes creating or upgrading it resources
list list dataflows deployed
Flags:
-h, --help help for dataflow
Global Flags:
--debug enable verbose output
Use "waxctl dataflow [command] --help" for more information about a command.
Waxctl and Namespaces#
Waxctl behavior leverages namespaces similarly to other kubernetes
tools like helm. If you don’t specify a namespace, the tool will use
the current kubectl
namespace.
You can specify an explicit namespace in every Waxctl command using
the --namespace
flag.
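For example, to list the dataflows in a specific namespace (the namespace name here is just an illustration):
$ waxctl dataflow list --namespace my-namespace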
Deploying a Dataflow#
To deploy a dataflow you just need to run dataflow deploy, passing the path of your Python script as an argument. If left unset, Waxctl will use the default name for your dataflow, which is bytewax.
In our example we are going to deploy a dataflow called my-dataflow in the current namespace, which is bytewax in our case:
$ waxctl df deploy /var/bytewax/examples/basic.py --name my-dataflow
Dataflow my-dataflow deployed in bytewax namespace.
In the above example, Waxctl used the default values for all the flags besides name. The tool allows you to configure a wide range of characteristics of your dataflow.
We can see which flags are available by getting the dataflow deploy help:
$ waxctl df deploy --help
Waxctl Pro version 0.12.0
Deploy a dataflow to Kubernetes using the Bytewax helm chart.
The resources are going to be created if the dataflow doesn't exist or upgraded if the dataflow is already deployed in the Kubernetes cluster.
The deploy command expects only one argument, a path or URI of a file which could be a python script or a tar file which must contain your script file (and normally one or more files needed by your script):
Examples:
# Deploy a dataflow in current namespace running my-script.py file.
waxctl dataflow deploy my-script.py
# Deploy a dataflow in current namespace running dataflow.py file.
waxctl dataflow deploy https://raw.githubusercontent.com/my-user/my-repo/main/dataflow.py
# Deploy a dataflow in my-namespace namespace running my-script.py file.
waxctl dataflow deploy my-script.py --namespace my-namespace
# Deploy a dataflow running my-script.py file contained in my-files.tar file.
waxctl dataflow deploy my-files.tar -f my-script.py
# Deploy a dataflow in current namespace running my-script.py file with the environment variable 'SERVER'.
waxctl dataflow deploy my-script.py -e SERVER=localhost:1433
Usage:
waxctl dataflow deploy [PATH] [flags]
Flags:
-V, --chart-values-file string load Bytewax helm chart values from a file. See more at https://bytewax.github.io/helm-charts
--create-namespace create namespace if it does not exist (default true)
--dry-run output the manifests generated without installing them to Kubernetes
-e, --environment-variables strings environment variables to inject to dataflow container. The format must be KEY=VALUE
-h, --help help for deploy
-l, --image-pull-policy string custom container image pull policy (the value must be Always, IfNotPresent or Never) (default "Always")
-s, --image-pull-secret string kubernetes secret name to pull custom image (default "default-credentials")
-i, --image-repository string custom container image repository URI (default "bytewax/bytewax")
-t, --image-tag string custom container image tag (default "latest")
--job-mode run a Job in Kubernetes instead of an Statefulset. Use this to batch processing
--keep-alive keep the container alive after dataflow ends. It could be useful for troubleshooting purposes
-N, --name string name of the dataflow to deploy in Kubernetes (default "bytewax")
-n, --namespace string namespace of Kubernetes resources to deploy
-o, --output-format output-format output format for --dry-run; can be 'yaml' or 'json' (default yaml)
--platform deploy the dataflow as a bytewax.io/dataflow Custom Resource (requires Bytewax Platform installed)
-p, --processes int number of processes to deploy (default 1)
-f, --python-file-name string python script file to run. Only needed when [PATH] is a tar file
--recovery stores recovery files in Kubernetes persistent volumes to allow resuming after a restart (your dataflow must have recovery enabled: https://bytewax.io/docs/getting-started/recovery)
--recovery-backup back up worker state DBs to cloud storage (must have recovery flag present and provide S3 parameters)
--recovery-backup-interval int System time duration in seconds to keep extra state snapshots around
--recovery-backup-s3-aws-access-key-id string AWS credentials access key id
--recovery-backup-s3-aws-secret-access-key string AWS credentials secret access key
--recovery-backup-s3-k8s-secret string name of the Kubernetes secret storing AWS credentials.
--recovery-backup-s3-url string s3 url to store state backups. For example, s3://mybucket/mydataflow-state-backups
--recovery-parts int number of recovery parts (default 1)
--recovery-single-volume use only one persistent volume for all dataflow's pods in Kubernetes
--recovery-size string size of the persistent volume claim to be assign to each dataflow pod in Kubernetes (default "10Gi")
--recovery-snapshot-interval int system time duration in seconds to snapshot state for recovery
--recovery-storageclass string storage class of the persistent volume claim to be assign to each dataflow pod in Kubernetes
-r, --requirements-file-name string requirements.txt file if needed
-v, --values string load parameter values from a config file (supported formats: JSON, TOML, YAML, HCL, envfile and Java properties)
-w, --workers int number of workers to run in each process (default 1)
--yes confirm the update and restart of the dataflow in the Kubernetes cluster
Global Flags:
--debug enable verbose output
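As a sketch of how several of these flags combine, the following hypothetical deployment injects an environment variable, installs dependencies from a requirements file, and enables recovery (the script name, variable, and values are placeholders):
$ waxctl df deploy my-script.py \
  --name my-recoverable-df \
  -r requirements.txt \
  -e KAFKA_BROKER=broker:9092 \
  --recovery \
  --recovery-parts 2 \
  --recovery-snapshot-interval 30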
Getting Dataflow Information#
You can see which dataflows are deployed in your Kubernetes cluster using the dataflow list sub-command. The output includes details about each dataflow, as we can see in this example:
$ waxctl df ls
Dataflow Namespace Python File Image Processes Source Creation Time
my-dataflow bytewax basic.py bytewax/bytewax:latest 1/1 waxctl 2024-08-20T08:32:40-03:00
This is the help text of that command:
$ waxctl df list --help
Waxctl Pro version 0.12.0
List dataflows deployed.
Examples:
# List dataflows in current namespace
waxctl dataflow list
# List dataflows in bytewax namespace
waxctl dataflow ls -n bytewax
# List dataflows across all namespaces
waxctl dataflow ls -A
Usage:
waxctl dataflow list [flags]
Aliases:
list, ls
Flags:
-A, --all-namespaces If present, list the Bytewax dataflows across all namespaces.
-h, --help help for list
-N, --name string name of the dataflow to display.
-n, --namespace string Namespace of the Bytewax dataflows to list.
-v, --verbose return detailed information of the dataflows.
Global Flags:
--debug enable verbose output
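For example, to get detailed information about a single dataflow by name, you can combine the -N and -v flags shown above:
$ waxctl df ls -N my-dataflow -v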
Updating a Dataflow#
To modify a dataflow's configuration you need to run dataflow deploy again, setting the new desired flags.
Continuing with our example, we could set three processes instead of one by running this:
$ waxctl df deploy --name my-dataflow /var/bytewax/examples/basic.py -p 3 --yes
Dataflow my-dataflow deployed in bytewax namespace.
If we list the dataflows again, we will see the new configuration applied:
$ waxctl df ls
Dataflow Namespace Python File Image Processes Source Creation Time
my-dataflow bytewax basic.py bytewax/bytewax:latest 3/3 waxctl 2024-08-20T08:32:40-03:00
You can change any of the flags of your dataflow.
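For instance, this sketch bumps the workers per process while keeping three processes (we repeat -p 3 on the assumption that unspecified flags fall back to their defaults):
$ waxctl df deploy --name my-dataflow /var/bytewax/examples/basic.py -p 3 -w 2 --yes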
Removing a Dataflow#
To remove the Kubernetes resources of a dataflow you need to run waxctl dataflow delete, passing the name of the dataflow.
To perform a dry run of deleting our example dataflow, we can run this:
$ waxctl df rm my-dataflow
Dataflow my-dataflow found in bytewax namespace.
Must specify --yes to delete it.
And if we want to actually delete the dataflow, we need to add the --yes flag:
$ waxctl df rm my-dataflow --yes
Dataflow my-dataflow deleted.
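If the dataflow was deployed to a different namespace, pass the --namespace flag here as well (the namespace name below is just an illustration):
$ waxctl df delete my-dataflow -n my-namespace --yes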
Bytewax Helm Chart#
Waxctl uses a compiled-in Bytewax Helm chart to generate all the manifests except the Namespace and ConfigMap, which are created by calling the Kubernetes API directly.
More Examples#
This section covers some of the most common scenarios.
Setting Processes and Workers per Process#
Deploying a dataflow running five processes and two workers per process:
$ waxctl dataflow deploy my-script.py \
--name=cluster \
--processes=5 \
--workers=2
Using the available command and flag aliases, you can achieve the same by running:
$ waxctl df deploy my-script.py -Ncluster -p5 -w2
Note that when you use one-character flags you can omit the = or the space between the flag and the value.
Using a Tar File to Work with a Tree of Directories and Files#
When your Python script needs to read other files, you can manage that by creating a tarball with all the needed files and passing the path of that tarball as the argument to the deploy command. You will also need to add the flag --python-file-name (or its alias -f) with the relative path of your Python script inside the tarball.
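For example, you might build the tarball with the standard tar tool (the file names here are hypothetical) and then deploy it:
$ tar -cvf my-tree.tar ./my-script.py ./data/lookup.csv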
$ waxctl df deploy my-tree.tar -f ./my-script.py --name my-dataflow
Dry-run Flag#
Run a command with the --dry-run flag to get all the manifests without actually deploying them to the Kubernetes cluster:
$ waxctl df deploy ./basic.py -N dataflow -n new-namespace --dry-run
Setting the -n flag to the name of a non-existent namespace causes Waxctl to handle creating that resource, and because we use the --dry-run flag, the namespace manifest is included in the output.
You could also get the output in JSON format:
$ waxctl df deploy ./basic.py -N dataflow -n new-namespace --dry-run -ojson
And if you need to apply the output manifests to Kubernetes you could run the following:
$ waxctl df deploy ./basic.py -N dataflow -n new-namespace --dry-run | kubectl apply -f -
Using a Custom Image from a Private Registry - Requires Waxctl Pro#
In our Including Custom Dependencies in an Image section we describe how to create your own Docker image using a Bytewax image as a base.
Let’s assume you created a custom image that you pushed to a private image registry in GitLab. In our example the image URL is:
registry.gitlab.com/someuser/somerepo:bytewax-example
First, we would create a namespace and a Kubernetes docker-registry secret with the registry credentials:
$ export GITLAB_PAT=your-personal-access-token
$ kubectl create ns private
$ kubectl create secret docker-registry gitlab-credentials \
-n private \
--docker-server=https://registry.gitlab.com \
--docker-username=your-email@mail.com \
--docker-password=$GITLAB_PAT \
--docker-email=your-email@mail.com
After that, you can deploy your dataflow with these flags:
$ waxctl df deploy ./my-script.py \
--name df-private \
--image-repository registry.gitlab.com/someuser/somerepo \
--image-tag bytewax-example \
--image-pull-secret gitlab-credentials \
--namespace private
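After deploying, you can verify that the pods pulled the custom image successfully with a standard kubectl check:
$ kubectl get pods -n private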