Using waxctl to Run your Dataflow on GCP#
Waxctl allows you to deploy dataflows on Google Cloud Platform VM instances.
Installation#
To install Waxctl, simply download the binary corresponding to your operating system and architecture here.
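For example, on Linux or macOS you can make the downloaded binary executable and move it onto your PATH (assuming the downloaded file is named waxctl):
$ chmod +x waxctl
$ sudo mv waxctl /usr/local/bin/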
Dataflow Lifecycle#
You can manage the lifecycle of your dataflow with Waxctl across the following phases:
Deployment
Current Status (Running)
Restarting
Deletion
In the following sections we are going to cover each of these phases.
Available GCP Sub-commands#
When using Waxctl with cloud-specific resources, you add the cloud name in front of the sub-command. The gcp command has the following sub-commands:
deploy
list
stop
start
delete
Running waxctl gcp --help will show further details for these.
$ waxctl gcp --help
Manage dataflows running on GCP VM instances.
Prerequisites:
- Enable Compute Engine API
- Enable Cloud Resource Manager API
- Set up Application Default Credentials running: gcloud auth application-default login
Usage:
waxctl gcp [flags]
waxctl gcp [command]
Available Commands:
  delete      delete a VM instance created by waxctl
  deploy      create VM instance running a dataflow
  list        list VM instances created by waxctl
  start       start a VM instance created by waxctl
  stop        stop a VM instance created by waxctl
Flags:
  -h, --help   help for gcp
Global Flags:
      --debug   enable verbose output
Use "waxctl gcp [command] --help" for more information about a command.
Waxctl and Google Cloud CLI#
To use Waxctl on GCP, you first need to install and configure the gcloud CLI. You can follow the instructions here.
After that, you need to set up Application Default Credentials running:
$ gcloud auth application-default login
Finally, you must check if these APIs are enabled:
NAME                                 TITLE
cloudresourcemanager.googleapis.com  Cloud Resource Manager API
compute.googleapis.com               Compute Engine API
To find out, you can run:
$ gcloud services list --enabled
If either of them is not enabled, you will need to run:
$ gcloud services enable SERVICE_NAME
For example, to enable the Compute Engine API you would run:
$ gcloud services enable compute.googleapis.com
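To verify both required APIs in one shot, you can filter the list output with plain shell tools, for example:
$ gcloud services list --enabled | grep -E 'compute|cloudresourcemanager'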
By default, Waxctl is going to use the project configured in the application default credentials that you defined, but you can use another one just by setting the project flag in your Waxctl commands (--project-id for deploy, --projectId for list).
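For example, to list the dataflows running in another project (the project ID here is a placeholder):
$ waxctl gcp list --projectId another-project --zone us-central1-a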
Run a Dataflow in a VM instance#
To deploy a dataflow you just need to run waxctl gcp deploy, passing the path of your python script as an argument and the desired GCP zone using the --zone flag. If you don't set a name with the --name flag, Waxctl will use the default name for your dataflow, which is bytewax.
In our example we are going to deploy a dataflow called my-dataflow:
$ waxctl gcp deploy /var/bytewax/examples/basic.py --name my-dataflow --zone us-central1-a
Created custom role projects/my-project/roles/my_dataflow
Created service account my-dataflow@my-project.iam.gserviceaccount.com
Created policy binding service account my-dataflow@my-project.iam.gserviceaccount.com with role projects/my-project/roles/my_dataflow
Created my-dataflow instance in zone us-central1-a of project my-project running /var/bytewax/examples/basic.py script.
In the above example, Waxctl used the default values for all of the flags except for name and zone. Waxctl allows you to configure a wide range of characteristics of your dataflow.
As you can see in the output above, Waxctl created an IAM role and service account. These allow the VM instance to store StackDriver logs.
We can see the complete list of available flags with the waxctl gcp deploy help command.
$ waxctl gcp deploy --help
Deploy a dataflow to a new GCP VM instance.
The deploy command expects one argument, which is the path of your python dataflow file.
By default, Waxctl creates a role and a service account that will allow the VM instance to store StackDriver logs.
Examples:
# The default command to deploy a dataflow named "bytewax" in a new VM instance running my-dataflow.py file.
waxctl gcp deploy my-dataflow.py --zone us-central1-a
# Deploy a dataflow named "custom" using specific network tags and service account
waxctl gcp deploy dataflow.py --name custom --zone us-central1-a \
--tags "dataflow,production" \
--service-account-email "dataflows-service-account@my-project.iam.gserviceaccount.com"
Usage:
waxctl gcp deploy [PATH] [flags]
Flags:
  -P, --associate-public-ip-address     associate a public IP address to the VM instance (If it's false, need to set up a Cloud NAT in the VPC because VM must access the internet) (default true)
  -e, --extra-labels strings            extra labels to apply to the VM instance. The format must be key=value (both key and value can only contain lowercase letters, numeric characters, underscores and dashes.)
  -h, --help                            help for deploy
  -t, --machine-type string             VM instance machine type to be created (default "n1-standard-1")
  -n, --name string                     name of the VM instance to deploy the dataflow (must consist of lowercase letters (a-z), numbers, and hyphens) (default "bytewax")
  -p, --project-id string               GCP project ID
  -f, --python-file-name string         python script file to run. Only needed when [PATH] is a tar file
  -r, --requirements-file-name string   requirements.txt file if needed
      --save-cloud-config               save cloud-config file to disk for troubleshooting purposes
  -S, --service-account-email string    service account to assign to the VM instance
  -s, --subnetwork-link string          the link of the subnetwork to launch the VM instance into (The format must be: regions/<REGION>/subnetworks/<SUBNETWORK_NAME>)
  -T, --tags strings                    network tags to assign to the VM instance (The format must be: tag1,tag2,tag3)
  -z, --zone string                     GCP zone
Global Flags:
      --debug   enable verbose output
We suggest paying special attention to the requirements-file-name flag: normally you will want to specify a requirements.txt file listing the libraries your dataflow program needs.
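For example, assuming your script sits next to a requirements.txt listing its dependencies (the file names here are placeholders), the deploy could look like:
$ waxctl gcp deploy my-dataflow.py --zone us-central1-a \
    --requirements-file-name requirements.txt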
Default IAM Service Account#
As we mentioned, Waxctl creates an IAM service account to allow your VM instance to store StackDriver logs. In case you need to use a custom IAM service account, the only permission that you need to assign to it is logging.logEntries.create.
We recommend creating a dedicated role having only that permission, perhaps with an explicit name like “Bytewax-Role” or “Waxctl-Role”. Then, you need to bind that role to your custom service account.
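A sketch of that setup with gcloud (role ID, service account name, and project ID are placeholders; note that GCP custom role IDs cannot contain hyphens, so an underscore is used):
# custom role with only the required permission
$ gcloud iam roles create Bytewax_Role --project my-project \
    --permissions logging.logEntries.create
# custom service account
$ gcloud iam service-accounts create dataflows-sa --project my-project
# bind the role to the service account
$ gcloud projects add-iam-policy-binding my-project \
    --member "serviceAccount:dataflows-sa@my-project.iam.gserviceaccount.com" \
    --role "projects/my-project/roles/Bytewax_Role"
You can then pass that service account to waxctl gcp deploy through the --service-account-email flag.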
Getting Dataflow Information#
You can query which dataflows are deployed on VM instances in your GCP account using the waxctl gcp list sub-command. By default the output will be a table with this information:
$ waxctl gcp ls
Dataflow     Python File Name  VM State  Zone           Launch Time (UTC)
my-dataflow  basic-py          RUNNING   us-central1-a  2022-10-13T07:01:34.833-07:00
You can use the --verbose flag to get more details of each dataflow:
$ waxctl gcp ls --verbose
[
{
"instanceId": 4460149001200968897,
"instanceType": "https://www.googleapis.com/compute/v1/projects/my-project/zones/us-central1-a/machineTypes/n1-standard-1",
"instanceState": "RUNNING",
"name": "my-dataflow",
"serviceAccounts": [
"my-dataflow@my-project.iam.gserviceaccount.com"
],
"launchTime": "2022-10-13T07:01:34.833-07:00",
"privateIpAddress": "10.128.0.38",
"pythonFileName": "basic-py",
"publicIpAddress": "34.135.142.229",
"networkTags": [],
"subnetworkLink": "https://www.googleapis.com/compute/v1/projects/my-project/regions/us-central1/subnetworks/default",
"zone": "https://www.googleapis.com/compute/v1/projects/my-project/zones/us-central1-a",
"labels": [
{
"key": "bytewax_io-managed_by",
"value": "waxctl"
},
{
"key": "bytewax_io-waxctl_dataflow_name",
"value": "my-dataflow"
},
{
"key": "bytewax_io-waxctl_managed_role_name",
"value": "my_dataflow"
},
{
"key": "bytewax_io-waxctl_managed_service_account_id",
"value": "110115571217054613003"
},
{
"key": "bytewax_io-waxctl_python_filename",
"value": "basic-py"
},
{
"key": "bytewax_io-waxctl_version",
"value": "0-5-1"
}
],
"consoleDetails": "https://console.cloud.google.com/compute/instancesDetail/zones/us-central1-a/instances/my-dataflow?project=my-project",
"logs": "https://console.cloud.google.com/logs/query;query=%2528resource.type%3D%22gce_instance%22%0Aresource.labels.instance_id%3D%224460149001200968897%22%2529%0Arun-bytewax-dataflow.sh%0A;timeRange=PT10H?project=my-project"
}
]
As you can see, there are a lot of details including links to access instance information and your dataflow logs in the GCP web console.
This is the help text of the list command:
$ waxctl gcp ls --help
List VM instances created by waxctl.
Examples:
# List VM instances running dataflows in us-central1-a zone.
waxctl gcp list --zone us-central1-a
# List VM instance named "my dataflow" in zone us-west1-a.
waxctl gcp list --name "my dataflow" --zone us-west1-a
Usage:
waxctl gcp list [flags]
Aliases:
list, ls
Flags:
  -h, --help               help for list
  -n, --name string        name of the VM instance to find.
  -p, --projectId string   GCP project Id.
  -v, --verbose            return detailed information of the VM instances.
  -z, --zone string        GCP zone.
Global Flags:
      --debug   enable verbose output
Stopping and Starting a Dataflow#
In case you need to pause your dataflow, there are two commands for that: stop and start. These control the state of the VM instance where your dataflow is running.
Following the example, you can stop the dataflow VM instance with this command:
$ waxctl gcp stop --name my-dataflow --zone us-central1-a
Stopping VM instance my-dataflow in us-central1-a zone in project my-project...
VM instance my-dataflow with ID 4460149001200968897 stopped.
So, if you run the list command, you will see something like this:
$ waxctl gcp ls
Dataflow     Python File Name  VM State    Zone           Launch Time (UTC)
my-dataflow  basic-py          TERMINATED  us-central1-a  2022-10-13T07:01:34.833-07:00
You can use the start command to start the VM instance again:
$ waxctl gcp start --name my-dataflow --zone us-central1-a
Starting VM instance my-dataflow in us-central1-a zone in project my-project...
VM instance my-dataflow with ID 4460149001200968897 started.
If you need to change any of the flags of your dataflow, you will have to delete it and deploy it again with the new values.
Removing a Dataflow#
To terminate the VM instance where your dataflow is running, you need to run waxctl gcp delete, passing the name of the dataflow as a parameter.
To do a dry run of deleting our example dataflow, we can run this:
$ waxctl gcp delete --name my-dataflow --zone us-central1-a
VM instance my-dataflow with ID 4460149001200968897 found in us-central1-a zone.
--yes flag is required to delete it.
And if we want to actually delete the dataflow we must add the --yes
flag:
$ waxctl gcp delete --name my-dataflow --zone us-central1-a --yes
Deleting my-dataflow instance from project my-project. It could take some minutes, please wait...
Role my_dataflow deleted.
Service Account my-dataflow@my-project.iam.gserviceaccount.com deleted.
VM instance my-dataflow with ID 4460149001200968897 has been deleted from project my-project.
Note that delete has an alias, rm, just as we used ls earlier as an alias of list. Many of the Waxctl sub-commands have an alias, and you can see them in the help.
How it works internally#
As you can imagine, Waxctl uses the GCP API to manage VM instances, IAM roles, and service accounts.
The operating system of the VM instances created by Waxctl is Ubuntu 20.04 LTS.
Waxctl relies on Cloud-init, a standard multi-distribution method for cross-platform cloud instance initialization. Using Cloud-init, Waxctl configures a Linux service which runs pip install -r /home/ubuntu/bytewax/requirements.txt and then runs your python dataflow program.
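Conceptually, what the service ends up doing is something like the following (a simplified sketch; the actual generated script, run-bytewax-dataflow.sh, may differ):
# install the dataflow dependencies
$ pip install -r /home/ubuntu/bytewax/requirements.txt
# run the uploaded python script (the file name depends on what you deployed)
$ python /home/ubuntu/bytewax/basic.py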
As we mentioned before, you can specify your own requirements.txt file using the --requirements-file-name flag. If you don't, Waxctl is going to include only bytewax as a requirement.
Besides setting up the service that runs your dataflow, Waxctl configures the instance to push syslog logs to StackDriver. With that enabled, you can see your dataflow's stdout and stderr in Log Explorer.
Troubleshooting#
You have two ways to see what’s going on with your dataflow program: viewing logs and connecting to the VM instance.
Logs#
Since Waxctl runs your dataflow program as a Linux service and all syslog logs are sent to StackDriver, you can see your dataflow logs directly in Log Explorer.
When you run waxctl gcp ls --verbose you get a link to Log Explorer in the GCP web console, filtered by your VM instance and your dataflow, like this:
...
"logs": "https://console.cloud.google.com/logs/query;query=%2528resource.type%3D%22gce_instance%22%0Aresource.labels.instance_id%3D%22760540438160841495%22%2529%0Arun-bytewax-dataflow.sh%0A;timeRange=PT10H?project=my-project"
...
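If you prefer the terminal over the web console, the same logs can be queried with gcloud, using the instance ID from your own list output (the values below are from the example above):
$ gcloud logging read \
    'resource.type="gce_instance" AND resource.labels.instance_id="4460149001200968897"' \
    --project my-project --limit 50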
Connecting to the VM Instance#
If your GCP network configuration allows your VM instance to receive traffic on port 22, you can run this command to connect to the instance:
$ gcloud compute ssh \
--zone "us-central1-a" "ubuntu@my-dataflow" \
--project "my-project"
About port 22: by default, Waxctl uses the default subnet of the default VPC (also called a Network) in the zone that you specified in your project. Normally that network has a firewall rule default-allow-ssh which does exactly what its name says: it allows ingress traffic on port 22 from everywhere.
So, if you want to use gcloud compute ssh to connect to your VM instance and you are going to use a custom subnet, you need a similar firewall rule in that network (perhaps without 0.0.0.0/0 as the source CIDR).
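A sketch of such a rule with gcloud (the network name is a placeholder, and 203.0.113.0/24 is a documentation-only CIDR standing in for your own source range):
$ gcloud compute firewall-rules create allow-ssh-restricted \
    --network my-network \
    --allow tcp:22 \
    --source-ranges 203.0.113.0/24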
Once you are connected to your instance, you may want to check:
/home/ubuntu/bytewax - where your requirements and python script are copied.
systemctl status bytewax-dataflow.service - the Linux service that runs your dataflow.
df -H / - file system information.
top - process information.
You can install and use any profiling or debugging tool. You can also modify your script and restart the bytewax-dataflow.service by running:
systemctl restart bytewax-dataflow.service
A Production-like Example#
In case your dataflow needs access to other GCP-managed services, like BigQuery, you will probably want to use your own network tags, IAM configuration, and a custom subnet.
In a production environment, the VM instance should run in a specific subnet, commonly a private one, so we are going to instruct Waxctl not to associate a public IP address and to create the VM instance in a concrete subnet. Note that in that scenario you will need to set up a Cloud NAT in the VPC of that subnet, because the VM needs to access the internet.
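If that VPC has no Cloud NAT yet, one can be created roughly like this (the router, NAT, and network names are placeholders):
$ gcloud compute routers create my-router \
    --network my-network --region us-central1
$ gcloud compute routers nats create my-nat \
    --router my-router --region us-central1 \
    --auto-allocate-nat-external-ips --nat-all-subnet-ip-ranges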
So, this is an example Waxctl command for the described scenario:
$ waxctl gcp deploy --associate-public-ip-address=false \
/var/bytewax/examples/production.py \
--name "production-dataflow" \
--zone us-central1-a \
--requirements-file-name /var/bytewax/examples/requirements.txt \
--extra-labels environment=production \
--service-account-email my-service-account@my-project.iam.gserviceaccount.com \
--tags dataflow,production,web \
--subnetwork-link projects/my-project/regions/us-central1/subnetworks/mysubnet \
--debug
The output of that deploy will look like this:
2022/10/13 15:19:44 Analytics - information to send:
{"waxctl_version":"0.5.1","platform":"amd64","os":"linux","command":"gcp","subcommand":"deploy"}
2022/10/13 15:19:45 Analytics - duration: 861.582911ms
2022/10/13 15:19:45 replace result: production-dataflow-2
2022/10/13 15:19:45 Validating parameters...
2022/10/13 15:19:45 No project ID specified, trying to get default project ID...
2022/10/13 15:19:45 Using project ID found in application default credentials: my-project
2022/10/13 15:19:45 Filter: (labels.bytewax_io-managed_by = waxctl) AND (name = production-dataflow-2)
2022/10/13 15:19:46 Creating cloud-init config file...
Created production-dataflow-2 instance in zone us-central1-a of project my-project running /var/bytewax/examples/production.py script.