Using waxctl to Run your Dataflow on AWS#
As well as facilitating deployment on Kubernetes, Waxctl provides an easy path to deploy dataflows on AWS EC2 instances.
Installation#
To install Waxctl simply download the binary corresponding to your operating system and architecture here.
Dataflow Lifecycle#
You can manage the lifecycle of your dataflow with Waxctl across the following phases:
Deployment
Current Status (Running)
Restarting
Deletion
In the following sections we are going to cover each of these phases.
Available AWS Sub-commands#
When using Waxctl with cloud-specific resources, you add the cloud name in front of the sub-command. The aws command has the following sub-commands:
deploy
list
stop
start
delete
Running waxctl aws --help will show further details for these.
$ waxctl aws --help
Manage dataflows running on AWS EC2 instances
Usage:
waxctl aws [command]
Available Commands:
delete terminate an EC2 instance created by waxctl
deploy create an EC2 instance running a dataflow
list list AWS EC2 instances created by waxctl
start start an EC2 instance created by waxctl
stop stop an EC2 instance created by waxctl
Flags:
-h, --help help for aws
Global Flags:
--debug enable verbose output
Use "waxctl aws [command] --help" for more information about a command.
Waxctl and AWS CLI#
To use Waxctl on AWS you need to have the AWS CLI installed and configured. You can follow the instructions here.
By default, Waxctl is going to use the AWS CLI default region, but you can use another one by setting the --region flag in your Waxctl commands.
Run a Dataflow in an EC2 instance#
To deploy a dataflow, you just need to run waxctl aws deploy, passing the path of your Python script as an argument. If you don't set the --name flag, Waxctl will use the default name for your dataflow, which is bytewax.
In our example we are going to deploy a dataflow called my-dataflow:
$ waxctl aws deploy /var/bytewax/examples/basic.py --name my-dataflow
Created policy arn:aws:iam::111111111111:policy/Waxctl-EC2-my-dataflow-Policy
Created role Waxctl-EC2-my-dataflow-Role
Created my-dataflow instance with ID i-040f98b9160d2d158 running /var/bytewax/examples/basic.py script
In the above example, Waxctl used the default values for all of the flags except for the name flag. Waxctl allows you to configure a wide range of characteristics of your dataflow.
As you can see in the output above, Waxctl created an IAM policy and role, which allow the EC2 instance to store CloudWatch logs and start sessions through Systems Manager.
We can see the complete list of available flags with the waxctl aws deploy help command.
$ waxctl aws deploy -h
Deploy a dataflow to a new EC2 instance.
The deploy command expects one argument, which is the path of your python dataflow file.
By default, Waxctl creates a policy and a role that will allow the EC2 instance to store Cloudwatch logs and start sessions through Systems Manager.
Examples:
# The default command to deploy a dataflow named "bytewax" in a new EC2 instance running my-dataflow.py file.
waxctl aws deploy my-dataflow.py
# Deploy a dataflow named "custom" using specific security groups and instance profile
waxctl aws deploy dataflow.py --name custom \
--security-groups-ids "sg-006a1re044efb2d23" \
--principal-arn "arn:aws:iam::1111111111:instance-profile/my-profile"
Usage:
waxctl aws deploy [PATH] [flags]
Flags:
-P, --associate-public-ip-address associate a public IP address to the EC2 instance (default true)
-m, --detailed-monitoring specifies whether detailed monitoring is enabled for the EC2 instance
-e, --extra-tags strings extra tags to apply to the EC2 instance. The format must be KEY=VALUE
-h, --help help for deploy
-t, --instance-type string EC2 instance type to be created (default "t2.micro")
-k, --key-name string name of an existing key pair
-n, --name string name of the EC2 instance to deploy the dataflow (default "bytewax")
-p, --principal-arn string principal ARN to assign to the EC2 instance
--profile string AWS cli configuration profile
-f, --python-file-name string python script file to run. Only needed when [PATH] is a tar file
--region string AWS region
-r, --requirements-file-name string requirements.txt file if needed
--save-cloud-config save cloud-config file to disk for troubleshooting purposes
-S, --security-groups-ids strings security groups Ids to assign to the EC2 instance
-s, --subnet-id string the ID of the subnet to launch the EC2 instance into
Global Flags:
--debug enable verbose output
We suggest paying special attention to the requirements-file-name flag, because normally you will want to specify a requirements.txt file listing the libraries your dataflow needs.
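For instance, a minimal requirements.txt might contain (package names besides bytewax are illustrative):

```
bytewax
requests
```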
Default IAM Role#
As we mentioned, Waxctl creates an IAM policy and role to allow your EC2 instance to store CloudWatch logs and to start Systems Manager sessions. In case you need to use a custom IAM role, these are the permissions granted by the policy that Waxctl creates:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents",
"logs:DescribeLogStreams"
],
"Resource": "arn:aws:logs:*:*:*"
},
{
"Effect": "Allow",
"Action": [
"ssmmessages:CreateControlChannel",
"ssmmessages:CreateDataChannel",
"ssmmessages:OpenControlChannel",
"ssmmessages:OpenDataChannel",
"ssm:UpdateInstanceInformation"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"s3:GetEncryptionConfiguration"
],
"Resource": "*"
}
]
}
So, your role must have those permissions to keep both features working. We recommend attaching a customer managed policy to your role with only those permissions, under an explicit name like “Bytewax-Policy” or “Waxctl-Policy”.
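If you supply a custom role, you can sanity-check its policy against this list before deploying. A minimal sketch in Python (the missing_actions helper is ours, not part of Waxctl):

```python
import json

# Actions the Waxctl-generated policy grants (taken from the policy
# document above). A custom role needs at least these for CloudWatch
# logging and Systems Manager sessions to keep working.
REQUIRED_ACTIONS = {
    "logs:CreateLogGroup", "logs:CreateLogStream",
    "logs:PutLogEvents", "logs:DescribeLogStreams",
    "ssmmessages:CreateControlChannel", "ssmmessages:CreateDataChannel",
    "ssmmessages:OpenControlChannel", "ssmmessages:OpenDataChannel",
    "ssm:UpdateInstanceInformation",
    "s3:GetEncryptionConfiguration",
}

def missing_actions(policy_json: str) -> set:
    """Return the required actions that the policy's Allow statements lack."""
    granted = set()
    for stmt in json.loads(policy_json).get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        # "Action" may be a single string or a list of strings.
        granted.update([actions] if isinstance(actions, str) else actions)
    return REQUIRED_ACTIONS - granted
```

An empty result means the policy covers everything Waxctl's default policy grants; note this simple check does not expand wildcards like logs:*.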
Getting Dataflow Information#
You can query which dataflows are deployed on EC2 instances in your AWS account using the waxctl aws list sub-command. By default the output will be a table with this information:
$ waxctl aws ls
Dataflow Python File Name VM State Launch Time
my-dataflow /var/bytewax/examples/basic.py running 2022-10-03 13:34:02 +0000 UTC
You can use the --verbose flag to get more details of each dataflow:
$ waxctl aws ls --verbose
[
{
"instanceId": "i-040f98b9160d2d158",
"instanceType": "t2.micro",
"instanceState": "running",
"name": "my-dataflow",
"iamInstanceProfile": "arn:aws:iam::111111111111:instance-profile/Waxctl-EC2-my-dataflow-InstanceProfile",
"keyName": "",
"launchTime": "2022-10-03 13:34:02 +0000 UTC",
"detailedMonitoring": "disabled",
"privateDnsName": "ip-172-31-3-116.us-west-2.compute.internal",
"privateIpAddress": "172.31.3.116",
"pythonFileName": "/var/bytewax/examples/basic.py",
"publicDnsName": "ec2-18-237-167-156.us-west-2.compute.amazonaws.com",
"publicIpAddress": "18.237.167.156",
"securityGroupsIds": [
"sg-1e2c2c27"
],
"subnetId": "subnet-64840439",
"tags": [
{
"key": "bytewax.io/waxctl-version",
"value": "0.5.1"
},
{
"key": "Name",
"value": "my-dataflow"
},
{
"key": "bytewax.io/managed-by",
"value": "waxctl"
},
{
"key": "bytewax.io/waxctl-managed-instance-profile-name",
"value": "Waxctl-EC2-my-dataflow-InstanceProfile"
},
{
"key": "bytewax.io/waxctl-managed-policy-arn",
"value": "arn:aws:iam::111111111111:policy/Waxctl-EC2-my-dataflow-Policy"
},
{
"key": "bytewax.io/waxctl-managed-role-name",
"value": "Waxctl-EC2-my-dataflow-Role"
},
{
"key": "bytewax.io/waxctl-python-filename",
"value": "/var/bytewax/examples/basic.py"
}
],
"consoleDetails": "https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#InstanceDetails:instanceId=i-040f98b9160d2d158",
"logs": "https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logsV2:log-groups/log-group/$252Fbytewax$252Fsyslog/log-events/my-dataflow$3FfilterPattern$3Dbytewax"
}
]
As you can see, there are a lot of details including links to access instance information and your dataflow logs in the AWS web console.
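Because the --verbose output is JSON, it is easy to post-process with standard tools. A minimal sketch, assuming output shaped like the example above (sample values abbreviated):

```python
import json

# Sample shaped like `waxctl aws ls --verbose` output, keeping only
# the fields this sketch uses.
verbose_output = """
[
  {
    "instanceId": "i-040f98b9160d2d158",
    "instanceState": "running",
    "name": "my-dataflow"
  }
]
"""

def summarize(raw: str) -> dict:
    """Map each dataflow name to its (instance ID, state) pair."""
    return {
        df["name"]: (df["instanceId"], df["instanceState"])
        for df in json.loads(raw)
    }

print(summarize(verbose_output))
# {'my-dataflow': ('i-040f98b9160d2d158', 'running')}
```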
This is the help text of the list command:
$ waxctl aws ls --help
List EC2 instances created by waxctl.
Examples:
# List EC2 instances running dataflows.
waxctl aws list
# List EC2 instance named "my dataflow" in region us-east-1.
waxctl aws list --name "my dataflow" --region us-east-1
Usage:
waxctl aws list [flags]
Aliases:
list, ls
Flags:
-h, --help help for list
-n, --name string name of the EC2 instance to find.
--profile string AWS cli configuration profile.
--region string AWS region.
-v, --verbose return detailed information of the EC2 instances.
Global Flags:
--debug enable verbose output
Stopping and Starting a Dataflow#
In case you need to pause your dataflow, there are two commands to manage that: stop and start. These control the state of the EC2 instance where your dataflow is running.
Following the example, you can stop the dataflow EC2 instance with this command:
$ waxctl aws stop --name my-dataflow
EC2 instance my-dataflow with ID i-040f98b9160d2d158 stopped.
So, if you run the list command, you will see something like this:
$ waxctl aws ls
Dataflow Python File Name VM State Launch Time
my-dataflow /var/bytewax/examples/basic.py stopping 2022-10-03 13:34:02 +0000 UTC
And a few seconds later, the list command is going to show the state stopped:
$ waxctl aws ls
Dataflow Python File Name VM State Launch Time
my-dataflow /var/bytewax/examples/basic.py stopped 2022-10-03 13:34:02 +0000 UTC
You can use the start command to start the EC2 instance again:
$ waxctl aws start --name my-dataflow
EC2 instance my-dataflow with ID i-040f98b9160d2d158 started.
You can change any of the flags of your dataflow.
Removing a Dataflow#
To terminate the EC2 instance where your dataflow is running, you need to run waxctl aws delete, passing the name of the dataflow as a parameter.
To run a dry-run delete of our dataflow example we can run this:
$ waxctl aws delete --name my-dataflow
EC2 instance my-dataflow with ID i-040f98b9160d2d158 found in us-west-2 region.
--yes flag is required to terminate it.
And if we want to actually delete the dataflow, we must add the --yes flag:
$ waxctl aws rm --name my-dataflow --yes
Role Waxctl-EC2-my-dataflow-Role deleted.
Policy arn:aws:iam::111111111111:policy/Waxctl-EC2-my-dataflow-Policy deleted.
EC2 instance my-dataflow with ID i-040f98b9160d2d158 has been terminated.
Note that we used rm in the last command, which is an alias of delete. Many of the Waxctl sub-commands have an alias, and you can see them in the help.
How it works internally#
As you can imagine, Waxctl uses the AWS API to manage EC2 instances, IAM policies, and roles.
The operating system of EC2 instances created by Waxctl is Ubuntu 20.04 LTS.
Waxctl relies on Cloud-init, a standard multi-distribution method for cross-platform cloud instance initialization. Using Cloud-init, Waxctl configures a Linux service that runs pip install -r /home/ubuntu/bytewax/requirements.txt and then runs your Python dataflow program.
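The exact unit Waxctl generates is an internal detail, but its behavior is roughly that of a systemd unit of this shape (a hypothetical sketch: the paths and service name come from this page, everything else is assumed):

```ini
# Hypothetical sketch of bytewax-dataflow.service -- the real unit
# created by Waxctl via Cloud-init may differ.
[Unit]
Description=Bytewax dataflow
After=network-online.target

[Service]
WorkingDirectory=/home/ubuntu/bytewax
# Install dependencies, then run the dataflow script.
ExecStartPre=/usr/bin/pip3 install -r /home/ubuntu/bytewax/requirements.txt
ExecStart=/usr/bin/python3 /home/ubuntu/bytewax/basic.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
```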
As we mentioned before, you can specify your own requirements.txt file using the --requirements-file-name flag. If you don't, Waxctl will include only bytewax as a requirement.
Besides setting up the service that runs your dataflow, Waxctl configures the instance to push its syslog logs to CloudWatch. With that enabled, you can see your dataflow's stdout and stderr in CloudWatch.
Troubleshooting#
You have two ways to see what's going on with your dataflow program: viewing logs and connecting to the EC2 instance.
Logs#
Since Waxctl runs your dataflow program as a Linux service and all syslog logs are sent to CloudWatch, you can see your dataflow logs directly in CloudWatch.
When you run waxctl aws ls --verbose you get a link to the CloudWatch Logs viewer in the AWS web console, filtered by your EC2 instance and your dataflow. Like this:
...
"logs": "https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logsV2:log-groups/log-group/$252Fbytewax$252Fsyslog/log-events/my-dataflow$3FfilterPattern$3Dbytewax"
...
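The logs link points at the /bytewax/syslog log group; the console fragment escapes each / twice, with $ standing in for %. If you ever need the raw log group name (for example, to query it with the AWS CLI), a small sketch to undo that escaping:

```python
from urllib.parse import unquote

# Log-group fragment as it appears in the console URL above.
fragment = "$252Fbytewax$252Fsyslog"

# Swap "$" back to "%" and percent-decode twice.
log_group = unquote(unquote(fragment.replace("$", "%")))
print(log_group)  # /bytewax/syslog
```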
Connecting to the EC2 Instance#
By default, Waxctl creates an IAM policy and an IAM role that allow you to use AWS Systems Manager to connect to the EC2 instance. To do that, you need to know the instance ID (you can get it by running waxctl aws ls --verbose) and then run this:
$ aws ssm start-session --target i-0a04d5e18c1df4c90
You may want to check:
/home/ubuntu/bytewax - where your requirements and Python script are copied.
systemctl status bytewax-dataflow.service - the Linux service that runs your dataflow.
df -H / - file system information.
top - process information.
You can install and use any profiling or debugging tool. You can also modify your script and restart the bytewax-dataflow.service by running:
systemctl restart bytewax-dataflow.service
A Production-like Example#
In case your dataflow needs access to other AWS managed services, like MSK, you will probably want to use your own security group and IAM configuration.
In a production environment, the EC2 instance should run in a specific subnet, commonly a private one, so we are going to instruct Waxctl not to associate a public IP address and to create the EC2 instance in a particular subnet.
So, this is an example Waxctl command for the described scenario:
$ waxctl aws deploy my-dataflow.py \
--name=production-dataflow \
--requirements-file-name=requirements.txt \
--instance-type=m5.xlarge \
--security-groups-ids=sg-0eae80fc9370624f7 \
--principal-arn=arn:aws:iam::111111111111:instance-profile/production-profile \
--subnet-id=subnet-03ec94a8d7d22d8e9 \
--debug
The output of that deploy will be like this:
2022/10/05 10:31:05 Analylics - information to send:
{"waxctl_version":"0.5.1","platform":"amd64","os":"linux","command":"aws","subcommand":"deploy"}
2022/10/05 10:31:05 Analylics - duration: 942.50645ms
2022/10/05 10:31:05 Validating parameters...
2022/10/05 10:31:05 Getting AWS region us-west-2 from default configuration
2022/10/05 10:31:05 Getting AWS CLI configuration...
2022/10/05 10:31:07 Creating cloud-init config file...
2022/10/05 10:31:08 Selected AMI - Name: ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20220924 - ImageId: ami-07eeacb3005b9beae
Created production-dataflow instance with ID i-0409c71c2cc1b249f running my-dataflow.py script