Many customers use Amazon EC2 Spot Instances to save on their compute cost. In this step you will learn how to use Spot Instances to run batch jobs and how to configure your AWS Batch Jobs to handle infrastructure events like Spot interruption.
Amazon EC2 Spot Instances offer spare compute capacity in the AWS cloud at steep discounts (up to 90%) compared to On-Demand instances. Spot Instances enable you to optimize your costs on the AWS cloud and scale your application’s throughput up to 10X for the same budget.
Spot Instances can be interrupted by EC2, with a two-minute warning, when EC2 needs the capacity back. You can use Spot Instances for various fault-tolerant and flexible applications, such as big data, containerized workloads, high-performance computing (HPC), stateless web servers, rendering, CI/CD and other test & development workloads.
This step depends on the previous one. If you haven’t completed it, please go to Job Dependencies and complete it first. In this step you will re-run the last exercise, Submit Leader and Follower jobs with a dependency, but this time with a separate job queue and a new compute environment that uses Spot Instances.
In this step, you will create a new compute environment that uses Spot Instances to save on compute cost and to let you run your batch jobs at a larger scale when needed. You will use the AWS CLI this time to learn more about how to interact with AWS Batch from the command line.
Compute environments are Amazon ECS clusters consisting of one or more EC2 instance types. You can define them by listing instance types, or simply by the number of vCPUs you want to use to run your jobs. For more information on compute environments, see Compute Environments.
cd ~/environment/
export SUBNETS="$(aws ec2 describe-subnets | jq '.Subnets[].SubnetId'| sed '$!s/$/,/')"
export SECURITY_GROUP="$(aws ec2 describe-security-groups --filters Name=group-name,Values=default | jq '.SecurityGroups[].GroupId')"
cat > ce-spot.json << EOF
{
"computeEnvironmentName": "stress-ng-ce-spot",
"type": "MANAGED",
"state": "ENABLED",
"computeResources": {
"type": "SPOT",
"allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
"minvCpus": 0,
"maxvCpus": 256,
"desiredvCpus": 0,
"instanceTypes": [
"c5",
"m5",
"r5",
"optimal"
],
"subnets": [
${SUBNETS}
],
"securityGroupIds": [
${SECURITY_GROUP}
],
"instanceRole": "ecsInstanceRole",
"tags": {
"Name": "stress-ng batch spot"
},
"bidPercentage": 100
},
"tags": {
"Name": "stress-ng batch spot"
}
}
EOF
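The sed expression in the SUBNETS export is easy to misread, so here is a minimal sketch of what it does, run against made-up subnet IDs instead of a live describe-subnets call. It appends a comma to every line except the last, which is what keeps the expanded list valid inside the JSON array. The IDs and the SAMPLE_SUBNETS variable name are hypothetical.

```shell
# Stand-in for the jq output: one quoted subnet ID per line (hypothetical IDs).
sample_ids='"subnet-aaa111"
"subnet-bbb222"
"subnet-ccc333"'

# '$!' restricts the substitution to every line except the last one,
# and 's/$/,/' appends a comma at the end of each of those lines.
SAMPLE_SUBNETS="$(printf '%s\n' "$sample_ids" | sed '$!s/$/,/')"

echo "$SAMPLE_SUBNETS"
```

Without filters, describe-subnets returns the subnets of every VPC in the Region, so it may be worth echoing ${SUBNETS} and ${SECURITY_GROUP} before generating ce-spot.json.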
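Let’s go through the file you just created and note the following. The compute resource type is SPOT, so AWS Batch launches Spot Instances for this environment. The allocationStrategy is SPOT_CAPACITY_OPTIMIZED, which favors instance types that are less likely to be interrupted. minvCpus is 0, so the environment can scale down to zero when no jobs are queued, and maxvCpus caps it at 256 vCPUs. instanceTypes lists several instance families, which gives AWS Batch more Spot capacity pools to choose from. bidPercentage is 100, so instances are launched as long as the Spot price does not exceed the On-Demand price for that instance type.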
Now let’s create the compute environment using the configuration file you just created.
aws batch create-compute-environment --cli-input-json file://ce-spot.json
aws batch describe-compute-environments | jq '.computeEnvironments[] |select(.computeEnvironmentName=="stress-ng-ce-spot")'
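A new compute environment takes a short while to become usable, so it can be convenient to poll instead of re-running the describe command by hand. The following is a minimal sketch, not part of the workshop: wait_until and ce_status are hypothetical helper names, and the retry count and delay in the example call are arbitrary.

```shell
# wait_until EXPECTED MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until its output equals EXPECTED; returns 1 if it never does.
wait_until() {
  expected="$1"; max_tries="$2"; delay="$3"
  shift 3
  try=0
  while [ "$try" -lt "$max_tries" ]; do
    if [ "$("$@")" = "$expected" ]; then
      return 0
    fi
    try=$((try + 1))
    sleep "$delay"
  done
  return 1
}

# Prints only the status field of the Spot compute environment.
ce_status() {
  aws batch describe-compute-environments \
    --compute-environments stress-ng-ce-spot \
    --query 'computeEnvironments[0].status' --output text
}

# Example call (commented out so that pasting the block has no side effects):
# wait_until VALID 30 10 ce_status && echo "compute environment is ready"
```

The polling loop takes the status command as an argument, so the same helper could be reused for any other check that prints a single value.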
Let’s continue to the next step to set up a new job queue.
In this step, you will set up a job queue. You submit your jobs to a queue and they are dispatched to compute environment(s) in order of priority. If you want to learn more about job queues, see Job Queues.
aws batch create-job-queue --state ENABLED --job-queue-name stress-ng-queue-spot --priority 1 --compute-environment-order order=1,computeEnvironment=stress-ng-ce-spot
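Note the following in the last command: --priority sets the priority of this queue relative to other job queues that share a compute environment, and queues with a higher integer value are evaluated first. --compute-environment-order maps the queue to the stress-ng-ce-spot compute environment; when a queue lists several compute environments, they are tried in ascending order.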
Once the job queue is created, run this command to check that its state is ENABLED and its status is VALID.
aws batch describe-job-queues --query 'jobQueues[*].[jobQueueName,state,status]' --output table
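The other resources in this workshop are described in JSON input files, and the job queue can be expressed the same way. The following is a sketch of an input file equivalent to the create-job-queue command used earlier in this step; the file name job-queue-spot.json is an assumption chosen for illustration. The create call is left commented out because the queue already exists at this point.

```shell
# Hypothetical file name; the keys mirror the flags used on the command line.
cat > job-queue-spot.json << 'EOF'
{
  "jobQueueName": "stress-ng-queue-spot",
  "state": "ENABLED",
  "priority": 1,
  "computeEnvironmentOrder": [
    {
      "order": 1,
      "computeEnvironment": "stress-ng-ce-spot"
    }
  ]
}
EOF

# Equivalent to the flag-based command; commented out to avoid a duplicate queue.
# aws batch create-job-queue --cli-input-json file://job-queue-spot.json
```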
Continue to the next step to set up a new job definition.
As in the previous example, you will submit the two jobs, Leader and Follower, and specify a dependency such that the Leader job runs first and the Follower array job starts only upon successful completion of the Leader job. You will continue to submit the Leader job to a queue that’s connected to On-Demand EC2 instances, on the assumption that this job can’t tolerate interruption and should therefore not run on Spot Instances.
The difference in this step is that you will submit the Follower jobs to a separate queue that sends jobs to the Spot compute environment you just created. To allow the Follower tasks to restart after a failure caused by a Spot interruption, you will update the job definition as described below.
Execute the following commands to create a new job definition for the Follower job.
As a best practice for handling Spot interruptions, set a retry strategy, as in the command below, so that a task restarts if the instance running it is terminated because of a Spot reclaim. To learn more, see Retry strategies.
cd ~/environment/dependency/follower
export STACK_NAME=BatchWorkshop
export EXECUTION_ROLE="$(aws cloudformation describe-stacks --stack-name $STACK_NAME --output text --query 'Stacks[0].Outputs[?OutputKey == `JobExecutionRole`].OutputValue')"
export EXECUTION_ROLE_ARN=$(aws iam get-role --role-name $EXECUTION_ROLE | jq -r '.Role.Arn')
export FOLLOWER_REPO=$(aws ecr describe-repositories --repository-names stress-ng-follower --output text --query 'repositories[0].[repositoryUri]')
cat > stress-ng-follower-spot-job-definition.json << EOF
{
"jobDefinitionName": "stress-ng-follower-spot-job-definition",
"type": "container",
"containerProperties": {
"image": "${FOLLOWER_REPO}",
"vcpus": 1,
"memory": 1024,
"jobRoleArn": "${EXECUTION_ROLE_ARN}",
"executionRoleArn": "${EXECUTION_ROLE_ARN}"
},
"retryStrategy": {
"attempts": 5,
"evaluateOnExit":
[{
"onStatusReason" :"Host EC2*",
"action": "RETRY"
},{
"onReason" : "*",
"action": "EXIT"
}]
}
}
EOF
aws batch register-job-definition --cli-input-json file://stress-ng-follower-spot-job-definition.json
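To make the retryStrategy in this job definition easier to reason about, here is a simplified local model of its two evaluateOnExit rules. It assumes the rules are checked in order and the first match wins, and it treats the trailing asterisk as a prefix match. The function name decide_action and the sample reason strings are illustrative, and the real service also stops retrying once the configured number of attempts is used up.

```shell
# decide_action STATUS_REASON REASON -> prints the action the rules would pick.
decide_action() {
  status_reason="$1"
  reason="$2"

  # Rule 1: "onStatusReason": "Host EC2*" -> RETRY
  # Matches status reasons that start with "Host EC2", as reported when the
  # instance running the job is terminated.
  case "$status_reason" in
    "Host EC2"*) echo RETRY; return 0 ;;
  esac

  # Rule 2: "onReason": "*" -> EXIT
  # Catch-all: any other failure is treated as final and is not retried.
  echo EXIT
}

# Instance terminated under the job (sample instance ID is made up).
decide_action "Host EC2 (instance i-0123456789abcdef0) terminated." "instance terminated"
# prints: RETRY

# Failure inside the container itself (sample strings are made up).
decide_action "Essential container in task exited" "application error"
# prints: EXIT
```

The point of this ordering is that only infrastructure failures consume retries, while application errors fail fast instead of being repeated up to five times.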
Execute the following commands to create a JSON file of job options for the Follower job. Notice the Job Queue name is set to the Spot queue.
export STACK_NAME=BatchWorkshop
export STRESS_BUCKET="s3://$(aws cloudformation describe-stacks --stack-name $STACK_NAME --output text --query 'Stacks[0].Outputs[?OutputKey == `Bucket`].OutputValue')"
cat <<EOF > ./stress-ng-follower-spot-job.json
{
"jobName": "stress-ng-follower-spot",
"jobQueue": "stress-ng-queue-spot",
"arrayProperties": {
"size": 2
},
"jobDefinition": "stress-ng-follower-spot-job-definition",
"containerOverrides": {
"environment": [
{
"name": "STRESS_BUCKET",
"value": "${STRESS_BUCKET}"
}]
}
}
EOF
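Both JSON files in this step are generated from shell variables, and an empty variable would silently produce a file with a blank image, role, or bucket. The following is a small guard you could run before the cat commands; require_vars is a hypothetical helper, not part of the workshop, and DEMO_VALUE exists only to show a passing call.

```shell
# require_vars NAME... : returns 1 and reports each variable that is unset or empty.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing or empty variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Demonstration with a throwaway variable.
DEMO_VALUE="ok"
require_vars DEMO_VALUE && echo "all variables set"

# In the workshop shell you might instead run (commented out here):
# require_vars FOLLOWER_REPO EXECUTION_ROLE_ARN STRESS_BUCKET || echo "fix the exports first"
```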
### Remove output left in the S3 bucket by earlier runs.
export STRESS_BUCKET="s3://$(aws cloudformation describe-stacks --stack-name $STACK_NAME --output text --query 'Stacks[0].Outputs[?OutputKey == `Bucket`].OutputValue')"
aws s3 rm "${STRESS_BUCKET}" --recursive
### Submit the Leader job and determine its jobID.
cd ~/environment/dependency
export LEADER_JOB=$(aws batch submit-job --cli-input-json file://leader/stress-ng-leader-job.json)
echo "${LEADER_JOB}"
export LEADER_JOB_ID=$(echo ${LEADER_JOB} | jq -r '.jobId')
echo "${LEADER_JOB_ID}"
### Submit the Follower array job with a dependency on the Leader jobID.
export FOLLOWER_JOB=$(aws batch submit-job --cli-input-json file://follower/stress-ng-follower-spot-job.json --depends-on jobId="${LEADER_JOB_ID}",type="N_TO_N" --array-properties size=12)
export FOLLOWER_JOB_ID=$(echo ${FOLLOWER_JOB} | jq -r '.jobId')
echo "${FOLLOWER_JOB_ID}"
aws batch describe-jobs --jobs ${FOLLOWER_JOB_ID}
You will see the dependency on the Leader job in the returned job description. You can also view this dependency by navigating to a member task of the Follower job in the AWS Batch dashboard.
Your Leader job should complete successfully, followed by the Follower job array. Eventually, the output from the 12 tasks of the job array will appear in the S3 bucket.
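The full describe-jobs output is long, so the checks described above can be narrowed with --query. The following sketch wraps three such checks in helper functions; the function names are hypothetical, and the calls assume FOLLOWER_JOB_ID and STRESS_BUCKET are still exported in your shell.

```shell
# Print the dependsOn list of a job (expects a job ID as the first argument).
job_dependencies() {
  aws batch describe-jobs --jobs "$1" \
    --query 'jobs[0].dependsOn' --output json
}

# Print how many children of an array job are in each status.
array_status_summary() {
  aws batch describe-jobs --jobs "$1" \
    --query 'jobs[0].arrayProperties.statusSummary' --output json
}

# List the objects written to the workshop bucket so far.
list_outputs() {
  aws s3 ls "${STRESS_BUCKET}" --recursive
}

# Example calls (commented out so that pasting the block has no side effects):
# job_dependencies "${FOLLOWER_JOB_ID}"
# array_status_summary "${FOLLOWER_JOB_ID}"
# list_outputs
```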
To test how the Follower jobs react when a Spot Instance is interrupted: 1) Resubmit both jobs. 2) Wait for the Follower tasks to be in the RUNNING state. 3) Navigate to the Amazon EC2 console and terminate the instances tagged with “stress-ng batch spot”.
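If you prefer to stay in the terminal, the termination step can also be sketched with the AWS CLI. The helpers below are hypothetical and rest on two assumptions: the instances carry the Name tag from ce-spot.json, and a manual termination is an acceptable stand-in for a real Spot reclaim. Nothing is terminated until you call terminate_spot_instances yourself.

```shell
# Tag value taken from ce-spot.json ("Name": "stress-ng batch spot").
SPOT_TAG_NAME="stress-ng batch spot"

# Print the IDs of running instances that carry the Spot environment's Name tag.
spot_instance_ids() {
  aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=${SPOT_TAG_NAME}" \
              "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text
}

# Terminate those instances to simulate an interruption (no-op if none are running).
terminate_spot_instances() {
  ids="$(spot_instance_ids)"
  if [ -z "$ids" ]; then
    echo "No running instances tagged '${SPOT_TAG_NAME}'"
    return 0
  fi
  # Word splitting on $ids is intentional: one argument per instance ID.
  aws ec2 terminate-instances --instance-ids $ids
}

# Show why each attempt of one array child stopped (child IDs are "<jobId>:<index>").
child_attempt_reasons() {
  aws batch describe-jobs --jobs "$1:$2" \
    --query 'jobs[0].attempts[].statusReason' --output json
}

# Example calls (commented out because the first one is destructive):
# terminate_spot_instances
# child_attempt_reasons "${FOLLOWER_JOB_ID}" 0
```

After the instances are terminated, a retried child task should list more than one attempt, with the earlier attempt's status reason beginning with "Host EC2".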