How to Run GPU Workloads on ECS: Complete Implementation Guide

Introduction

Running GPU workloads on Amazon ECS requires careful planning and configurations that differ from standard CPU-based deployments. Organizations need GPU computing for machine learning training, inference, video processing, and scientific computing, but ECS GPU support comes with important limitations and requirements.

ECS GPU support is only available through EC2 capacity providers, not Fargate. This means you must manage your own compute infrastructure, select appropriate GPU-enabled instance types, and configure the underlying AMI with proper drivers. The process involves setting up capacity providers with user data scripts, configuring task definitions with GPU resource requirements, and ensuring proper driver installation.

This guide covers the complete implementation process for running GPU workloads on ECS, including AMI selection, capacity provider configuration, and task definition setup.

Prerequisites and Limitations

ECS GPU Support Scope:

  • Only available through EC2 capacity providers
  • Fargate does not support GPU workloads
  • Requires GPU-enabled instance types (p3, p4, g4dn, g5 series)
  • AMI must include NVIDIA drivers and the Docker GPU runtime

Instance Type Requirements:

  • p3.2xlarge and larger for Tesla V100 GPUs
  • p4d.24xlarge for A100 GPUs
  • g4dn.xlarge and larger for T4 GPUs
  • g5.xlarge and larger for A10G GPUs

Architecture Overview

Key Configuration Points

Critical GPU Settings:

  1. User Data Script: Must include ECS_ENABLE_GPU_SUPPORT=true in /etc/ecs/ecs.config.
  2. Task Definition: Set a GPU resource requirement on the container (resourceRequirements with type = "GPU" and value = "1").
  3. Capacity Provider: The service must use the GPU capacity provider strategy.
  4. AMI Selection: The Deep Learning AMI includes pre-configured NVIDIA drivers.

Step-by-Step Implementation

Step 1: Select the Right AMI

Choose an AMI that includes NVIDIA drivers and Docker GPU runtime support:

Option 1: AWS Deep Learning AMI

  • Pre-configured with CUDA, cuDNN, Docker, and the NVIDIA Container Toolkit
  • Includes an ECS-optimized agent and supports multiple CUDA versions
  • Available in most regions

Option 2: ECS-Optimized AMI with GPU Support

  • Base ECS-optimized AMI with the NVIDIA drivers and container runtime pre-installed
  • Lighter weight than the Deep Learning AMI
  • Current AMI ID can be looked up via SSM (see the sketch below)

Option 3: Custom AMI

  • Built from scratch with specific driver versions
  • Maximum control over the software stack
  • Highest maintenance overhead
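If you go with Option 2, the AMI ID does not have to be hard-coded per region: AWS publishes the current ECS GPU-optimized AMI ID as a public SSM parameter. A minimal lookup sketch, assuming the Amazon Linux 2 GPU variant (the data source name ecs_gpu_ami is illustrative):

# Look up the current ECS GPU-optimized AMI (Amazon Linux 2 GPU variant) for the region
data "aws_ssm_parameter" "ecs_gpu_ami" {
  name = "/aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended/image_id"
}

# The launch template in Step 2 could then use:
#   image_id = data.aws_ssm_parameter.ecs_gpu_ami.value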

Step 2: Configure Capacity Provider

Create the launch template, Auto Scaling group, capacity provider, and ECS cluster:

# Data source for GPU instance types
data "aws_ec2_instance_type_offerings" "gpu_instances" {
  filter {
    name   = "instance-type"
    values = ["p3.2xlarge", "p4d.24xlarge", "g4dn.xlarge", "g5.xlarge"]
  }
  
  filter {
    name   = "location-type"
    values = ["availability-zone"]
  }
}

# Launch template with GPU support
resource "aws_launch_template" "gpu_ecs" {
  name_prefix   = "gpu-ecs-template"
  image_id      = "ami-0c02fb55956c7d316" # Placeholder: use a GPU-capable AMI ID for your region
  
  instance_type = "p3.2xlarge"
  
  user_data = base64encode(<<-EOF
              #!/bin/bash            
              # CRITICAL: Enable GPU support in ECS config.
              # ECS_CLUSTER must match the cluster name defined below; a literal is
              # used here to avoid a dependency cycle with the cluster resource.
              cat > /etc/ecs/ecs.config <<ECS_CONFIG
              ECS_CLUSTER=gpu-cluster
              ECS_ENABLE_GPU_SUPPORT=true
              ECS_ENABLE_TASK_ENI=true
              ECS_ENABLE_TASK_IAM_ROLE=true
              ECS_ENABLE_CONTAINER_METADATA=true
              ECS_ENABLE_TASK_METADATA=true
              ECS_ENABLE_SPOT_INSTANCE_DRAINING=true
              ECS_ENABLE_AWSLOGS_EXECUTION_ROLE_OVERRIDE=true
              ECS_AWSVPC_BLOCK_IMDS=true
              ECS_ENABLE_TASK_CPU_LIMIT=true
              ECS_ENABLE_TASK_MEMORY_LIMIT=true
              ECS_CONFIG
              
              # NVIDIA toolkit is pre-installed on the selected AMI
              EOF
  )
  
  vpc_security_group_ids = [aws_security_group.ecs_gpu.id]
  iam_instance_profile {
    name = aws_iam_instance_profile.ecs_instance.name
  }
}

# Auto Scaling Group
resource "aws_autoscaling_group" "gpu_ecs" {
  name                = "gpu-ecs-asg"
  desired_capacity    = 1
  max_size            = 10
  min_size            = 1
  target_group_arns   = []
  vpc_zone_identifier = var.subnet_ids
  
  launch_template {
    id      = aws_launch_template.gpu_ecs.id
    version = "$Latest"
  }
  
  tag {
    key                 = "Name"
    value               = "gpu-ecs-instance"
    propagate_at_launch = true
  }
  
  tag {
    key                 = "ECSCluster"
    # Literal value avoids a dependency cycle (cluster -> capacity provider -> ASG)
    value               = "gpu-cluster"
    propagate_at_launch = true
  }
}

# ECS Capacity Provider
resource "aws_ecs_capacity_provider" "gpu" {
  name = "gpu-capacity-provider"
  
  auto_scaling_group_provider {
    auto_scaling_group_arn         = aws_autoscaling_group.gpu_ecs.arn
    managed_termination_protection = "DISABLED"
    
    managed_scaling {
      maximum_scaling_step_size = 1000
      minimum_scaling_step_size = 1
      status                    = "ENABLED"
      target_capacity           = 100
    }
  }
}

# ECS Cluster
resource "aws_ecs_cluster" "gpu" {
  name = "gpu-cluster"
  
  capacity_providers = [aws_ecs_capacity_provider.gpu.name]
  
  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.gpu.name
    weight            = 1
  }
}
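Note: newer versions of the AWS provider deprecate the inline capacity_providers argument on aws_ecs_cluster in favor of a separate association resource. If your provider version warns about it, the wiring looks roughly like this (keep the cluster resource, but drop its capacity_providers and default_capacity_provider_strategy arguments):

# Alternative: associate the capacity provider with the cluster via a dedicated resource
resource "aws_ecs_cluster_capacity_providers" "gpu" {
  cluster_name       = aws_ecs_cluster.gpu.name
  capacity_providers = [aws_ecs_capacity_provider.gpu.name]

  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.gpu.name
    weight            = 1
  }
}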

Step 3: Configure Task Definition and Service

Create a task definition with GPU resources and a service that uses the capacity provider:

task-definition-svc.tf

# Task Definition with GPU resources
resource "aws_ecs_task_definition" "gpu_workload" {
  family                   = "gpu-workload"
  requires_compatibilities = ["EC2"]
  network_mode             = "awsvpc"
  cpu                      = 2048
  memory                   = 8192
  
  execution_role_arn = aws_iam_role.task_execution.arn
  task_role_arn      = aws_iam_role.task_role.arn
  
  container_definitions = jsonencode([
    {
      name  = "gpu-container"
      image = "nvidia/cuda:11.8.0-base-ubuntu20.04"
      
      command = ["/bin/bash", "-c", "nvidia-smi && echo 'GPU workload started' && sleep 3600"]
      
      essential = true
      
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          awslogs-group         = aws_cloudwatch_log_group.gpu_workload.name
          awslogs-region        = data.aws_region.current.name
          awslogs-stream-prefix = "gpu"
        }
      }

      portMappings = [
        {
          containerPort = 8080
          protocol      = "tcp"
        }
      ]
      
      environment = [
        {
          name  = "NVIDIA_VISIBLE_DEVICES"
          value = "all"
        },
        {
          name  = "NVIDIA_DRIVER_CAPABILITIES"
          value = "compute,utility"
        }
      ]
      
      # KEY: GPU resource requirement (the ECS API expects the camelCase key)
      resourceRequirements = [
        {
          type  = "GPU"
          value = "1"
        }
      ]
    }
  ])
}

# ECS Service using the capacity provider
resource "aws_ecs_service" "gpu_service" {
  name            = "gpu-service"
  cluster         = aws_ecs_cluster.gpu.id
  task_definition = aws_ecs_task_definition.gpu_workload.arn
  desired_count   = 1
  
  # KEY: Use the GPU capacity provider
  capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.gpu.name
    weight            = 1
  }
  
  network_configuration {
    subnets          = var.subnet_ids
    security_groups  = [aws_security_group.ecs_gpu.id]
    assign_public_ip = false # public IP assignment is only supported for Fargate tasks
  }
}

# CloudWatch Log Group
resource "aws_cloudwatch_log_group" "gpu_workload" {
  name              = "/ecs/gpu-cluster/gpu-workload"
  retention_in_days = 7
}

# Data source for current region
data "aws_region" "current" {}

# Variables
variable "subnet_ids" {
  description = "Subnet IDs for ECS instances"
  type        = list(string)
}

variable "vpc_id" {
  description = "VPC ID for ECS cluster"
  type        = string
}
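The configuration above references a few IAM and networking resources that are not shown: aws_iam_instance_profile.ecs_instance, aws_iam_role.task_execution, aws_iam_role.task_role, and aws_security_group.ecs_gpu. A minimal sketch of what they could look like, assuming the standard AWS managed policies; tighten the security group rules and the task role permissions to match your workload:

# Container instance role and profile (used by the launch template)
resource "aws_iam_role" "ecs_instance" {
  name = "gpu-ecs-instance-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "ecs_instance" {
  role       = aws_iam_role.ecs_instance.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
}

resource "aws_iam_instance_profile" "ecs_instance" {
  name = "gpu-ecs-instance-profile"
  role = aws_iam_role.ecs_instance.name
}

# Task execution role (pulls images and writes logs on behalf of the task)
resource "aws_iam_role" "task_execution" {
  name = "gpu-task-execution-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "task_execution" {
  role       = aws_iam_role.task_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# Security group for GPU instances and tasks (egress only in this sketch)
resource "aws_security_group" "ecs_gpu" {
  name_prefix = "ecs-gpu-"
  vpc_id      = var.vpc_id

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

aws_iam_role.task_role is left out on purpose: grant it only the permissions your GPU workload actually needs.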

Step 4: Deploy and Monitor

Deployment Commands:

# Initialize and deploy with Terraform
terraform init
terraform plan
terraform apply

Monitoring GPU Usage:

# Check GPU utilization on an instance (requires the SSM agent and instance permissions)
aws ssm send-command \
  --instance-ids i-1234567890abcdef0 \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["nvidia-smi"]'

# Check ECS service status
aws ecs describe-services \
  --cluster gpu-cluster \
  --services gpu-service
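Out of the box, the AWS/ECS CloudWatch namespace only publishes CPU and memory utilization for the service, so GPU usage is typically checked on the instance with nvidia-smi or exported by an agent you run yourself. For basic service health, a CloudWatch alarm on CPU utilization can be added alongside the Terraform above; a minimal sketch (the alarm name and thresholds are illustrative):

# Alarm when the GPU service's average CPU utilization stays above 80% for 10 minutes
resource "aws_cloudwatch_metric_alarm" "gpu_service_cpu_high" {
  alarm_name          = "gpu-service-cpu-high"
  namespace           = "AWS/ECS"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 80
  comparison_operator = "GreaterThanThreshold"

  dimensions = {
    ClusterName = aws_ecs_cluster.gpu.name
    ServiceName = aws_ecs_service.gpu_service.name
  }
}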

Best Practices

  1. Instance Selection: Choose GPU instances based on workload requirements and budget constraints
  2. Driver Management: Use Deep Learning AMI for production workloads to ensure driver compatibility
  3. Resource Planning: Monitor GPU utilization and scale capacity providers accordingly
  4. Cost Optimization: Use Spot instances for non-critical GPU workloads (a Spot launch template sketch follows this list)
  5. Security: Implement proper IAM roles and security groups for GPU instances
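For point 4, Spot capacity can be requested directly in the launch template. A hedged sketch of a Spot variant, reusing the IAM profile, security group, and user data from Step 2 and the AMI lookup sketched in Step 1 (the resource name gpu_ecs_spot is illustrative); interruption handling relies on ECS_ENABLE_SPOT_INSTANCE_DRAINING, which the user data already sets:

# Spot variant of the GPU launch template
resource "aws_launch_template" "gpu_ecs_spot" {
  name_prefix   = "gpu-ecs-spot-template"
  image_id      = data.aws_ssm_parameter.ecs_gpu_ami.value # from the Step 1 lookup sketch
  instance_type = "g4dn.xlarge"

  # Request Spot capacity as one-time instances that terminate on interruption
  instance_market_options {
    market_type = "spot"

    spot_options {
      spot_instance_type             = "one-time"
      instance_interruption_behavior = "terminate"
    }
  }

  # Reuse the Step 2 user data so instances join the GPU cluster with GPU support enabled
  user_data = aws_launch_template.gpu_ecs.user_data

  vpc_security_group_ids = [aws_security_group.ecs_gpu.id]

  iam_instance_profile {
    name = aws_iam_instance_profile.ecs_instance.name
  }
}

Point this template at its own Auto Scaling group and capacity provider so Spot and on-demand GPU capacity can be weighted independently in the service's capacity_provider_strategy.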

Conclusion

Running GPU workloads on ECS requires careful infrastructure planning and configuration. The key components include selecting the right AMI with GPU drivers, configuring capacity providers with proper user data scripts, and defining tasks with GPU resource requirements.

Start with a small GPU cluster using Deep Learning AMI to validate your setup, then scale based on workload requirements. Monitor GPU utilization and costs to optimize your infrastructure over time.

KubeNine Consulting helps organizations implement GPU workloads on ECS and other container platforms. Visit KubeNine—DevOps and Cloud Experts for assistance with your GPU infrastructure implementation.