Designing and Implementing an ECS Cluster on AWS for a Map Server — Part 3 and 4

If you've missed part 1 & 2 of this story, click here to read them!

Part 3: Implementation — Terraform modules

Part 4: Wrapping up — Continuous Deployment

Part 3: Implementation — Terraform modules

We use Terraform to deploy all our infrastructure. Terraform enables you to safely and predictably create, change, and improve infrastructure. It is an open source tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.

In adding new Terraform modules we wanted to consider:

  1. Using Terraform modules we’d already created
  2. Dependencies between modules (e.g. the load balancer depends on SSL Certificate creation which depends on the Route53 DNS name).
  3. Setting up variables and constants so that the new modules can be used for other similar applications. Examples of variables: region, instance type, min/max amount of instances, container image, etc.

The three new modules we made:

  1. TileServer module: the TileServer service itself, sets up the variables and specifies the 2 other modules it will need
  2. ALB module: sets up the Application Load Balancer. From our previous infrastructure we had an ELB (Elastic Load Balancer) module but no ALB module. ALBs support more advanced routing, so traffic coming in on port 80 can be forwarded to the containers.
  3. ECS module: sets up the Elastic Container Service. This module will need to use components created in the ALB module.

TileServer module

In order to create the TileServer module, we had to gather the necessary information as input:

  • Variables: various tags, AMI name, bucket name, etc. (from the tfvars file)
  • Resources: Data and Output from other modules

    • AWS AMI: the most recent ECS optimized ami ID
    • S3 backend: bucket and key for the Terraform state file
    • Network data: VPC ID, Subnets, Availability zones
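
Network data like the VPC ID and subnets typically comes from the state of a previously applied networking stack. A minimal sketch of how that lookup might be wired up with a `terraform_remote_state` data source (the bucket and key below are illustrative, not our actual paths):

```hcl
# Read outputs from a previously applied networking stack
# (bucket and key are illustrative, not the real state paths)
data "terraform_remote_state" "network" {
  backend = "s3"
  config {
    bucket = "example-terraform-state"
    key    = "network/terraform.tfstate"
    region = "${var.region}"
  }
}

# The values can then be passed into the TileServer modules, e.g.:
# vpc_id     = "${data.terraform_remote_state.network.vpc_id}"
# subnet_ids = ["${data.terraform_remote_state.network.subnet_ids}"]
```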

A snippet of the TileServer Terraform module and tfvars file:

#######################################################################
# Declaration of Variables
#######################################################################

variable "region"                     { description = "Region in which this is created" default = "us-east-1" }
variable "ami_name"                   { description = "AMI Name to get the ID for" }
variable "instance_type"              { description = "Instance type" default = "t2.small" }
variable "ssh_key_pair_name"          { description = "Name of the SSH key" }
variable "cluster_name"               { description = "Cluster name - Usually application name and environment" }
variable "tileserver_instances_min"   { description = "Minimum Number of TileServer instances" }
variable "tileserver_instances_max"   { description = "Maximum Number of TileServer instances" }
variable "tag_org"                    { description = "A tag for the org tag" default = "CBRE Build" }
variable "tag_loc"                    { description = "A tag for the loc tag" default = "nyc" }
variable "tag_env"                    { description = "A tag for the env tag" default = "staging" }
variable "tag_appname"                { description = "A tag for the appname tag" default = "tileserver" }
variable "tag_businessgroup"          { description = "A tag for the businessgroup tag" default = "dev" }
variable "tag_service"                { description = "A tag for the service tag" default = "tileserver" }
variable "tag_freetext"               { description = "A tag for the freetext tag" default = "cbre" }
variable "tag_name"                   { description = "A tag for the name tag" default = "" }
variable "tag_domain"                 { description = "A tag for the domain tag" }
variable "tag_branch"                 { description = "A tag for the branch tag" }
variable "shared_services_account_id" { description = "The ID of the shared services account (Can be prod or nonprod)" }
variable "application_port"           { description = "Application port number" }
variable "proxy_port"                 { description = "HAProxy port number" }
variable "application_dns_record"     { description = "Application DNS record" }
variable "s3_bucket"                  { description = "TileServer configuration S3 bucket" }
variable "container_image"            { description = "Container Image" }
variable "container_name"             { description = "Container Name" }
variable "container_volume"           { description = "The Host source volume (mount point) and Container Path (prefixed by /)"}
variable "tileserver_containers_min"  { description = "Minimum Number of TileServer containers" }
variable "tileserver_containers_max"  { description = "Maximum Number of TileServer containers" }

#######################################################################
# Providers
#######################################################################
provider "aws" { region = "${var.region}" }

#######################################################################
# Data & Null Resources
#######################################################################

data "aws_ami" "amazon-ecs-optimized" {
  most_recent = true
  name_regex  = "${var.ami_name}"
  owners      = ["amazon"]
}

#######################################################################
# Backend Configuration
#######################################################################

terraform {
  backend "s3" {
    bucket = "cbre-build-nyc-terraform-state"
    key    = "Services/tileserver-production/terraform.tfstate"
    region = "us-east-1"
  }
}

A Terraform code snippet showing the TileServer module calling the ALB and ECS modules:

#######################################################################
#  Resources
#######################################################################

### Web Application ELB settings
## Creating the Webapp ALB
module "alb" {
  source                 = "git@github.com:floored/infra.git//terraform/modules/alb"
  alb_name               = "${var.cluster_name}"
  alb_internal_bool      = "false"
  subnet_ids             = ["${var.subnet_ids}"]
  target_group_port      = "${var.application_port}"
  instance_protocol      = "HTTP"
  lb_port                = "${var.proxy_port}"
  lb_protocol            = "HTTPS"
  alb_record_name        = "${var.application_dns_record}"
  vpc_id                 = "${var.vpc_id}"
  application_dns_record = "${var.application_dns_record}"
# TAGS #
  tag_org                = "${var.tag_org}"
  tag_loc                = "${var.tag_loc}"
  tag_env                = "${var.tag_env}"
  tag_appname            = "${var.tag_appname}"
  tag_businessgroup      = "${var.tag_businessgroup}"
  tag_service            = "${var.tag_service}"
  tag_freetext           = "${var.tag_freetext}"
  tag_name               = "${var.tag_org}-${var.tag_loc}-${var.tag_env}-${var.tag_service}-${var.tag_freetext}"
  tag_domain             = "${var.tag_domain}"
  tag_branch             = "${var.tag_branch}"
}

# Creating the ECS Cluster
module "ecs" {
  source                     = "git@github.com:floored/infra.git//terraform/modules/ecs"
  container_image            = "${var.container_image}"
  container_name             = "${var.container_name}"
  container_volume           = "${var.container_volume}"
  containers_min             = "${var.tileserver_containers_min}"
  containers_max             = "${var.tileserver_containers_max}"
  cluster_name               = "${var.cluster_name}"
  ssh_key_pair_name          = "${var.ssh_key_pair_name}"
  ami_id                     = "${var.ami_id}"
  subnet_ids                 = ["${var.subnet_ids}"]
  availability_zones         = ["${var.availability_zones}"]
  instance_type              = "${var.instance_type}"
  application_port           = "${var.application_port}"
  alb_sg_id                  = "${module.alb.alb_sg_id}"
  app_target_group_arn       = "${module.alb.app_target_group_arn}"
  instances_min              = "${var.tileserver_instances_min}"
  instances_max              = "${var.tileserver_instances_max}"
  route53_zone_id            = "${module.alb.route53_zone_id}"
  s3_bucket                  = "${var.s3_bucket}"
  vpc_id                     = "${var.vpc_id}"
}

ALB module

In this module we create all the load balancer components:

  • SSL certificate: using ACM - Amazon Certificate Manager
  • Security groups: act as virtual firewalls that control the traffic for one or more instances. You can add rules to each security group to allow traffic to or from its associated instances.
  • Route 53 record: DNS record to point to the Load Balancer DNS Name
  • Target groups: containers register to target groups and expose the application using high ports dynamically. This allows multiple containers to run on the same instance.
List of registered targets in the TileServer ALB Target Group. Note there are 2 unique instance IDs; each one represents an EC2 Instance with 2 containers exposing the application on 2 different dynamically assigned ports.
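
The dynamic-port behavior comes from registering targets by instance and letting ECS assign an ephemeral host port to each container. A hedged sketch of how the target group and HTTPS listener might be declared (the `aws_lb.webapp` and ACM certificate references are illustrative names, not our exact resources):

```hcl
# Target group the ECS service registers containers into.
# With dynamic host port mapping, ECS registers each container
# under the ephemeral port it was assigned, so several containers
# can share one EC2 instance.
resource "aws_lb_target_group" "app_target_group" {
  name     = "${var.alb_name}-tg"
  port     = "${var.target_group_port}"
  protocol = "HTTP"
  vpc_id   = "${var.vpc_id}"

  health_check {
    path = "/"
    port = "traffic-port"   # check each target on its dynamically assigned port
  }
}

# HTTPS listener terminating TLS with the ACM certificate and
# forwarding to the target group.
resource "aws_lb_listener" "front_end" {
  load_balancer_arn = "${aws_lb.webapp.arn}"             # ALB resource, illustrative name
  port              = "${var.lb_port}"
  protocol          = "HTTPS"
  certificate_arn   = "${aws_acm_certificate.cert.arn}"  # ACM cert, illustrative name

  default_action {
    target_group_arn = "${aws_lb_target_group.app_target_group.arn}"
    type             = "forward"
  }
}
```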

As mentioned, some of these components will be used by the ECS module and thus need to be exported as outputs. These outputs are then referenced by using ${module.module_name.output_name}.

Outputs:

  • alb_sg_id: The Load Balancer’s Security group. This security group will have access to the ECS cluster EC2 Instances.
  • app_target_group_arn: The Load Balancer’s Target Group where the containers register.
  • route53_zone_id: Route 53 (AWS DNS Service) Zone ID created by the ALB module
#######################################################################
# Outputs
#######################################################################

output "alb_sg_id" {
  value = "${aws_security_group.webapp_LB.id}"
}

output "app_target_group_arn" {
  value = "${aws_lb_target_group.app_target_group.arn}"
  depends_on = ["aws_lb_listener.front_end"]
}

output "route53_zone_id" {
  value = "${data.aws_route53_zone.webapp.id}"
}

ECS module

The ECS Module is in charge of all the ECS components, which is a long list of components that intertwine and sometimes depend on each other, including:

  • EFS components: Creating the Elastic FileSystem itself and its security groups.
  • EC2 User Data: The “initialization” script that runs when the EC2 Instance starts. This script includes mounting the EFS on the EC2 instance to be used later by the container which runs the application.
  • ECS Task Definition: Defines and configures how the Docker container should run. Some of the definitions include: CPU and memory allocation, container image, data volumes (which is very important to our infrastructure as our TileServer containers use tile map files and configuration hosted on this data volume), etc.
  • ECS Service: Defines the desired number of containers (tasks) and their configuration; the service scheduler is also responsible for maintaining the desired count of containers if they fail or are stopped.
  • ECS Service Autoscaling: Creating the container autoscaling rules that determine the desired number of containers running for our application and the rules for scaling this number in and out. We decided on a desired count of 4 containers, which can scale out to 8 containers under high CPU or memory usage for the service.
  • IAM Roles and Policies: All the necessary ECS and EC2 Instances roles and policies to give the cluster permission to run itself efficiently.
  • EC2 Autoscaling Groups and Launch Configuration: Defining the Autoscaling group for the EC2 instances (separate from the ECS Service Autoscaling) to determine the number of EC2 instances that should host the application. We decided on a desired count of 2 EC2 Instances (hosting 2 containers each), scaling up to 4 EC2 instances based on CPU and Memory Autoscaling policies.
TileServer service view in the ECS Cluster. The “Tasks” bar shows all the service containers and their status.
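
To make the pieces above concrete, here is a trimmed, illustrative sketch of the user data EFS mount, the task definition with its data volume, and the service. Resource names, mount paths, and the memory/port values are assumptions for the sketch, not our exact code:

```hcl
# User data: join the ECS cluster and mount the EFS filesystem on
# boot so containers can bind-mount the tile files
# (EFS resource name and mount path are illustrative).
resource "aws_launch_configuration" "ecs" {
  name_prefix   = "${var.cluster_name}-"
  image_id      = "${var.ami_id}"
  instance_type = "${var.instance_type}"

  user_data = <<EOF
#!/bin/bash
echo ECS_CLUSTER=${var.cluster_name} >> /etc/ecs/ecs.config
mkdir -p /mnt/efs
mount -t nfs4 ${aws_efs_file_system.tiles.dns_name}:/ /mnt/efs
EOF
}

# Task definition: container image, resources, and the data volume
# holding tile files and configuration. "hostPort": 0 requests a
# dynamically assigned host port for each container.
resource "aws_ecs_task_definition" "app" {
  family = "${var.cluster_name}"

  volume {
    name      = "tiles"
    host_path = "/mnt/efs"
  }

  container_definitions = <<EOF
[{
  "name": "${var.container_name}",
  "image": "${var.container_image}",
  "memory": 512,
  "portMappings": [{"containerPort": ${var.application_port}, "hostPort": 0}],
  "mountPoints": [{"sourceVolume": "tiles", "containerPath": "/data"}]
}]
EOF
}

# Service: keeps the desired count of tasks running and registers
# them with the ALB target group under their dynamic host ports.
resource "aws_ecs_service" "app" {
  name            = "${var.cluster_name}"
  cluster         = "${aws_ecs_cluster.main.id}"   # cluster resource, illustrative name
  task_definition = "${aws_ecs_task_definition.app.arn}"
  desired_count   = "${var.containers_min}"

  load_balancer {
    target_group_arn = "${var.app_target_group_arn}"
    container_name   = "${var.container_name}"
    container_port   = "${var.application_port}"
  }
}
```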

Part 4: Wrapping up — Continuous Deployment

After all this is set up, we now have a scalable, robust containerized cluster. The fully automated deployment takes approximately 3½ minutes to create the entire infrastructure!

Our last step is dealing with ongoing updates from our engineers. These can be new configuration (e.g. new map tile files that are ready to be used) or new styles (e.g. colors, fonts, icons). When engineers make these changes, we want to automatically update the TileServer to reflect them.

Updating the TileServer consists of a few steps:

  1. Making and committing changes to the git repo
  2. Verifying the configuration changes, e.g. checking that the JSON is valid
  3. Checking for the existence of the necessary files referenced in the configuration, e.g. map tiles should be stored remotely on AWS EFS and style files should be in the git repo
  4. Pushing the new configuration and styles to the containers
  5. Reloading the application process inside the container on each EC2 Instance in the ECS Cluster

We scripted steps 2-5 and put them into Rundeck, a job scheduler and runbook automation system, so developers can update the TileServer with a single click.

Rundeck Jobs screen with recent jobs activity status presented at the bottom.

Designing and implementing an ECS Cluster was a very challenging task; at the same time, it was an opportunity for us to build a service from the ground up using the best Devops practices and methodologies. This experience demonstrated how important it is to plan ahead and to design and implement a modular deployment to ensure a stable and scalable service. Most importantly, this effort allowed us to create a self-service deployment to host our own TileServer.


Idan Shifres is a Sr. Devops Engineer at CBRE Build and a Devops Evangelist. Between finding the best Devops delivery practices and developing Terraform modules, you can probably find him in Meetups or bars looking for his next IPA beer to add to his list.