Let’s talk about Terragrunt and why you might want to use it to deploy your workloads.

When working on a project deployed via Terraform in AWS, especially a complicated one (microservices, or a large environment with many developers making changes to it), you may run into a few issues:

  1. Multiple changes get pushed at once, but you can only make one change at a time due to state locking.
  2. Resources belonging to components or applications other than the one you’re working on get modified when you push an update to your application.
  3. Someone accidentally merges an old commit - this also rolls back your recent changes!

When a single state file stores the entire environment, it becomes a monolith. That can work well for small projects; however, as you grow in size and complexity, it becomes burdensome. Even worse, with microservices, you may get stuck managing your infrastructure in a monolith-like manner simply because your Terraform state is monolithic, and making independent changes becomes tedious.

The complexity associated with managing your Terraform state only grows as your infrastructure does, so it’s wise to plan your strategy for state management early on.

Splitting your state

Splitting the Terraform state is simple if you only want one per environment. Still, as you try to split it further, you’ll run into tedious complexities: you’ll have to create a separate Terraform project, with its own backend, for every component.

This article explains the benefits of splitting your state, how we can do it in vanilla Terraform, and how you can use Terragrunt to make this significantly more straightforward. So, let’s start deciphering how exactly you should split your state.

Architecture Diagram

How much should you split it?

There’s a great article by Gustav Karlsson that goes over this, but I’ll give you the TLDR version.

Splitting by environment

Splitting the state by environment or account is most likely something you’re already doing. Most of us use a multi-account structure, and since each environment uses its own IAM role, there is often a state file for each environment, stored in its own S3 bucket.

Splitting by component

Below is what Yevgeniy Brikman, co-founder of Gruntwork and author of Terraform: Up & Running, recommends.

Splitting by component involves creating a separate Terraform project per component, like so:

 └ vpc
     └ main.tf
     └ outputs.tf
     └ variables.tf
     └ backend.tf
     └ provider.tf
 └ application
     └ pipeline
         └ codepipeline.tf
         └ codebuild.tf
         └ outputs.tf
         └ variables.tf
         └ backend.tf
         └ provider.tf
     └ app
         └ lambda.tf
         └ outputs.tf
         └ variables.tf
         └ backend.tf
         └ provider.tf
     └ data
         └ dynamodb
             └ dynamo.tf
             └ outputs.tf
             └ variables.tf
             └ backend.tf
             └ provider.tf
         └ events
             └ sqs.tf
             └ eventbridge.tf
             └ outputs.tf
             └ variables.tf
             └ backend.tf
             └ provider.tf

As you can see, we are splitting the state across each component - a component being a small subset of your infrastructure that makes sense to deploy together. You should also try to isolate resources that shouldn’t be modified together - for example, you wouldn’t want changes to your lambda functions to impact your database.

DRYing your code

One of the issues with splitting the state is copying large chunks of code across your code base to ensure each component with a separate state file has a provider and backend defined. Copying code might be acceptable when starting your project, but with such a painfully tedious way of managing your state, it’s almost certain that the people maintaining the platform will become sloppy and start deviating from your methodology.
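Concretely, in vanilla Terraform every component ends up carrying near-identical boilerplate like the sketch below (the bucket name, account ID, and region are illustrative). Terraform does not allow variables inside backend blocks, so this duplication cannot be DRYed up with plain Terraform alone:

```hcl
# backend.tf + provider.tf, copied into every component
terraform {
    backend "s3" {
        bucket         = "tfstate-storage-123456789012" # illustrative
        key            = "vpc.tfstate"                  # must be edited per component
        region         = "ap-southeast-2"
        encrypt        = true
        dynamodb_table = "tfstate-lock"
    }
}

provider "aws" {
    region = "ap-southeast-2"
}
```

Forgetting to change the `key` when copying this into a new component silently points two components at the same state file.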

Introducing Terragrunt

Introducing Terragrunt - it isn’t a replacement for Terraform; it’s a wrapper. You use the Terragrunt tool to execute your Terraform. However, it doesn’t just run it; it can also:

  • Generate bits of Terraform code for you.
  • Call Terragrunt-specific functions to generate parts of this code dynamically.
  • Run hooks before or after your runs, or even on error (such as rolling back to a previous infrastructure version).

We will use some of these features to create a DRY configuration that lets us easily manage multiple AWS accounts by dynamically referencing an AWS IAM role, splitting the state using a root configuration inherited by our Terraform projects.

Practical Example

Let’s start by reviewing our example deployment and the required resources to get started.

Note: by following this guide, you may incur AWS costs and will deploy resources into your account. Proceed at your own risk.

The situation
Our manager has just reached out to us to set up a new AWS environment, as the company has opted to migrate to the platform (good choice!). So we’ve got the difficult job of determining how exactly to do that. We know we want to use IaC, and we’ve settled on Terraform, but how do we ensure we deploy using best practices and make things convenient for our engineers, reducing the risk of drifting from our established practices?

We’ve decided to use Terragrunt because it will let us define one root configuration, separating our state and making deployment easy. In addition, we’ve opted to go with the following structure for our code:

 └ accounts
     └ test
         └ terragrunt.hcl
         └ vpc
             └ main.tf
             └ terragrunt.hcl
         └ jumphost
             └ main.tf
             └ terragrunt.hcl
 └ modules
     └ networking
         └ vpc
             └ <module_files>
         └ jumphost
             └ <module_files>

So, what will the above project do for us?

  • A root terragrunt.hcl file that defines our provider, the IAM role we’ll assume, and our S3 backend.
  • Our Terraform configuration will be split by account.
  • Each account will have sub-modules with their own terragrunt.hcl file and their own state file.
  • A centralised modules directory to write generic modules we can use across our accounts.

Requirements to proceed:

  1. Create an S3 bucket in your test AWS account called tfstate-storage-<AWS-Account-ID>, replacing <AWS-Account-ID> with your AWS account ID.
  2. Create a DynamoDB table in your account called tfstate-lock; it should have one partition key called LockID.
  3. Create an AWS IAM role called tf-role with administrative permissions and a trust policy allowing only your IAM user to assume it.
  4. Download and configure the AWS CLI with an IAM user that can assume the tf-role.
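If you’d rather bootstrap these prerequisites with code, a one-off Terraform configuration applied with local state could look roughly like the sketch below. The trust principal ARN is a placeholder you’d replace with your own IAM user:

```hcl
# bootstrap/main.tf - one-off config, applied once with local state (a sketch)
data "aws_caller_identity" "current" {}

resource "aws_s3_bucket" "tfstate" {
    bucket = "tfstate-storage-${data.aws_caller_identity.current.account_id}"
}

resource "aws_dynamodb_table" "tfstate_lock" {
    name         = "tfstate-lock"
    billing_mode = "PAY_PER_REQUEST"
    hash_key     = "LockID"

    attribute {
        name = "LockID"
        type = "S"
    }
}

resource "aws_iam_role" "tf_role" {
    name = "tf-role"
    assume_role_policy = jsonencode({
        Version = "2012-10-17"
        Statement = [{
            Effect    = "Allow"
            Action    = "sts:AssumeRole"
            Principal = { AWS = "arn:aws:iam::111111111111:user/your-user" } # placeholder
        }]
    })
}

resource "aws_iam_role_policy_attachment" "admin" {
    role       = aws_iam_role.tf_role.name
    policy_arn = "arn:aws:iam::aws:policy/AdministratorAccess"
}
```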

Creating the Terragrunt files

First, I’ll show you the main terragrunt.hcl file and explain what it means.

# accounts/test/terragrunt.hcl
generate "provider" {
    path = "provider.tf"
    if_exists = "overwrite_terragrunt"
    contents = <<EOF
provider "aws" {
    region = "ap-southeast-2"
    assume_role {
        role_arn = "arn:aws:iam::${get_aws_account_id()}:role/tf-role"
    }
}
EOF
}

remote_state {
    backend = "s3"
    generate = {
        path = "backend.tf"
        if_exists = "overwrite_terragrunt"
    }
    config = {
        bucket = "tfstate-storage-${get_aws_account_id()}"
        key = "${path_relative_to_include()}.tfstate"
        region = "ap-southeast-2"
        encrypt = true
        dynamodb_table = "tfstate-lock"
    }
}
Our code is telling Terragrunt to do a few things:

  1. Generate a provider.tf file in each path containing a terragrunt.hcl file that we initialise.
  2. That provider file defines our provider (AWS), the region, and the role we’ll use to authenticate. It looks for the role in the account we’re authenticating with; it does this by checking our AWS account ID using the get_aws_account_id() function.
  3. Generate a backend.tf file in each path containing a terragrunt.hcl file that we initialise.
  4. The backend is set to S3, and the bucket to tfstate-storage-<AWS-Account-ID>, once again using that get_aws_account_id() function.
  5. Now, the critical part - Terragrunt names the state file after the component’s path relative to our root directory (e.g. vpc.tfstate or jumphost.tfstate).

You may wish to replace `path_relative_to_include()` with a variable or include your AWS account ID in the key; otherwise, it would be possible to accidentally execute this in the wrong account.
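To make this concrete: for the accounts/test/vpc component, the backend.tf that Terragrunt generates would look roughly like this (the account ID is illustrative):

```hcl
# backend.tf - generated by Terragrunt for accounts/test/vpc
terraform {
    backend "s3" {
        bucket         = "tfstate-storage-123456789012" # from get_aws_account_id()
        key            = "vpc.tfstate"                  # from path_relative_to_include()
        region         = "ap-southeast-2"
        encrypt        = true
        dynamodb_table = "tfstate-lock"
    }
}
```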

We can tell Terragrunt to inherit this configuration in each of our sub-modules by creating the following files:

# accounts/test/vpc/terragrunt.hcl

include "root" {
    path = find_in_parent_folders()
}

inputs = {
    trusted_hosts = {
        rdp = ["xx.xx.xx.xx/32"]
        ssh = ["xx.xx.xx.xx/32"]
    }
    vpc_config = {
        name = "test-vpc"
        cidr_block = ""
    }
}

# accounts/test/jumphost/terragrunt.hcl

include "root" {
    path = find_in_parent_folders()
}

dependency "vpc" {
    config_path = "../vpc"
}

inputs = {
    ssh_key = "ssh-rsa ..."
    subnet_id = dependency.vpc.outputs.public_subnets["ap-southeast-2a"].id
    security_group_ids = [
        ...
    ]
}

Our code tells Terragrunt to look up our directory tree for another terragrunt.hcl file it can inherit. Unfortunately, at this time, a Terragrunt configuration file can only inherit one file, so you can’t inherit a file that inherits a file. We’re also passing our desired configuration values for our module to Terragrunt.

We could also add specific input parameters to our root file, allowing us to inherit values across our whole account.
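For example, account-wide tags could be added to the root terragrunt.hcl and would then be available as an input to every component (the variable name and values here are hypothetical). Terragrunt shallow-merges the root’s inputs with each component’s, with the component taking precedence:

```hcl
# accounts/test/terragrunt.hcl (appended) - hypothetical account-wide inputs
inputs = {
    tags = {
        environment = "test"
        managed_by  = "terragrunt"
    }
}
```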

Creating the Terraform files

Now, we can create our standard Terraform files. We can reference our modules as usual.

# accounts/test/vpc/main.tf

variable "trusted_hosts" {type = map(list(string))}
variable "vpc_config" {type = map(string)}

module "vpc" {
    source = "../../../modules/networking/vpc"
    trusted_hosts = var.trusted_hosts
    vpc_config = var.vpc_config
}

output "vpc_id" {value = module.vpc.vpc_id}
output "public_subnets" {value = module.vpc.public_subnets}
output "private_subnets" {value = module.vpc.private_subnets}
output "security_groups" {value = module.vpc.security_groups}

# accounts/test/jumphost/main.tf

variable "ssh_key" {type = string}
variable "subnet_id" {type = string}
variable "security_group_ids" {type = list(string)}

module "jumphost" {
    source = "../../../modules/networking/jumphost"
    ssh_key = var.ssh_key
    subnet_id = var.subnet_id
    security_group_ids = var.security_group_ids
}

output "jumphost_hostname" {value = module.jumphost.jumphost_hostname}
output "jumphost_id" {value = module.jumphost.jumphost_id}
output "jumphost_public_ip_address" {value = module.jumphost.jumphost_public_ip_address}

As you can see, we can define our Terraform normally, and we can still reference our VPC module’s outputs using dependencies in the Terragrunt configuration file.

Deploying the infrastructure

Terragrunt allows us to deploy multiple components at once, but first we need to apply the vpc component, as the jumphost component relies on its outputs.

In your working directory, open the terminal and run the following commands:

michael@device~ cd accounts/test/vpc
michael@device~ terragrunt init # wait for terragrunt to initialise the module.
michael@device~ terragrunt plan # Confirm you're happy with the plan, if so then...
michael@device~ terragrunt apply # Apply the changes! (Ooooh Spooky)

If that deploys successfully, then you can proceed to do the same thing on the jumphost component.

michael@device~ cd accounts/test/jumphost
michael@device~ terragrunt init # initialise it... again... -_-
michael@device~ terragrunt plan # If you don't get an error it's *probably* fine
michael@device~ terragrunt apply # Apply these changes too!

You’ve successfully deployed some infrastructure with Terragrunt.

You can confirm your state has been split successfully by checking your S3 bucket. You should see one state object per component (vpc.tfstate and jumphost.tfstate):

Architecture Diagram

Production Tips


Hooks

Terragrunt supports before, after, and error hooks that let you define tasks that run before, after, and on errors. Consider using these, especially a tflint hook, to validate the Terraform code executed by Terragrunt.
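As a sketch, a before hook that runs tflint ahead of every plan and apply (assuming tflint is installed and on your PATH) could be added to a terragrunt.hcl like so:

```hcl
terraform {
    before_hook "tflint" {
        commands = ["plan", "apply"]  # run before these Terraform commands
        execute  = ["tflint"]         # a non-zero exit aborts the run
    }
}
```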

Pipeline Automation

You can use Terragrunt in a CI/CD solution like GitHub Actions or CodePipeline. However, there are several things you should consider:

  • You can use the terragrunt run-all plan and terragrunt run-all apply commands from one of your account directories, but this has downsides. Ideally, set up smaller runners that execute terragrunt plan and terragrunt apply for each component concurrently. Deploying components separately wouldn’t slow down your pipeline, would allow you to selectively approve specific components, and would let developers deploy changes to different components at the same time without worrying about locking each other out.

  • If you do use terragrunt run-all plan and terragrunt run-all apply, be aware that Terragrunt will execute modules in parallel; if this causes issues, it supports restricting the concurrency (for example, via the --terragrunt-parallelism flag).