Enabling Security Guardrails: Infra as Code with CDK for Terraform

In this post, we describe how the Zip security team leveraged the Python CDK for Terraform (CDKTF) to enforce security guardrails for our AWS infrastructure. We provide example configurations and code to help other security teams build their own secure AWS infrastructure-as-code.

Picture the changes to your cloud infrastructure as cars on a busy road: they move fast, and a bad or dangerous one can cause chaos. We'll demonstrate how to deploy resources securely without disrupting the normal flow of infrastructure work.

Background

As at many early-stage startups, Zip’s AWS infrastructure was primarily managed through click-ops: making changes by hand in the web console or CLI. This was not ideal from a security perspective, but it let developers build products quickly and easily.

As the company grew, the number of infrastructure engineers with AWS administrative rights increased. For most, these permissions were excessive. Click-ops also made it difficult to require changes to be reviewed by a peer, and there was limited visibility into the impact of changes.

As the security team, we wanted to up-level our infrastructure by limiting the number of AWS admins, enforcing reviews for all changes, improving auditability, and providing guardrails, while still enabling developers to build infrastructure with confidence.

We evaluated multiple solutions and decided to migrate our infrastructure to infrastructure-as-code (IaC) using Terraform CDK (the Cloud Development Kit for Terraform, or CDKTF) with Python. Terraform CDK is a layer on top of Terraform that lets you define infrastructure in TypeScript, Python, Java, C#, or Go. Using a general-purpose programming language gives us more flexibility in how we define and deploy our infrastructure. While there are many security best-practice resources and tools for Terraform with HCL, there are few for Terraform CDK.

In this blog post, we’ll show how we leveraged Terraform CDK to provide a set of powerful security tools and guardrails, leading to a 95% reduction in AWS admins and 100% removal of click-ops for critical production resources. Our goal will be to demonstrate how we implemented the following:

Reduction in security attack surface by limiting admin-level permissions to make changes to IAM, RDS, and other sensitive resources
Secure-by-default ways for anyone to create a resource, validate the change, and apply it
Removal of higher-level roles and policies that let human users make changes through click-ops, limiting an attacker's potential leverage

Stay tuned for our next blog post for more details on how we conducted our evaluation, set up the CI/CD, imported the state, and codified our resources.

Designing the repo structure

Our Terraform folder is structured to separate all of our environments into their own stacks. We also created a set of secure templates for resources, which are shared across our infrastructure. As we show in the next section, these templates are used by developers to instantiate resources with our secure defaults.

/secure_templates
├── s3.py
├── iam.py
├── rds.py
└── ...
/production
├── resources
│   ├── rds.py
│   ├── iam.py
│   └── s3.py
└── main.py
/dev
├── resources
│   ├── iam.py
│   └── s3.py
└── main.py
...

Each stack has a dedicated IAM role that GitHub Actions assumes via OIDC. Because of the sensitivity of the permissions these roles carry, we run the Terraform workflows on their own runners, isolated from our existing infrastructure.
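As an illustration of how a per-stack role can be scoped, the following is a minimal sketch of an IAM trust policy that allows only one GitHub repository and branch to assume a role through GitHub's OIDC provider. The account ID, repository, and branch are hypothetical placeholders, and build_trust_policy is an illustrative helper rather than code from our repository.

```python
import json

GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def build_trust_policy(account_id: str, repo: str, branch: str) -> dict:
    """Return a trust policy restricting role assumption to one repo and branch."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Audience used when GitHub Actions requests AWS credentials.
                        f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com",
                        # Subject claim pins the token to a single repo and branch.
                        f"{GITHUB_OIDC_HOST}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
                    }
                },
            }
        ],
    }


# Hypothetical values for illustration only.
policy = build_trust_policy("111111111111", "example-org/infrastructure", "main")
print(json.dumps(policy, indent=2))
```

Pinning the subject claim to a specific branch means a workflow run from a feature branch or a fork cannot assume the production role, even though it runs in the same repository.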

Creating secure default resource templates

Instead of allowing developers to use the base Terraform CDK libraries directly, the security team built custom Python classes that wrap the Terraform resources. This let us define secure configurations and prevent developers from creating dangerously configured resources, while keeping the flexibility of Python constructs.

For example, in our secure template for RDS, we do not allow databases to be publicly accessible unless they are in an allowlist.

# /secure_templates/rds.py

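from cdktf import TerraformStack
from cdktf_cdktf_provider_aws.db_instance import DbInstance

# Databases explicitly approved for public access.
ALLOWED_PUBLICLY_ACCESSIBLE_DB_NAMES = ["public_db_1", "public_db_2"]


# Assumed definition: the exception type raised by the templates.
class SecurityException(Exception):
    """Raised when a resource violates a security guardrail."""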


class DatabaseInstance(DbInstance):
    """AWS DB instance."""

    def __init__(
        self,
        stack: TerraformStack,
        db_name: str,
        tags: dict[str, str],
        multi_az: bool = False,
        storage_encrypted: bool = True,
        publicly_accessible: bool = False,
        **kwargs,
    ):
        """
        Constructs a new DB instance.

        :param stack: CDKTF stack.
        :param db_name: Name of the DB instance.
        :param tags: Tags to apply to the DB instance.
        :param multi_az: Whether to create a multi-AZ DB instance.
        :param storage_encrypted: Whether to encrypt the storage.
        :param publicly_accessible: Whether the DB instance is publicly accessible. Default false.
        """
        if (
            publicly_accessible is True
            and db_name not in ALLOWED_PUBLICLY_ACCESSIBLE_DB_NAMES
        ):
            raise SecurityException(
                "This database cannot be public. Please reach out to security@ for more details."
            )
        super().__init__(
            stack,
            id_=db_name,
            multi_az=multi_az,
            storage_encrypted=storage_encrypted,
            publicly_accessible=publicly_accessible,
            tags=tags,
            **kwargs,
        )
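A guard like this can also be factored into a plain function so that it can be unit tested without synthesizing a stack. The sketch below assumes that approach; check_public_access is a hypothetical helper, and SecurityException is assumed to be a simple Exception subclass.

```python
ALLOWED_PUBLICLY_ACCESSIBLE_DB_NAMES = frozenset({"public_db_1", "public_db_2"})


class SecurityException(Exception):
    """Raised when a requested configuration violates a guardrail (assumed definition)."""


def check_public_access(db_name: str, publicly_accessible: bool) -> None:
    """Raise SecurityException if a non-allowlisted database requests public access."""
    if publicly_accessible and db_name not in ALLOWED_PUBLICLY_ACCESSIBLE_DB_NAMES:
        raise SecurityException(
            f"Database {db_name!r} cannot be public. "
            "Please reach out to security@ for more details."
        )


# Private databases and allowlisted public databases pass silently.
check_public_access("zip-db", publicly_accessible=False)
check_public_access("public_db_1", publicly_accessible=True)

# A public database outside the allowlist is rejected.
try:
    check_public_access("zip-db", publicly_accessible=True)
except SecurityException as exc:
    print(f"blocked: {exc}")
```

Keeping the rule in one function makes it straightforward to cover with ordinary tests and to reuse across templates that need the same check.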

Developers can use this template class to instantiate their databases in a resource file like this:

# /production/resources/databases.py

from cdktf import TerraformStack
from cdktf_cdktf_provider_aws.provider import AwsProvider

from secure_templates.rds import DatabaseInstance


def generate_databases(stack: TerraformStack, tags: dict[str, str], other_providers: dict[str, AwsProvider]):
    DatabaseInstance(
        stack,
        db_name="zip-db",
        tags=tags,
    )
    ...

In our main.py, where we create our stack, we import generate_databases and call it when the stack is built. Developers get a securely configured resource without needing to know the secure defaults that the template already applies.

# /production/main.py

from cdktf import App, S3Backend, TerraformStack
from cdktf_cdktf_provider_aws.provider import AwsProvider, AwsProviderDefaultTags
from constructs import Construct

from resources.databases import generate_databases
#from ...resources... import ...functions... 

class ProdStack(TerraformStack):
    """Stack for Prod AWS Account."""

    def __init__(self, scope: Construct, id: str):
        super().__init__(scope, id)
        self.tags = {
            "env": f"{ENVIRONMENT}",
            "team": f"{TEAM}",
            "terraform-managed": "true",
            "zip:cost-allocation": "production",
        }
        self.load_resources()

    def load_resources(self):
        generate_databases(self, self.tags, self.other_providers)
        generate_s3(...)
        generate_iam(...)

To protect our secure templates, we include entries in CODEOWNERS that make the security team a required reviewer for any pull request that changes the secure_templates folder or the production stack definition.

# .github/CODEOWNERS

secure_templates/* @ziphq/security
production/main.py @ziphq/security
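Because this protection depends on these lines staying in place, one option is a small CI check that fails if a required entry disappears. The following is a sketch under that assumption; REQUIRED_OWNERS, parse_codeowners, and missing_owner_entries are illustrative names, and the check only compares literal patterns rather than reimplementing GitHub's pattern matching.

```python
REQUIRED_OWNERS = {
    "secure_templates/*": "@ziphq/security",
    "production/main.py": "@ziphq/security",
}


def parse_codeowners(text: str) -> dict[str, set[str]]:
    """Map each literal pattern in a CODEOWNERS file to its set of owners."""
    entries: dict[str, set[str]] = {}
    for raw_line in text.splitlines():
        tokens = []
        for token in raw_line.split():
            if token.startswith("#"):  # rest of the line is a comment
                break
            tokens.append(token)
        if not tokens:
            continue
        pattern, *owners = tokens
        # If a pattern is repeated, the later line replaces the earlier one.
        entries[pattern] = set(owners)
    return entries


def missing_owner_entries(text: str) -> list[str]:
    """Return the required patterns that are absent or lack the required owner."""
    entries = parse_codeowners(text)
    return [
        pattern
        for pattern, owner in REQUIRED_OWNERS.items()
        if owner not in entries.get(pattern, set())
    ]


example = """\
# .github/CODEOWNERS
secure_templates/* @ziphq/security
production/main.py @ziphq/security
"""
print(missing_owner_entries(example))
```

A CI job could read the real file, call missing_owner_entries, and fail the build when the returned list is not empty.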

Preventing click-ops and enforcing IaC

As we migrated our infrastructure into IaC, we wanted to restrict our engineering team from making changes to AWS via the AWS console and CLI.

For all of our resources defined in Terraform, we added a terraform-managed tag:

# production/main.py
from resources.databases import generate_databases

class ProdStack(TerraformStack):
    """Stack for AWS Terraform Account."""

    def __init__(self, scope: Construct, id: str):
        super().__init__(scope, id)
        # Every resource created through this stack inherits these tags.
        self.tags = {
            "terraform-managed": "true",
            # ... other tags
        }
        self.load_resources()
        ...

    def load_resources(self):
        generate_databases(self, self.tags, self.other_providers)


# production/resources/databases.py
def generate_databases(stack, tags, other_providers):
    DatabaseInstance(
        stack,
        db_name="zip-db",
        tags=tags,
    )
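Because this tag marks which resources are under Terraform's control, it can help to make it impossible to omit or override from a resource file. A minimal sketch of one way to do that, assuming a hypothetical with_required_tags helper that a secure template could call before passing tags to the underlying resource:

```python
REQUIRED_TAGS = {"terraform-managed": "true"}


def with_required_tags(tags: dict[str, str]) -> dict[str, str]:
    """Return a copy of tags with the required tags applied last so they always win."""
    return {**tags, **REQUIRED_TAGS}


# A caller that forgets the tag, or sets it incorrectly, still gets the right value.
print(with_required_tags({"team": "infra"}))
print(with_required_tags({"team": "infra", "terraform-managed": "false"}))
```

Applying the required tags inside the template, rather than trusting each caller, keeps the tag consistent even when a resource file builds its own tag dictionary.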

Using a service control policy (SCP), we deny non-read access to these terraform-managed resources for all principals except a few emergency on-call engineers and the Terraform runner. With this SCP in place, every change to these resources must go through our Terraform change process, which includes mandatory code review and CI checks.

{
    "Effect": "Deny",
    "NotAction": [
        "tags:List*",
        "iam:Get*",
        "iam:List*",
        "ec2:Describe*",
        ... // all other read only permissions
    ],
    "Resource": "*",
    "Condition": {
        "StringEquals": {
            "aws:ResourceTag/terraform-managed": "true"
        },
        "StringNotLike": {
            "aws:PrincipalARN": [
                "arn:aws:sts::*:assumed-role/*/oncall@ziphq.com*",
                "arn:aws:iam::*:role/aws-reserved/sso.amazonaws.com/*/AWSReservedSSO_admin_*",
                "arn:aws:iam::*:role/terraform-runner"
            ]
        }
    }
}
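The statement above is a fragment; an SCP attached in AWS Organizations needs the standard policy envelope around it. As a sketch, the following Python builds a complete policy document from a statement of this shape. The READ_ONLY_ACTIONS list is deliberately abbreviated, the Sid value is a made-up label, the principal patterns are copied from the statement above, and build_scp is an illustrative helper.

```python
import json

# Abbreviated for illustration; a real policy would list every read-only action.
READ_ONLY_ACTIONS = ["iam:Get*", "iam:List*", "ec2:Describe*"]

# Principals exempt from the deny, copied from the statement above.
EXEMPT_PRINCIPAL_PATTERNS = [
    "arn:aws:sts::*:assumed-role/*/oncall@ziphq.com*",
    "arn:aws:iam::*:role/aws-reserved/sso.amazonaws.com/*/AWSReservedSSO_admin_*",
    "arn:aws:iam::*:role/terraform-runner",
]


def build_scp(read_only_actions: list[str], exempt_principals: list[str]) -> dict:
    """Wrap the deny statement in a complete policy document."""
    statement = {
        "Sid": "DenyWritesToTerraformManaged",  # hypothetical label
        "Effect": "Deny",
        # Everything except the listed read-only actions is denied.
        "NotAction": read_only_actions,
        "Resource": "*",
        "Condition": {
            "StringEquals": {"aws:ResourceTag/terraform-managed": "true"},
            "StringNotLike": {"aws:PrincipalARN": exempt_principals},
        },
    }
    return {"Version": "2012-10-17", "Statement": [statement]}


scp_json = json.dumps(build_scp(READ_ONLY_ACTIONS, EXEMPT_PRINCIPAL_PATTERNS), indent=2)
print(scp_json)
```

Generating the document from Python lists keeps the exempt principals in one reviewable place, so any change to who can bypass the guardrail shows up as a small diff in a pull request.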

Audit logging and visibility

Across all of these steps (deploying resources, securing them, and preventing misconfigured infrastructure from being created), we wanted visibility at every layer. To achieve this, we identified the following signals for our telemetry:

GitHub Actions runner logs, to see who made which change and when
CODEOWNERS entries for sensitive resources, so that security is both notified of changes and required to review them
Specific alerts for the Terraform runner, including tracking denies on terraform-managed resources when the SCP is triggered
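For the last signal, here is a rough sketch of how SCP denies might be picked out of CloudTrail records. The field names (eventTime, eventSource, eventName, errorCode, errorMessage, userIdentity.arn) follow the CloudTrail record format; the set of error codes and the message substring used to recognize an SCP deny are assumptions that should be checked against real events before alerting on them. The sample records are made up.

```python
ACCESS_DENIED_CODES = {"AccessDenied", "AccessDeniedException", "Client.UnauthorizedOperation"}
SCP_MARKER = "service control policy"  # assumed substring of SCP deny messages


def is_scp_deny(record: dict) -> bool:
    """True if a CloudTrail record looks like a call denied by an SCP."""
    message = (record.get("errorMessage") or "").lower()
    return record.get("errorCode") in ACCESS_DENIED_CODES and SCP_MARKER in message


def summarize_scp_denies(records: list[dict]) -> list[dict]:
    """Return who attempted which action, and when, for each suspected SCP deny."""
    return [
        {
            "time": r.get("eventTime"),
            "principal": r.get("userIdentity", {}).get("arn"),
            "action": f"{r.get('eventSource')}:{r.get('eventName')}",
        }
        for r in records
        if is_scp_deny(r)
    ]


# Hypothetical records for illustration only.
sample_records = [
    {
        "eventTime": "2024-01-01T00:00:00Z",
        "eventSource": "rds.amazonaws.com",
        "eventName": "ModifyDBInstance",
        "userIdentity": {"arn": "arn:aws:sts::111111111111:assumed-role/developer/alice"},
        "errorCode": "AccessDenied",
        "errorMessage": "... with an explicit deny in a service control policy",
    },
    {
        "eventTime": "2024-01-01T00:05:00Z",
        "eventSource": "rds.amazonaws.com",
        "eventName": "DescribeDBInstances",
        "userIdentity": {"arn": "arn:aws:sts::111111111111:assumed-role/developer/alice"},
    },
]

for deny in summarize_scp_denies(sample_records):
    print(deny)
```

Each summary could then be forwarded to whatever alerting channel the team already uses.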

Results

Using our templates, we ensured all of our critical production resources had correct configurations. During the migration, the team identified and fixed a few minor misconfigurations as we built our secure templating pathway. This included updating security groups to reflect the correct inbound and outbound rules, while also making exceptions for specific use cases. We also staged unused resources for removal to reduce excess attack surface.

By using our SCP to enforce Terraform use, we achieved 100% code review for IAM, S3, RDS, and security group changes, and reduced the number of AWS admins by 95%.

Special thanks to the team at Zip who helped with these achievements:

Eric Zhang for supporting the secure architecture design and coordinating the migration. If you enjoyed this post and are interested in joining the Zip security team, please reach out!
Chris Zhen, Kaifeng Yao, and the rest of Zip’s infrastructure team for providing feedback on our plans and being the first users of our IaC.

Up next

Stay tuned for our next blog post for more details on our evaluation, how we set up the CI/CD, how we imported the state, and how we codified our infrastructure.

Evaluating different TACOs (Terraform Automation and Collaboration Software)
Setting up Terraform CI/CD
Authenticating the Terraform Runner with GitHub
Importing our AWS Infrastructure into Terraform State
Codifying our AWS Infrastructure in Python