Using Pulumi to deal with growing pains

Amplemarket has witnessed incredible growth in the last year, and a lot of it is due to the constant pace of innovation within the product. This not only reveals the great talent we hold across engineering, design, and all the stakeholders contributing to the product, but it also shows the agility of our Product Engineering organization, which is essential as the team grows.

Even though we lean towards simplicity in our infrastructure technology, increasing complexity in a few areas, particularly around databases, was unavoidable in order to operate at scale. Databases are critical pieces of the puzzle with regard to:

  1. Elasticity
    As the business grows, so does the data we need to store and serve, and it's critical to have ways to easily increase the storage and compute capacity of our databases.
  2. Redundancy
    Running business-critical services means that it is simply not acceptable to have downtimes caused by having single points of failure. High availability is paramount for any piece of infrastructure our service depends on.
  3. Upgrades
    As the software evolves we want to take advantage of the latest advances in the technologies we use, with the minimum amount of effort. It is also essential for us to make sure that any security vulnerabilities found in such software are tackled as soon as mitigations are made available.
  4. Privacy
    We take our customers' data very seriously and use industry best practices to keep it secure. Encryption of data at rest is a must-have.
  5. Backups
    We need to be prepared at all times. All our data needs to be backed up in multiple locations in a secure way allowing us to quickly recover from a potential failure caused by the application or our cloud provider.

Dealing with growing pains

The cloud has become a gift from heaven for engineering teams. It has helped scale startup businesses around the globe for the last decade, thanks to how easy it is to spin up machines and host software that scales with increasing demand. However, the scrappy nature of early-stage startups means that not all cloud resources are created equal and, in some cases, configuration tweaks and short-term optimizations have led to solutions that make the properties above hard to achieve.

At Amplemarket, even if our team is characterized by its strong ability to quickly respond to and recover from incidents, we noticed it was getting harder and harder to predict future bottlenecks in the infrastructure and to limit or avoid their impact. There were a few reasons for this, and it was time to take action:

  • Infrastructure created manually
  • Redis instances running directly on GCE
  • No monitoring of critical performance indicators
  • Knowledge silos between team members
  • Configuration drift between environments and instances of the same service

Automating the cloud with Pulumi

When we started to evaluate options to automate our infrastructure, it was important for us to have:

  1. Good support for managed GCP services
  2. The ability to declare infrastructure as code
  3. A way to preview changes before applying them
  4. Support for testing changes across environments
  5. A review step with peer feedback for every change
  6. Changes applied automatically through CI pipelines

There were two main platforms that stood out and met these requirements: Terraform, the industry sweetheart for Infrastructure as Code from HashiCorp, and Pulumi, the new kid in town with some extra niceties but a similar philosophy: you, the user, declare the resources you need, and the tool figures out the state of your cloud at any point in time and makes the minimal changes required for it to match your desired final state. This way, it abstracts each individual cloud provider's APIs, giving you a lingua franca for describing your infrastructure.

The reasons

The differentiator of Pulumi is that instead of offering such an abstraction in a proprietary and somewhat limited domain-specific language (Terraform's HCL), its creators went ahead and gave you the full power of a programming language to describe your infrastructure. In fact, they give you six platforms (Node, Python, Go, .NET, Java, and YAML) that are kept on an equal footing feature-wise, and allow you to pick your poison (aka programming language) to do your actual infrastructure as code.

Now, there may be philosophical reasons to decide whether one approach or the other is better, but we went with a pragmatic one: our relatively small engineering team is made up of generalist software engineers (we don't have dedicated infrastructure engineers yet), and writing programs in maintainable and reusable ways is what we do best. Pulumi offered us an approachable option with a gentle learning curve that matched our skills.

It was also important for us to see how the tool has matured in the last few years since its inception and keeps a healthy pace of updates and innovations, but also how they cleverly make use of all existing open source Terraform modules and convert them to Pulumi packages, thus offering similar reach in supporting a significant number of cloud services out there.

Rolling out a pilot with managed Redis

We had to start somewhere, and we decided to start with Redis because we had a maintenance burden in our most critical cluster and lacked some of the properties identified in the first section of this article: we managed its instance's operating system directly and it was falling behind on updates; its setup was not highly available since we relied on disk persistence; and we were reaching the memory limits of the machine.

Since GCP offers Redis as a managed service (Memorystore), creating a Redis cluster in Pulumi was quite straightforward. Here's a code snippet from our codebase that handles all our Redis service creation at Amplemarket:

export function createRedis(config: RedisConfig): gcp.redis.Instance {
    return new gcp.redis.Instance(config.name, {
        name: config.name,
        displayName: config.name,
        memorySizeGb: config.memory_gb,
        authEnabled: true,
        tier: config.tier,
        maintenancePolicy: {
            weeklyMaintenanceWindows: [{
                day: config.maintenance_window_day,
                startTime: {
                    hours: config.maintenance_start_hours,
                    minutes: config.maintenance_start_minutes
                }
            }]
        },
        redisConfigs: getRedisConfig(config),
        redisVersion: config.redis_version
    });
}

We define a gcp.redis.Instance and fill in its arguments from a configuration structure that can vary between different clusters while keeping some sensible and secure defaults, even enforcing security policies through code with fixed parameters like authEnabled.
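The RedisConfig structure and the getRedisConfig helper used above are defined elsewhere in our codebase. As a rough sketch of what they might look like (field names inferred from the stack YAML; the mapping to Redis parameters such as maxmemory-policy is our assumption):

```typescript
// Hypothetical sketch of the configuration structure consumed by createRedis.
// Field names mirror the keys used in our stack YAML files.
interface RedisConfig {
    name: string;
    memory_gb: number;
    tier: string;
    redis_version: string;
    maintenance_window_day: string;
    maintenance_start_hours: number;
    maintenance_start_minutes: number;
    active_defrag?: boolean;
    max_memory_policy?: string;
}

// Hypothetical helper: maps the optional config fields to the key/value
// pairs accepted by the redisConfigs argument of gcp.redis.Instance.
function getRedisConfig(config: RedisConfig): Record<string, string> {
    const redisConfigs: Record<string, string> = {};
    if (config.active_defrag !== undefined) {
        redisConfigs["activedefrag"] = config.active_defrag ? "yes" : "no";
    }
    if (config.max_memory_policy !== undefined) {
        redisConfigs["maxmemory-policy"] = config.max_memory_policy;
    }
    return redisConfigs;
}
```

Centralizing the mapping in a typed helper keeps per-cluster differences in YAML while the code controls which Redis parameters can actually be set.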

The configuration structure is filled from a simple YAML configuration and, since we use TypeScript, it is type-checked, as the following Pulumi.staging.yaml shows:

ampledash-infra:redis:  
  - name: redis-sidekiq
    active_defrag: true
    maintenance_start_hours: 8
    maintenance_start_minutes: 0
    maintenance_window_day: SUNDAY
    max_memory_policy: noeviction
    memory_gb: 2
    redis_version: REDIS_6_X
    tier: BASIC

Stacks as bags of configuration

Pulumi allows configuration options to vary per environment through a structure they call stacks. These are logically segregated sets of configuration that can be applied independently using Pulumi CLI commands. In our case, since we have staging and production environments, we created two stacks: Pulumi.staging.yaml and Pulumi.production.yaml. The production stack was bootstrapped by copying the staging configuration:

pulumi stack init production --copy-config-from staging

Once we configure Pulumi’s credentials (using a PULUMI_ACCESS_TOKEN env var) and log into our cloud provider in the shell, we can preview the creation of our cluster with pulumi preview and, if everything checks out, we can make it real with pulumi up.

After a few seconds, your managed cloud infrastructure just got a new resource. Go check your GCP Console and it should be there.

Enabling an unsupported beta feature

By moving to a cloud-managed Redis service and having its lifecycle managed with Pulumi, we had already achieved many of the properties that led us to this journey, but one in particular was still a challenge: a highly available Redis setup implies a real-time replica always running alongside the main server, but if, for some unlikely reason, we were to lose both at the same time, it could mean data loss, and we don't want that. Our previous setup, by relying on disk persistence (which was automatically backed up at the filesystem level), always gave us a way to recover to a previous point in time.

It happens that Google's managed Redis service supports a feature that gives us just that: RDB snapshots (https://cloud.google.com/memorystore/docs/redis/rdb-snapshots).

However, at the time, this was still a beta feature from Google’s service, and no native Pulumi support existed to enable it. Given the fact that a Pulumi program is really just a Node app, we could theoretically just shell out to the corresponding gcloud command using the child_process package:

const cp = require('child_process');
const pulumi = require('@pulumi/pulumi');

// Hack to enable the persistence config for RDB snapshots
// (`config` is the RedisConfig structure from the enclosing scope)
function updatePersistenceConfig(instance) {
    const region = new pulumi.Config("gcp").require("region");
    const command = `gcloud beta redis instances update ${instance} --quiet --region ${region} --persistence-mode=${config.persistence_mode} --rdb-snapshot-period=${config.snapshot_period}`;

    if (pulumi.runtime.isDryRun()) {
        // During `pulumi preview`, just log the command that would run
        pulumi.log.info(command);
    } else {
        cp.execSync(command);
    }
}

redis.id.apply(updatePersistenceConfig);

This code does a few things. Let’s review them step by step:

  1. First, we declare the dependency on child_process in order to use execSync to shell out to a command.
  2. We prepare the command given an instance id and interpolate the specific persistence mode options from the configuration structure.
  3. We then go ahead and apply it only when this is not a Pulumi dry run (which is what happens when pulumi preview is run).
  4. The last line is a consequence of the async nature of Pulumi resource management, where it optimizes the runtime to do as many operations in parallel as possible. Thus, we need to create a causal relationship to only run this code once the declared redis instance is actually created and has an id output already calculated.
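Step 2 is just string construction; as a side note, it can be factored into a pure helper (hypothetical name; the real code keeps it inline) that is trivial to unit-test without touching the cloud:

```typescript
// Hypothetical helper: builds the gcloud command that enables RDB
// snapshots for a given instance. Pure string construction, so it can
// be tested in isolation from Pulumi and GCP.
function buildPersistenceCommand(
    instance: string,
    region: string,
    persistenceMode: string,   // e.g. "RDB"
    snapshotPeriod: string     // e.g. "TWENTY_FOUR_HOURS"
): string {
    return `gcloud beta redis instances update ${instance} --quiet ` +
        `--region ${region} --persistence-mode=${persistenceMode} ` +
        `--rdb-snapshot-period=${snapshotPeriod}`;
}
```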

This worked and we were able to abstract our Redis cluster creations with Pulumi, leading to a successful rollout of four new clusters in production.

Be a good Pulumi citizen with the local.Command package

Even if this worked, the previous code had a slight nuisance: every time we ran pulumi up to create or update any resource, we would re-apply the RDB snapshot setting, even when Pulumi's minimal diffing meant that no Redis cluster had to change. This was not a big problem given the command's idempotent nature, but running it for every Redis cluster on every execution caused slower runs and unnecessary noise in our cloud audit logs.

As our understanding of Pulumi increased, we noticed that the local.Command package they offer has built-in lifecycle control for resources created from shell commands. This was exactly what we wanted: our RDB setting is effectively a sub-resource of the Redis resource and follows the same lifecycle. We set it on cluster creation and don't need to change it afterwards, so running the command in tandem with the actual Pulumi-managed cluster was what we really wanted.

This was fairly straightforward to pull off with the following code:

if (config.persistence_enabled) {  
    new local.Command(`${config.name}-rdb`, {
        create: pulumi.interpolate`gcloud beta redis instances update ${instance.id} --quiet --region ${stackDefaultRegion} --persistence-mode=${google.redis.v1.PersistenceConfigPersistenceMode.Rdb} --rdb-snapshot-period=${config.snapshot_period} 2>&1`,
        delete: pulumi.interpolate`gcloud beta redis instances update ${instance.id} --quiet --region ${stackDefaultRegion} --persistence-mode=${google.redis.v1.PersistenceConfigPersistenceMode.Disabled} 2>&1`
    }, { parent: instance });
}

Voilà! Pulumi was now happy to manage our Redis beta feature for us!

Final Thoughts

From a PoC in one week to reaching production with confidence the week after, we're super impressed by Pulumi's focus on developer experience and its general ease of use. We have since expanded our usage of Pulumi to manage our infrastructure and have a GitHub Actions-powered workflow with automatic PR previews that is a breeze to use. We will share more insights in follow-up blog posts, as the experience has been quite positive.

It also allowed us to have all our infrastructure configurations in a single place and become systematic about what and how to monitor these resources in production.

This approach allowed our team to become much less reactive to failures in production and move to a more proactive cloud management practice, where we:

  1. Get alerted of approaching limits in the infrastructure
  2. Easily test upgrades
  3. Securely roll out configuration changes
  4. Quickly scale up instances
  5. Reuse building blocks when adding new infrastructure

And by doing so, we can focus on adding new features to our product and keep empowering Sales teams across the globe.

Note: We're still hiring an Infrastructure Engineer, so if this article is your type of thing, please do reach out. It may so happen that you're the right person to help us improve and build on top of this setup.

Tiago Sousa

A passionate technologist working from Lisbon in multiple technical leadership roles across growing startups. Designing developer-friendly APIs, scalable systems and observable deployment pipelines.

Lisbon, Portugal
