Using Pulumi to deal with growing pains

Amplemarket has witnessed incredible growth in the last year, and a lot of it is due to the constant pace of innovation within the product. This not only reveals the great talent we hold across engineering, design, and all the stakeholders contributing to the product, but it also shows the agility of our Product Engineering organization, which is essential as the team grows.

Even though we lean towards simplicity in our infrastructure technology, increasing complexity in a few areas, particularly around databases, was unavoidable in order to operate at scale. Databases are critical pieces of the puzzle with regard to:

  1. Elasticity
    As the business grows, so does the data we need to store and serve, and it's critical to have ways to easily increase the storage and compute capacity of our databases.
  2. Redundancy
    Running business-critical services means that it is simply not acceptable to have downtimes caused by having single points of failure. High availability is paramount for any piece of infrastructure our service depends on.
  3. Upgrades
    As the software evolves we want to take advantage of the latest advances in the technologies we use, with the minimum amount of effort. It is also essential for us to make sure that any security vulnerabilities found in such software are tackled as soon as mitigations are made available.
  4. Privacy
    We take our customers' data very seriously and use industry best practices to keep it secure. Encryption of data at rest is a must-have.
  5. Backups
    We need to be prepared at all times. All our data needs to be backed up in multiple locations in a secure way allowing us to quickly recover from a potential failure caused by the application or our cloud provider.

Dealing with growing pains

The cloud has become a gift from heaven for engineering teams. It has helped scale startup businesses around the globe for the last decade, thanks to how easy it is to spin up machines and host software that scales with increasing demand. However, the scrappy nature of early-stage startups means that not all cloud resources are created equal and, in some cases, configuration tweaks and short-term optimizations have led to solutions that make the properties above hard to achieve.

At Amplemarket, even if our team is characterized by its strong ability to quickly respond to and recover from incidents, we noticed it was getting harder and harder to predict future bottlenecks in the infrastructure and to limit or avoid their impact. There were a few reasons for this, and it was time to take action:

  • Infrastructure created manually
  • Redis instances running directly on GCE
  • No monitoring of critical performance indicators
  • Knowledge silos between team members
  • Configuration drift between environments and instances of the same service

Automating the cloud with Pulumi

When we started to evaluate options to automate our infrastructure, it was important for us to have:

  1. Good support for managed GCP services
  2. The ability to declare infrastructure as code
  3. A way to preview changes before applying them
  4. Support for testing changes across environments
  5. A review step with peer feedback for every change
  6. Changes applied automatically through CI pipelines

There were two main platforms that stood out and met these requirements: Terraform, the industry sweetheart for Infrastructure as Code from HashiCorp, and Pulumi, the new kid in town with some extra niceties but a similar philosophy: you, the user, declare the resources you need, and the tool figures out the state of your cloud at any point in time and makes the minimal changes required for it to match your desired final state. This way, it abstracts each individual cloud provider's APIs, giving you a lingua franca for describing your infrastructure.

The reasons

The differentiator of Pulumi is that instead of offering such an abstraction in a proprietary and somewhat limited domain-specific language (Terraform's HCL), its creators went ahead and gave you the full power of a programming language to describe your infrastructure. In fact, they give you six platforms (Node, Python, Go, .NET, Java, and YAML) that are kept on an equal footing feature-wise, and allow you to pick your poison (aka programming language) to do your actual infrastructure as code.

Now, there may be philosophical reasons to decide whether one approach or the other is better, but we went with a pragmatic one: our relatively small engineering team is made up of generalist software engineers (we don't have dedicated infrastructure engineers yet), and writing programs in maintainable and reusable ways is what we do best. Pulumi offered us an approachable option with a gentle learning curve that matched our skills.

It was also important for us to see how the tool has matured in the last few years since its inception and keeps a healthy pace of updates and innovations, but also how they cleverly make use of all existing open source Terraform modules and convert them to Pulumi packages, thus offering similar reach in supporting a significant number of cloud services out there.

Rolling out a pilot with managed Redis

We had to start somewhere, and we decided to start with Redis because we had a maintenance burden in our most critical cluster and lacked some of the properties identified in the first section of this article: we managed its instance's operating system directly and it was falling behind on updates; its setup was not highly available since we relied on disk persistence; and we were reaching the memory limits of the machine.

Since GCP offers Redis as a managed service (Memorystore), creating a Redis cluster in Pulumi was quite straightforward. Here's a code snippet from our codebase that handles all our Redis service creation at Amplemarket:

export function createRedis(config: RedisConfig): gcp.redis.Instance {
    return new gcp.redis.Instance(config.name, {
        name: config.name,
        displayName: config.name,
        memorySizeGb: config.memory_gb,
        authEnabled: true,
        tier: config.tier,
        maintenancePolicy: {
            weeklyMaintenanceWindows: [{
                day: config.maintenance_window_day,
                startTime: {
                    hours: config.maintenance_start_hours,
                    minutes: config.maintenance_start_minutes
                }
            }]
        },
        redisConfigs: getRedisConfig(config),
        redisVersion: config.redis_version
    });
}

We define a gcp.redis.Instance and fill in its arguments from a configuration structure that can vary between different clusters while keeping some sensible and secure defaults, even enforcing security policies through code with fixed parameters like authEnabled.
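The RedisConfig structure and the getRedisConfig helper used above are defined elsewhere in our codebase. As a rough sketch of what they might look like (field names inferred from the stack YAML; the mapping to Redis parameters such as maxmemory-policy is our assumption):

```typescript
// Hypothetical sketch of the configuration structure consumed by createRedis.
// Field names mirror the keys used in our stack YAML files.
interface RedisConfig {
    name: string;
    memory_gb: number;
    tier: string;
    redis_version: string;
    maintenance_window_day: string;
    maintenance_start_hours: number;
    maintenance_start_minutes: number;
    active_defrag?: boolean;
    max_memory_policy?: string;
}

// Hypothetical helper: maps the optional config fields to the key/value
// pairs accepted by the redisConfigs argument of gcp.redis.Instance.
function getRedisConfig(config: RedisConfig): Record<string, string> {
    const redisConfigs: Record<string, string> = {};
    if (config.active_defrag !== undefined) {
        redisConfigs["activedefrag"] = config.active_defrag ? "yes" : "no";
    }
    if (config.max_memory_policy !== undefined) {
        redisConfigs["maxmemory-policy"] = config.max_memory_policy;
    }
    return redisConfigs;
}
```

Centralizing the mapping in a typed helper keeps per-cluster differences in YAML while the code controls which Redis parameters can actually be set.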

The configuration structure is filled from a simple YAML configuration and, since we use TypeScript, it is type-checked, as the following Pulumi.staging.yaml shows:

ampledash-infra:redis:  
  - name: redis-sidekiq
    active_defrag: true
    maintenance_start_hours: 8
    maintenance_start_minutes: 0
    maintenance_window_day: SUNDAY
    max_memory_policy: noeviction
    memory_gb: 2
    redis_version: REDIS_6_X
    tier: BASIC

Stacks as bags of configuration

Pulumi allows configuration options to vary per environment through a structure they call stacks. These are logically segregated sets of configuration that can be applied independently using Pulumi CLI commands. In our case, since we have staging and production environments, we created two stacks: Pulumi.staging.yaml and Pulumi.production.yaml. The production stack was bootstrapped by copying the staging configuration:

pulumi stack init production --copy-config-from staging

Once we configure Pulumi’s credentials (using a PULUMI_ACCESS_TOKEN env var) and log into our cloud provider in the shell, we can preview the creation of our cluster with pulumi preview and, if everything checks out, we can make it real with pulumi up.

After a few seconds, your managed cloud infrastructure just got a new resource. Go check your GCP Console and it should be there.

Enabling an unsupported beta feature

By moving to a cloud-managed Redis service and having its lifecycle managed with Pulumi, we had already achieved many of the properties that led us to this journey, but one in particular was still a challenge: a highly available Redis setup implies a real-time replica always running alongside the main server, but if, for some unlikely reason, we were to lose both at the same time, it could mean data loss, and we don't want that. Our previous setup, by relying on disk persistence (which was automatically backed up at the filesystem level), always gave us a way to recover to a previous point in time.

It happens that Google's managed Redis service supports a feature that gives us just that: RDB snapshots (https://cloud.google.com/memorystore/docs/redis/rdb-snapshots).

However, at the time, this was still a beta feature from Google’s service, and no native Pulumi support existed to enable it. Given the fact that a Pulumi program is really just a Node app, we could theoretically just shell out to the corresponding gcloud command using the child_process package:

const cp = require('child_process');
const pulumi = require('@pulumi/pulumi');

// Hack to enable the persistence config for RDB snapshots
// (`config` is the RedisConfig structure from the enclosing scope)
function updatePersistenceConfig(instance) {
    const region = new pulumi.Config("gcp").require("region");
    const command = `gcloud beta redis instances update ${instance} --quiet --region ${region} --persistence-mode=${config.persistence_mode} --rdb-snapshot-period=${config.snapshot_period}`;

    if (pulumi.runtime.isDryRun()) {
        // During `pulumi preview`, just log the command that would run
        pulumi.log.info(command);
    } else {
        cp.execSync(command);
    }
}

redis.id.apply(updatePersistenceConfig);

This code does a few things. Let’s review them step by step:

  1. First, we declare the dependency on child_process in order to use execSync to shell out to a command.
  2. We prepare the command given an instance id and interpolate the specific persistence mode options from the configuration structure.
  3. We then go ahead and apply it only when this is not a Pulumi dry run (which is what happens when pulumi preview is run).
  4. The last line is a consequence of the async nature of Pulumi resource management, where it optimizes the runtime to do as many operations in parallel as possible. Thus, we need to create a causal relationship to only run this code once the declared redis instance is actually created and has an id output already calculated.
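Step 2 is just string construction; as a side note, it can be factored into a pure helper (hypothetical name; the real code keeps it inline) that is trivial to unit-test without touching the cloud:

```typescript
// Hypothetical helper: builds the gcloud command that enables RDB
// snapshots for a given instance. Pure string construction, so it can
// be tested in isolation from Pulumi and GCP.
function buildPersistenceCommand(
    instance: string,
    region: string,
    persistenceMode: string,   // e.g. "RDB"
    snapshotPeriod: string     // e.g. "TWENTY_FOUR_HOURS"
): string {
    return `gcloud beta redis instances update ${instance} --quiet ` +
        `--region ${region} --persistence-mode=${persistenceMode} ` +
        `--rdb-snapshot-period=${snapshotPeriod}`;
}
```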

This worked and we were able to abstract our Redis cluster creations with Pulumi, leading to a successful rollout of four new clusters in production.

Be a good Pulumi citizen with the local.Command package

Even if this worked, the previous code had a slight nuisance: every time we ran pulumi up to create or update any resource, we would re-apply the RDB snapshot setting, even when Pulumi's minimal diffing meant that no Redis cluster had to change. This was not a big problem given the command's idempotent nature, but running it for every Redis cluster on every execution caused slower runs and unnecessary noise in our cloud audit logs.

As our understanding of Pulumi increased, we noticed that the local.Command package they offer has built-in lifecycle control for resources created from shell commands. This was exactly what we wanted: our RDB setting is effectively a sub-resource of the Redis resource and follows the same lifecycle. We set it on cluster creation and don't need to change it afterwards, so running the command in tandem with the actual Pulumi-managed cluster was what we really wanted.

This was fairly straightforward to pull off with the following code:

if (config.persistence_enabled) {  
    new local.Command(`${config.name}-rdb`, {
        create: pulumi.interpolate`gcloud beta redis instances update ${instance.id} --quiet --region ${stackDefaultRegion} --persistence-mode=${google.redis.v1.PersistenceConfigPersistenceMode.Rdb} --rdb-snapshot-period=${config.snapshot_period} 2>&1`,
        delete: pulumi.interpolate`gcloud beta redis instances update ${instance.id} --quiet --region ${stackDefaultRegion} --persistence-mode=${google.redis.v1.PersistenceConfigPersistenceMode.Disabled} 2>&1`
    }, { parent: instance });
}

Voilà! Pulumi was now happy to manage our Redis beta feature for us!

Final Thoughts

From a PoC in one week to reaching production with confidence the week after, we're super impressed by Pulumi's focus on developer experience and its general ease of use. We have since expanded our usage of Pulumi to manage our infrastructure and have a GitHub Actions-powered workflow with automatic PR previews that is a breeze to use. We will share more insights in follow-up blog posts, as the experience has been quite positive.

It also allowed us to have all our infrastructure configurations in a single place and become systematic about what and how to monitor these resources in production.

This approach allowed our team to become much less reactive to failures in production and move to a more proactive cloud management practice, where we:

  1. Get alerted of approaching limits in the infrastructure
  2. Easily test upgrades
  3. Securely roll out configuration changes
  4. Quickly scale up instances
  5. Reuse building blocks when adding new infrastructure

And by doing so, we can focus on adding new features to our product and keep empowering Sales teams across the globe.

Note: We're still hiring an Infrastructure Engineer, so if this article is your type of thing, please do reach out. It may so happen that you're the right person to help us improve and build on top of this setup.

Tiago Sousa

A passionate technologist working from Lisbon in multiple technical leadership roles across growing startups. Designing developer-friendly APIs, scalable systems and observable deployment pipelines.

Lisbon, Portugal
