Usually, the first approach to a cloud environment is to spin up EC2 instances, virtual machines, MQ services, and so on by hand. This involves not only manual interaction with the cloud provider, but also manual configuration of those services.

When the business starts growing and the number of instances and services handled in the cloud also increases, it is common for configuration management tools like Ansible or Puppet to join the scene. This marks a cool moment as automation is included in infrastructure management, and configuration of services is now replicable. However, it is a potentially risky path to walk.

Configuration drift as cardboard boxes

What happens next

At this moment, infrastructure is "easily" created and configured. The business needs grow—new clients, new features, new requirements. The technology grows with it, likely in an uncontrolled way.

After three or four years, the team finds itself with fifty or sixty compute instances, each one running a different OS version depending on what the playbook specified at the time it was created. Each one runs different software, each one was configured by a different playbook, and each one carries several manual changes that were "too small to worry about." All of this information is untracked and unversioned.

But is there really no problem? Because we can run new playbooks whenever we want to update whatever we need, right? Maybe for some time, but not forever. Let's explore the potential consequences.

After all this time

As time passes, applying configuration changes becomes increasingly risky. There is no clear understanding of the state of each piece of infrastructure, so the actual configuration diverges from the intended state and "configuration drift" appears.
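To make the idea of drift concrete, here is a minimal sketch in Python that compares an intended configuration with the configuration actually found on a machine. Everything in it is hypothetical: the `detect_drift` function, the `MISSING` marker, and the setting names are illustrative assumptions, not part of Ansible, Puppet, or any other tool.

```python
# Marker used when a setting exists on one side only (hypothetical convention).
MISSING = "<missing>"


def detect_drift(intended: dict, actual: dict) -> dict:
    """Return {setting: (intended_value, actual_value)} for every difference."""
    drift = {}
    # Look at every setting known to either side, in a stable order.
    for key in sorted(intended.keys() | actual.keys()):
        want = intended.get(key, MISSING)
        have = actual.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


# What the playbook says the server should look like (made-up values).
intended = {"os": "ubuntu-22.04", "nginx": "1.24", "max_open_files": 65536}

# What is really on the server after years of "small" manual changes.
actual = {
    "os": "ubuntu-22.04",
    "nginx": "1.18",            # never upgraded on this host
    "max_open_files": 65536,
    "cron_cleanup": "enabled",  # added by hand, tracked nowhere
}

for setting, (want, have) in detect_drift(intended, actual).items():
    print(f"{setting}: intended={want} actual={have}")
```

In this toy example the report contains two entries: an outdated `nginx` version and a `cron_cleanup` setting that no playbook knows about. The second kind is the dangerous one, because nothing in the repository hints that it exists.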

The risk escalates further when changes are made in the production environment. Sooner or later, applying a change leaves the system in an unstable state because of the "small" manual changes made earlier, potentially causing scripts to fail.

The result of this situation includes:

Downtime: Unstable infrastructure resulting from manual changes and configuration drift increases the likelihood of system failures and downtime.
Loss of Control: Without a clear understanding of the state of infrastructure, it becomes difficult to maintain control and enforce consistent policies and standards.
Fear of applying changes: Applying changes becomes something the team tries to avoid.
Security Vulnerabilities: Inconsistent configuration and infrequent updates can introduce security vulnerabilities into the infrastructure.
Reduced Efficiency: Teams may spend more time troubleshooting issues and performing manual tasks.

Many questions arise in this situation: Is the current configuration acceptable? Can the scripts be trusted? Can the current state be replicated in a reasonable amount of time?

These are just a few of the consequences, but let's focus now on a possible solution.

Overcoming challenges

While the challenges described may seem intimidating, they also present an opportunity for improvement. GitOps and Infrastructure as Code (IaC) offer a powerful solution to address the issues of configuration management and infrastructure drift.

GitOps extends the principles of DevOps to infrastructure management, using Git as the single source of truth for defining and managing infrastructure configurations.
IaC complements GitOps by enabling infrastructure to be defined and managed using code. Infrastructure resources are described in a declarative way using configuration files, which can be versioned, tested, and automated like any other software artifact.

This shift in the way we approach infrastructure management solves many of the issues we had before.

The configuration is versioned, so everybody can find out how the system is set up.
The infrastructure is immutable. When the code changes, new elements are created to replace the old ones, which are destroyed rather than modified in place.
We can plan automatic updates of software with CI/CD processes and test them.
The state of the infrastructure is deterministic. We can replicate it on demand and audit it through the source code.
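The points above can be illustrated with a small Python sketch of a "plan" step for immutable infrastructure. It is a toy model under stated assumptions, not how Terraform or any other tool works internally: the `fingerprint` and `plan` functions, the resource names, and the image names are all hypothetical. The idea is that each running resource remembers a fingerprint of the spec it was built from, and any resource whose spec changed in the code is replaced instead of patched.

```python
import hashlib
import json


def fingerprint(spec: dict) -> str:
    """Deterministic short hash of a resource spec (key order does not matter)."""
    canonical = json.dumps(spec, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def plan(desired: dict, running: dict) -> dict:
    """Compare the specs in code with what is running.

    desired: resource name -> spec taken from the versioned code
    running: resource name -> fingerprint of the spec it was built from
    """
    create, destroy = [], []
    for name, spec in desired.items():
        if running.get(name) != fingerprint(spec):
            create.append(name)       # new or changed: build a fresh one
            if name in running:
                destroy.append(name)  # changed: the old one is thrown away
    for name in running:
        if name not in desired:
            destroy.append(name)      # no longer described in code
    return {"create": sorted(create), "destroy": sorted(destroy)}


# Specs as they appear in the repository (made-up names).
desired = {
    "web": {"image": "web-v2", "size": "small"},
    "queue": {"image": "mq-v1", "size": "small"},
}

# What is currently running, built from earlier versions of the code.
running = {
    "web": fingerprint({"image": "web-v1", "size": "small"}),
    "queue": fingerprint({"image": "mq-v1", "size": "small"}),
    "legacy": fingerprint({"image": "old", "size": "large"}),
}

print(plan(desired, running))
```

Because the plan depends only on the code and the recorded fingerprints, running it twice against the same inputs gives the same answer, and once the running state matches the code the plan is empty.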

Finally, I would like to recommend a couple of tools that enable developers to define infrastructure as code and perform configuration ahead of time whenever possible:

Terragrunt: Simplifies the management of Terraform by adding a set of features and abstractions that will allow us to keep the code DRY.
Packer: Automates the creation of machine images, allowing developers to create reproducible machines. It synergizes well with configuration management tools like Ansible.

While these tools won't solve all of our problems and may introduce new challenges, when used together with other tools and practices, such as CI/CD, they enable us to keep our increasingly complex infrastructure under control. They also lay the foundation for building our technology in a collaborative way.

Mutability, configuration drift and other undesirable effects of manually configuring your infrastructure
