Taming Chaos: A Step-by-Step Guide to Problem-Solving

Apr 12, 2025

Intro

💡

I want to clearly outline the focus of this article so you can avoid wasting your time on content that may not be relevant to you. My goal is to demonstrate how an IT professional or project manager can tackle issues when there is no clear initial solution and when implementing that solution requires changes across various related or unrelated systems. While I will concentrate on IT problems, the methodology I discuss is quite generic and can be applied in other professions as well.

Complex problems often require complex solutions. Finding the right solution is often like searching for a path to a destination (the desired state of the system) without a map. Even once the path is found, it can be winding and unclear.

Before beginning any task, it's essential to address three key questions that are relevant in many aspects of life: Why, What, and How (Golden Circle).

These questions are vital as they help clarify the problem, its context, and the desired outcome, ultimately guiding your problem-solving process.

💡

To demonstrate an approach for solving complex problems, I will use a task I recently worked on: providing a simple way to migrate an application's authentication mechanism from IAM users to IAM roles in AWS EKS.

Why? Defining the objective.

Easy one, isn't it?

Your team needs to achieve a specific goal or resolve a particular problem. While you may already understand the issue, it’s crucial to document it to ensure a shared understanding among team members.

Why do we need to do this?
What is the expected outcome of this job?

By addressing both of these points in writing, you can eliminate any uncertainties from the outset.

⚠️

Before diving into the task, make sure you fully understand all the requirements for a successful outcome.

Why do we need to do this?

The answer should explain the importance of completing this task. Nobody enjoys doing unnecessary work!

ℹ️

Example
We need to migrate all applications that use IAM users to IAM roles to improve our AWS security posture. IAM roles improve security by providing short-lived credentials that are refreshed automatically. In the past, security incidents related to IAM user access keys forced the rotation of 100+ credentials, which threatened system stability.

It is crucial for the entire team to understand why they're doing this, as it can clarify how the solution should be structured and what the definition of done looks like.

Remember to get the requirements from the requestor/stakeholder!

What is the expected outcome?

The so-called definition of done (DoD).

A DoD is a set of criteria that a product increment must meet for the team to consider it complete and ready for customers. It is a shared understanding among the team members of when a product increment is ready for release, even when the increment is large and consists of many items. By clearly defining what “done” means to the project, an Agile team can focus on delivering value with every sprint and minimizing rework.
~ Atlassian

ℹ️

Example
- The application deployment model should allow selecting IAM roles as the AWS authentication mechanism
- All infrastructure applications should use dedicated IAM roles

It's important to note that the definition of done should only include statements related to the team's implementation of the solution.

For example, if the application team is responsible for switching the application to use IAM roles (because they need to verify AWS SDK compatibility first), we cannot include the statement "all applications use IAM roles" in the definition of done. This is because we do not have control over that aspect.

How? Solving the problem.

We understand why, we understand what, and now it’s time to discover how, as the best solution may not be clear from the beginning.

💡

Anecdote
My colleague was quite irritated because he had to change his approach multiple times due to the various solutions we suggested along the way. Each of us had a different understanding of the problem and the desired approach. Thinking through the problem at the start can save you from pain down the road.

Complex problems should not be addressed based on the first idea that comes to your mind.

So, how do you start? Write down all feasible solutions and their advantages, disadvantages, and implications.

Sometimes, it can be challenging to come up with solution ideas on your own; that's why having the team's support is important.

Organize a dedicated brainstorming session to gather ideas and ensure that the entire team has a common understanding of the problem. Write everything down.

Six Thinking Hats

written by Dr. Edward de Bono
(thanks Alicja for this one!)

"Six Thinking Hats" and the associated idea of parallel thinking provide a means for groups to plan thinking processes in a detailed and cohesive way, and in doing so to think together more effectively.
~ source

This method can be helpful during brainstorming sessions to challenge even the most promising approaches and detect issues early.

In discussions, a "hat" can be given to a participant to help overcome mental barriers about specific solutions or to consider alternative perspectives.

Allow some time for everyone to think it through and investigate the feasibility of the ideas. Then, schedule another meeting to make the final decision.

⚠️

It is crucial to dedicate 100% of your attention during this session. Please set aside your work and focus on understanding the problem. Even if you don’t see a feasible solution, asking questions can support others. Consider inquiries like: "Can we do that?" "Why can't we do this?" or "Why does it work like that?" and so on.

Imagine that tomorrow you will take over that task. Would you know how to proceed with it? Even if you're not responsible for managing it, you might still implement a part of it.

The answers to why, what, and how should be written in an ADR.

ADR (Architecture Decision Record)

An ADR is a high-level document that should answer questions regarding a significant architectural change, such as:

Why is this change needed?
Why was this solution chosen?
What does the (architecture) change look like, and how will the system behavior change?
Were there any alternative solutions? Why were they rejected?
What are the positive/negative consequences of introducing that change?
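To make the checklist above harder to skip, it can be turned into a tiny template. This is only a minimal sketch: the `AdrDraft` class and its field names are my own illustration, not a standard ADR format, and the sample answers are paraphrased from the examples in this article.

```python
from dataclasses import dataclass

# The questions an ADR should answer, keyed by hypothetical field names.
QUESTIONS = {
    "why_needed": "Why is this change needed?",
    "why_chosen": "Why was this solution chosen?",
    "what_changes": "What does the change look like, and how will the system behavior change?",
    "alternatives": "Were there any alternative solutions? Why were they rejected?",
    "consequences": "What are the positive/negative consequences of introducing that change?",
}


@dataclass
class AdrDraft:
    title: str
    why_needed: str = ""
    why_chosen: str = ""
    what_changes: str = ""
    alternatives: str = ""
    consequences: str = ""

    def unanswered(self) -> list[str]:
        """Return the questions whose answer is still empty."""
        return [q for name, q in QUESTIONS.items() if not getattr(self, name).strip()]

    def render(self) -> str:
        """Render the draft as plain text, one question per section."""
        lines = [self.title, ""]
        for name, question in QUESTIONS.items():
            answer = getattr(self, name).strip() or "TODO"
            lines += [question, answer, ""]
        return "\n".join(lines).rstrip() + "\n"


draft = AdrDraft(
    title="Migrate application authentication from IAM users to IAM roles",
    why_needed="Short-lived, automatically refreshed credentials improve our AWS security posture.",
    why_chosen="The Validating Admission Webhook solves most of the problems without extra complexity.",
)

# Print the questions that still need an answer before the ADR is ready for review.
for question in draft.unanswered():
    print("Missing:", question)
```

An ADR that still prints "Missing" lines is probably not ready for the decision meeting.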

Good examples of ADRs:

To learn more about ADRs, you can read:

ℹ️

Example decision

Based on the voting results, we will proceed with the Validating Admission Webhook. It solves most of the problems and doesn't introduce extra complexity in the system.
As this is a relatively new area of work for us, it is essential to highlight any unexpected behavior or errors that will impact the current deployment workflow. If a solution does not meet our expectations, we should quickly switch to another solution.

We have written an ADR and selected a feasible solution; can we start working now?
Not so fast! We're almost there, I promise.

Plan your work

Now is the time to apply the well-known strategy often attributed to Caesar: divide and conquer. Outline all necessary steps and changes required to achieve the final objective. Additionally, include any extra tasks that could be completed if time permits, such as refactoring or adding observability features.

ℹ️

Example
Create EKS Pod Identity Webhook to create Pod Identity Associations

That objective can be further broken down into smaller pieces:
- Write the project documentation
- Implement skeleton
- Implement main functionality
- Add OpenTelemetry traces
- Add Prometheus metrics
- Extend "application_user" Terraform module to create IAM role
- Update AWS SDK in infrastructure Go applications
- Create migration process monitoring

This should allow us to estimate the time required to implement those tasks and to decide where priorities should be placed. Splitting objectives into multiple smaller tasks also makes it easier to assign them to various team members if required, because we can see which tasks can be done in parallel and which must be implemented in a particular order.
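Finding out which tasks can run in parallel and which must wait is a dependency-ordering problem. Below is a minimal sketch using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+). The task names come from the example; the dependencies between them are assumptions I made up for illustration, as is the `parallel_waves` helper.

```python
from graphlib import TopologicalSorter

# Task -> tasks that must be finished first.
# The dependencies are hypothetical; adjust them to your own plan.
DEPENDS_ON = {
    "Write the project documentation": set(),
    "Implement skeleton": set(),
    "Implement main functionality": {"Implement skeleton"},
    "Add OpenTelemetry traces": {"Implement main functionality"},
    "Add Prometheus metrics": {"Implement main functionality"},
    'Extend "application_user" Terraform module to create IAM role': set(),
    "Update AWS SDK in infrastructure Go applications": set(),
    "Create migration process monitoring": {"Add Prometheus metrics"},
}


def parallel_waves(depends_on):
    """Group tasks into waves; tasks within one wave can be done in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything unblocked right now
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unblock the next one
    return waves


waves = parallel_waves(DEPENDS_ON)
for number, wave in enumerate(waves, start=1):
    print(f"Wave {number}: {', '.join(wave)}")
```

Each wave can be split among team members, while the order of the waves shows what has to be implemented first.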

MoSCoW Method

Reference: https://en.wikipedia.org/wiki/MoSCoW_method

All requirements are important, however to deliver the greatest and most immediate business benefits early the requirements must be prioritized. Developers will initially try to deliver all the Must have, Should have and Could have requirements but the Should and Could requirements will be the first to be removed if the delivery timescale looks threatened.

This method can be used to prioritize tasks using the following labels:

must have (required by the definition of done)
should have
could have
won't have
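As a rough illustration of how these labels drive scheduling, here is a minimal sketch. The `Task` class, the `plan` function, the labels assigned to each task, and the day estimates are all hypothetical. It assumes that must-have items are always scheduled, while should-have and could-have items are dropped once the remaining capacity runs out.

```python
from dataclasses import dataclass

# Lower rank = higher priority. "won't have" items are recorded but never scheduled.
RANK = {"must have": 0, "should have": 1, "could have": 2, "won't have": 3}


@dataclass
class Task:
    name: str
    label: str
    estimate_days: int  # hypothetical estimate


def plan(tasks, capacity_days):
    """Schedule by priority; drop should/could items that no longer fit."""
    scheduled, dropped = [], []
    remaining = capacity_days
    for task in sorted(tasks, key=lambda t: RANK[t.label]):
        if task.label == "won't have":
            dropped.append(task.name)
        elif task.label == "must have" or task.estimate_days <= remaining:
            scheduled.append(task.name)
            remaining -= task.estimate_days
        else:
            dropped.append(task.name)
    return scheduled, dropped


tasks = [
    Task("Add OpenTelemetry traces", "could have", 3),
    Task("Implement main functionality", "must have", 5),
    Task("Add Prometheus metrics", "should have", 2),
    Task("Implement skeleton", "must have", 2),
]

scheduled, dropped = plan(tasks, capacity_days=10)
print("Scheduled:", scheduled)
print("Dropped:", dropped)
```

With ten days of capacity, both must-have tasks and the should-have metrics fit, and the could-have traces are the first to go.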

Ready?

Let's get to work!

Krzysztof Wiatrzyk

Big love for Kubernetes and the entire Cloud Native Computing Foundation. DevOps, biker, hiker, dog lover, guitar player, and lazy gamer.
