Author Archive: Paco

How to get things done after a post-mortem

June 28, 2023
Software diagnosis

Everyone working on products that serve customers has written or at least seen a post-mortem. In the aftermath of any production incident, a post-mortem analysis plays a crucial role in identifying what went wrong, what went right, and, most importantly, how to learn from the experience. A production incident can take many forms: a system failure, an outage, a major bug, or even running a script that was supposed to use staging database credentials but ended up messing with the production databases.

Regardless of what led you there, conducting a post-mortem is just the beginning. To drive meaningful change and continuous improvement, it is vital to establish action points that can actually be implemented. Sometimes these action points are small changes that only require a few hours of work. Here the impact vs. effort trade-off favors giving the change urgency and priority. But what happens when an action point involves a large and complex project? These are the kinds of changes that need weeks (maybe even months) of work outside of an already busy product roadmap. While the impact of the changes is high, the effort is also high.

Everything starts well: a refinement is scheduled, the project is split into multiple isolated parts, and everyone is enthusiastic about fixing it so we do not have to deal with the same mess again. But then people go back to their other work, and slowly the project fades into oblivion. Most of the time, a workaround can be applied if the production incident happens again, which reduces the business impact and, with it, the urgency of the real fix.

Does this situation sound familiar? I will share an approach that I have seen applied and that could work for you. It is not a magic formula, but it is what has worked for me in the past.

Labeling action points

Every action point in a post-mortem receives one of these two labels: 

Operational actions

These are all the small changes, the quick fixes that solve the problem. They are part of the responsibilities of the engineer (or team) who is on-call, so ownership lies with the on-call engineers.

Strategic actions

These are the long-term, significant, system-level changes. Ownership of these needs to be transferred to someone in the engineering organization who has the mandate (and skills) to make them part of the product strategy. It is tempting to agree to dedicate a few hours here and there and make it happen as a “side” project inside a team. In most organizations, that will not stand the test of time.
A Staff Engineer, an Engineering Manager, or someone in a similar role has to put in the time to translate the action points into business strategy.

What is the difference? It is the difference between having to answer:

“Why does this project need to have high priority?”

And being able to state:

“Completing this project is aligned with our strategic business goals, and not addressing it will slow them down.”
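
If your team tracks action points in a tool or a script, the two labels and their owners can be sketched in a few lines of Python. This is only an illustration: the names (ActionPoint, label_action_point, default_owner) and the 16-hour cut-off are assumptions made up for this example. In practice the label is a judgment call about scope, not a pure effort threshold.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    OPERATIONAL = "operational"
    STRATEGIC = "strategic"


@dataclass
class ActionPoint:
    title: str
    estimated_hours: float  # rough effort estimate, e.g. from the refinement


# Hypothetical cut-off: roughly two working days. Pick whatever fits your team.
OPERATIONAL_MAX_HOURS = 16


def label_action_point(point: ActionPoint) -> Label:
    """Small, quick fixes are operational; everything larger is strategic."""
    if point.estimated_hours <= OPERATIONAL_MAX_HOURS:
        return Label.OPERATIONAL
    return Label.STRATEGIC


def default_owner(label: Label) -> str:
    """Map a label to the kind of owner described in this post."""
    if label is Label.OPERATIONAL:
        return "on-call engineer or team"
    return "Staff Engineer or Engineering Manager"


if __name__ == "__main__":
    points = [
        ActionPoint("Make the script refuse production credentials", 4),
        ActionPoint("Separate staging and production database access", 240),
    ]
    for point in points:
        label = label_action_point(point)
        print(f"{point.title}: {label.value}, owner: {default_owner(label)}")
```

The useful part is not the threshold but the fact that every action point leaves the post-mortem with exactly one label and one kind of owner.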


What happens when a single action point is a large change?

This needs immediate action, and labeling it as strategic will not help. In this case, I would suggest translating the action point into a risk: “If we do not fix this, we will lose 30% of our customer transactions.”