What are the official standards in incident management?
ITIL is the best-known standard within service management. Incident management is a process within ITIL, which first defines what an incident is and then collects best practices gathered through feedback from many companies. I used to have concerns about ITIL and its strictly defined procedures for reacting to incidents, but that has changed since it was updated to its fourth version. I think strict procedures can significantly limit the awareness of the people who need to react to incidents, and they often prescribe a way to react even when that way does not serve the purpose. In some situations, people should have the freedom to make a good decision based on what is actually happening, because procedures are written for certain kinds of reactions and cannot be applied to every case.
It is also good to have defined roles in incident management. One important role is the Incident Commander. This is the person who has the executive authority to manage an incident. They make decisions based on the proposed solutions, monitor the troubleshooting process, and coordinate everything that is happening during an incident. Ideally, this should be a technical person, but quite often it is a service manager or project manager. It is very important that this person understands the broader context and the processes of the technological domain in which they are appointed incident manager.
When it comes to the process itself, I could easily draw parallels between Incident and Project Management. An incident is like a small project; it begins when a disruption of service occurs, and it ends when the service returns to normal operations.
As for reviewing events after an incident, my practice has been to do a complete review of every high-severity incident. This means we summarize how the incident occurred, why it occurred, what the resolution flow was, whether the right people were involved at the right point, and what we can do differently and better next time. The term for that is an incident post mortem.
After we have detailed the timeline, we can move on to how the incident resolution was supposed to flow in an ideal scenario. We discuss who was supposed to be included at what point, what could have been done differently, and, most importantly, Root Cause Analysis. When we say Root Cause Analysis, we don’t refer to just problems. I don’t appreciate Problem Management much anymore. I realize its purpose and significance when it comes to regulatory bodies and, more generally, reporting on risks at higher levels of financial institutions. But as such it has no value in IT, because it cannot be said that an incident or any service disruption has a single root cause. Usually, it’s a set of circumstances that leads to a situation, and in order to fix it and avoid a similar situation in the future, several actions must be taken; it’s never just one. We determine what actions need to be taken so that the problem doesn’t repeat itself, and we prioritize them at the end of the post mortem, when we appoint owners to all of those actions and track their resolution on a timeline.
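As a rough illustration of the post-mortem outcome described above, here is a minimal Python sketch of a record that holds several contributing factors rather than a single root cause, plus prioritized actions with owners. The class names, fields, priority convention (1 is most urgent), and the sample incident are all assumptions made for this example; they do not come from ITIL or from any specific tool.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class ActionItem:
    description: str
    owner: str          # every action gets an owner at the end of the post mortem
    priority: int       # assumed convention: 1 = most urgent
    done: bool = False


@dataclass
class PostMortem:
    title: str
    started_at: datetime    # when the service disruption began
    resolved_at: datetime   # when the service returned to normal operation
    # Several contributing factors instead of one "root cause".
    contributing_factors: List[str] = field(default_factory=list)
    actions: List[ActionItem] = field(default_factory=list)

    def duration_minutes(self) -> float:
        """Length of the incident in minutes."""
        return (self.resolved_at - self.started_at).total_seconds() / 60

    def open_actions(self) -> List[ActionItem]:
        """Unfinished actions, most urgent first, for follow-up tracking."""
        return sorted(
            (a for a in self.actions if not a.done),
            key=lambda a: a.priority,
        )


# Hypothetical example data, for illustration only.
pm = PostMortem(
    title="Checkout service outage (hypothetical)",
    started_at=datetime(2024, 1, 10, 9, 0),
    resolved_at=datetime(2024, 1, 10, 10, 30),
    contributing_factors=[
        "Configuration change deployed without review",
        "No alert on connection pool saturation",
    ],
    actions=[
        ActionItem("Add alert on connection pool saturation", "alice", 2),
        ActionItem("Document rollback procedure", "bob", 1),
        ActionItem("Review deploy checklist", "carol", 3, done=True),
    ],
)

for action in pm.open_actions():
    print(f"P{action.priority} {action.owner}: {action.description}")
```

Keeping the contributing factors as a list and the actions as separately owned items mirrors the point that an incident rarely has one cause and is never fixed by one action.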