IT Operations people are generally a cautious bunch when it comes to making changes in their organization. “If it ain’t broke, don’t fix it” is regarded as the unwritten first rule of the profession. There is a very good reason for this caution: no one wants to be in the position of explaining to management why the website is down, especially if the reason is an upgrade or refactoring of something that was working fine, or at least in well-understood ways, beforehand.
The downside of this caution is that IT Operations as a group can be slow to recognize change and to adapt to its effects.
While IT Operations spends a lot of time on deployment, the hand-off from development teams, and automating much of the accompanying work, the bulk of the effort goes into keeping already-provisioned services running and available. Those services, and the infrastructure they run on, are all fully instrumented (or at least, they are supposed to be). The problem lies in the data that this instrumentation generates. Monitoring a handful of servers and applications is one thing, but at the scale of thousands of systems, operators find themselves drowning in an ocean of red alerts.
The Journey from ITOA to AIOps
Let’s take a closer look at analytics, or more precisely IT Operations Analytics, commonly known as ITOA. Gartner describes this category as a set of techniques used to discover complex patterns in large volumes of often “noisy” data about IT system availability and performance. The analytics techniques applied to that data tended to be fairly static and brittle, which meant that they couldn’t deal gracefully with changes in the underlying IT infrastructure.
This was not a big issue as long as operations teams could update their models fast enough to keep up with the rate of change. Over the past decade, however, the introduction of virtualisation, first of compute and then of network components, together with the emergence of self-service provisioning and cloud computing, has sharply accelerated the pace of change. There is no sign of a slowdown, as new approaches to application delivery (Agile, DevOps, Continuous Integration and Continuous Delivery) continue to become more commonplace. The problem needs to be addressed in a new way.
As the real world changes faster than the models can keep up, and operations teams are deluged with ever more events, what should IT Operations Analytics look like in a dynamic world? Clearly the analysis itself needs to be dynamic, and a new category of tools is emerging to address this need. Gartner has named this new field Algorithmic IT Operations, or AIOps for short.
AIOps sits at the convergence of monitoring, service desk, and automation. By taking all the data available from existing monitoring tools and applying algorithmic techniques to sort through and analyse it, AIOps can deliver valuable insights to Operations: fewer, higher-quality tickets on the service desk, aimed at giving early warning of developing problems rather than documenting a failure that has already occurred. These tickets can also be connected to orchestration or run-book automation tools, so that problems are resolved quickly once they are identified.
The analysis must rely on dynamic, real-time algorithms rather than static models. This avoids the time-consuming, labor-intensive work of manually updating rules and filters every time the environment changes.
In practice, AIOps can be reduced to the following four key components:
A big problem in IT Operations is too many alerts, whether the same alert repeating in one channel or similar alerts arriving across many channels at the same time. Handled badly, this can escalate into a full-blown alert storm. One major goal of AIOps is to identify those duplicates, crucially without having to define them beforehand and without discarding information that may prove useful.
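As a rough illustration of deduplication without per-alert rules, the sketch below collapses alerts whose text differs only in numeric values. The event fields (`host`, `channel`, `message`) and the digit-masking fingerprint are assumptions made for this example, not a description of how any particular AIOps product works; real tools use far more sophisticated techniques.

```python
import re
from collections import OrderedDict

def fingerprint(event):
    """Mask numbers in the message so repeats of the same alert share a key.

    Hypothetical heuristic: host plus lower-cased message with digits masked.
    """
    text = re.sub(r"\d+", "#", event["message"].lower())
    return (event["host"], text)

def deduplicate(events):
    """Group duplicate alerts, keeping a count and every original event."""
    groups = OrderedDict()
    for event in events:
        group = groups.setdefault(fingerprint(event), {"count": 0, "events": []})
        group["count"] += 1
        group["events"].append(event)  # nothing is discarded
    return groups

# Made-up sample events arriving on two different channels.
events = [
    {"host": "web01", "channel": "syslog", "message": "Disk usage at 91%"},
    {"host": "web01", "channel": "snmp", "message": "disk usage at 93%"},
    {"host": "db01", "channel": "syslog", "message": "Replication lag 12s"},
]

alerts = deduplicate(events)
for (host, text), group in alerts.items():
    print(host, text, group["count"])
```

The two disk alerts collapse into one entry with a count of two, while the original events stay attached to the group for later inspection.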
Even after the significant needles have been sifted from the monitoring haystack, effort is still wasted if they are only considered individually rather than in relation to one another. Traditional methods of determining those relationships require the configuration of the infrastructure and applications to be known and documented in advance. However, in a world where infrastructure may change or even move autonomously, on top of the existing problem of different teams making changes, a perfect model is impossible to maintain. AIOps instead uses algorithms that identify correlations automatically, based solely on the event stream itself. This avoids duplicated effort by disconnected teams and eliminates wasted time in the early phases of detecting and diagnosing incidents.
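One very simple way to correlate events using nothing but the event stream is to cluster them by proximity in time. The sketch below is only that, a minimal illustration: the `ts` field (seconds), the sample events, and the 60-second gap are all assumptions, and temporal proximity is just one of many possible correlation signals.

```python
def correlate_by_time(events, max_gap_seconds=60):
    """Cluster events whose timestamps fall within max_gap_seconds of the
    previous event. No topology or configuration model is consulted."""
    clusters = []
    for event in sorted(events, key=lambda e: e["ts"]):
        if clusters and event["ts"] - clusters[-1][-1]["ts"] <= max_gap_seconds:
            clusters[-1].append(event)   # close in time: same cluster
        else:
            clusters.append([event])     # gap too large: start a new cluster
    return clusters

# Made-up events from unrelated monitoring sources.
events = [
    {"ts": 0, "source": "switch-7", "message": "port down"},
    {"ts": 20, "source": "db01", "message": "connection timeout"},
    {"ts": 45, "source": "web01", "message": "HTTP 500 rate high"},
    {"ts": 900, "source": "backup01", "message": "job finished late"},
]

situations = correlate_by_time(events)
for situation in situations:
    print([e["source"] for e in situation])
```

Here the network, database, and web events end up in one cluster even though no model links those systems, while the unrelated backup event stands alone.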
Situation Workflow & Remediation
Detecting and diagnosing problems is only the start of the battle; IT Operations must also resolve them. With all events from the different functional areas correlated algorithmically, AIOps lets the teams involved collaborate around a shared view. That collaboration prevents unnecessary re-assignments and escalations, enables fluid communication between departments, and leads to quicker resolution of incidents.
Incident resolved, everyone can move on to the next one! But not so fast: what happens if it occurs again? The traditional approach is to review the whole incident and document it in some sort of knowledge base or FAQ system, so that everyone can refer back to it if the same type of incident recurs. The problem is that these reviews are time-consuming and not particularly exciting, so they tend to be carried out only for the most serious incidents. The knowledge gained from investigating and resolving lower-severity incidents tends to stay in people’s heads or inboxes, becoming part of the “dark matter” of the organization’s knowledge. Instead of relying on a separate knowledge base article written after the fact, AIOps proposes to capture the collaboration process as it occurs and make that knowledge available automatically if similar events recur.
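To make the idea of resurfacing past knowledge concrete, here is a minimal sketch that matches a new incident against previously resolved ones using Jaccard similarity over words. The `summary` and `resolution` fields, the sample history, and the 0.5 threshold are hypothetical, and word overlap is a deliberately naive stand-in for whatever similarity measure a real tool would use.

```python
def tokens(text):
    """Split a summary into a set of lower-cased words."""
    return set(text.lower().split())

def jaccard(a, b):
    """Jaccard similarity: shared words divided by all distinct words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def suggest_resolutions(new_summary, history, threshold=0.5):
    """Return resolutions of similar past incidents, most similar first."""
    new_tokens = tokens(new_summary)
    scored = [(jaccard(new_tokens, tokens(past["summary"])), past) for past in history]
    matches = [(score, past) for score, past in scored if score >= threshold]
    matches.sort(key=lambda m: m[0], reverse=True)
    return [past["resolution"] for _, past in matches]

# Made-up record of incidents captured while they were being resolved.
history = [
    {"summary": "db01 connection timeout after switch port down",
     "resolution": "Re-seat uplink on switch-7"},
    {"summary": "web01 disk usage high",
     "resolution": "Rotate application logs"},
]

suggestions = suggest_resolutions("db01 connection timeout switch port down", history)
print(suggestions)
```

The point of the design is that the history is populated as a by-product of resolving incidents, so even low-severity cases become searchable without anyone writing an article.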
How Does AIOps Improve IT Operations?
Doing all of the above improves a few key IT Operations metrics.
First and foremost, IT Operations will be able to detect and diagnose problems faster, ideally before end users are even aware of them, or at the very least before the impact becomes too widespread.
Beyond reducing the Mean Time To Detect (MTTD), we can also shorten the Mean Time To Resolve (MTTR), further reducing overall incident duration. Achieving this requires teams to collaborate more effectively and to avoid wasted time and effort.
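For readers who want to track these two metrics, the sketch below computes them from a list of incident records. Definitions vary between organizations; this example assumes MTTD is measured from incident start to detection and MTTR from detection to resolution, so that the two add up to the total duration. The field names and timestamps are made up.

```python
from datetime import datetime
from statistics import mean

# Made-up incident records with start, detection, and resolution times.
incidents = [
    {"started": datetime(2024, 1, 1, 10, 0),
     "detected": datetime(2024, 1, 1, 10, 10),
     "resolved": datetime(2024, 1, 1, 11, 0)},
    {"started": datetime(2024, 1, 2, 9, 0),
     "detected": datetime(2024, 1, 2, 9, 20),
     "resolved": datetime(2024, 1, 2, 9, 50)},
]

def mean_minutes(records, start_key, end_key):
    """Average elapsed minutes between two timestamps across all records."""
    return mean((r[end_key] - r[start_key]).total_seconds() / 60 for r in records)

mttd = mean_minutes(incidents, "started", "detected")   # assumed: start -> detection
mttr = mean_minutes(incidents, "detected", "resolved")  # assumed: detection -> resolution

print(f"MTTD: {mttd:.0f} min, MTTR: {mttr:.0f} min")
```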
As a result, the overall number and duration of incidents goes down and, in turn, everyone who depends on the quality of IT Operations has a much better experience.
The goal of ITOA was to let IT Operations handle IT as it was: static and relatively slow to change. AIOps gives IT Operations the tools to deal with IT as it is and will be in the future: dynamic and constantly evolving. As the business climate undergoes its own parallel transition to an ever faster rate of change, this is key to supporting the new needs of IT’s users. Algorithms are the only way to keep up with the demands that the business will continue to place on IT.