The other day I was talking with an IT Operations engineer who works at a large SaaS company in Silicon Valley. We were supposed to meet for lunch over the weekend, but he had abruptly cancelled. He called later to apologize: it turned out they had been in an urgent firefight at work. Their SaaS service had gone down, and the entire team, including senior managers, had spent the weekend trying to figure out where the problem was and how to solve it. Multiple teams had to be involved: Site Reliability Engineers, Network Administrators, Database Administrators and other subject matter experts (SMEs).

With 25+ people working together it was chaotic, to say the least. Different people were exploring different strategies and trying out various things, and it was nearly impossible to get an accurate picture of where the investigation stood. New SMEs had to be brought in, but that was not easy, because someone had to explain the current status to them: what had already been seen and done, and what remained to be tried. But did anyone really know the exact status of this distributed investigation, with so many parallel threads and strategies?

The truth is, such complicated and stressful firefights are quite common in the lives of IT Ops engineers. They are charged with maintaining the health and performance of services and applications that keep growing in complexity and demand. Architectures and applications have evolved to handle millions of users and thousands of transactions per hour with very high reliability and 24×7 availability. It is not surprising that maintaining such complex systems is a difficult job.

Once a major incident or problem occurs, the investigation involves humans who need to come to a shared understanding of what is going on and how they should try to fix it. Problems are typically complex and cross-functional, may involve many different modules or services, and often result in high-pressure ‘war rooms’. Teams do use various tools such as call bridges, chats, emails and issue tracking systems to collaborate and coordinate actions. However, such collaboration can easily become fragmented, and manually searching for information can take up a lot of precious time. Larger teams actually make shared understanding a bigger challenge, as too many text-based emails and notes can be a significant burden to write, read and understand. The more the merrier? Not always!

So, what can be done to solve these difficulties and inefficiencies in the incident resolution process? Now that AI and Machine Learning are maturing as technologies, how can we apply them to make the lives of IT Operations teams easier and more productive? At smartQED we are passionately working on exactly these issues. We are enabling teams to do efficient and methodical Cause Analysis using ‘smart’ visual tools that are collaborative and self-learning: they capture user actions and learn continuously from them. Solution and action recommendations are generated from previously solved problems, with the smartQED system acting as a behind-the-scenes ‘virtual expert’ that helps speed up incident resolution.

smartQED’s product is currently being piloted by a multinational telecom company that is using the technology to help streamline its IT operations.