The Legend of Production Support

Cage the Beast, Save the World

Written by: Irfan Shaikh, Manager, Cyber Group Inc.

There is a legend I heard from one of my managers about a wild, ferocious animal kept in a cage with a few children playing around it. The animal is looked after by a few noble men and women. They make sure that the animal is fed regularly and treated with the utmost care. If they fail in their duties, the wild animal goes berserk, breaks out of the cage, and starts devouring the children one by one.

The story is about the real-world problem an organization faces due to production issues. The wild animal is your production applications, services, and databases; the noble men and women are the developers, DBAs, and DevOps support. The children are the organization’s new initiatives.

Let us decode this legend. The wild animal, if not fed and taken care of, goes berserk; similarly, your applications must be properly managed. The servers that host them must be kept secure, patched regularly, and upgraded when needed, and API licenses must be renewed on time.

The noble men and women (app devs, DBAs, server support, network support, and DevOps support) who care for the animal must stay focused and keep the system in check. That means having 24/7 monitoring in place and a predefined strategy to mitigate an outage, such as an on-call rotation and alert notifications to the on-call person.

The children here are the new initiatives an organization pursues to bring more value to the business. The noble men and women mentioned earlier are also the team that works on these new initiatives, and they bring more value when they are less distracted by production issues. When little effort has gone into managing and monitoring production applications and services, the team will likely spend more time on production issues and less time on new initiatives. That is when the animal goes wild and starts eating the children.

The moral of the story is to be proactive rather than reactive.

So how can you be proactive? The following are some key areas to focus on.

The Beast Is Agile – You Must Be Too

Making production support issues a sprint goal sets the sprint up to fail from the start, because nobody can anticipate how many production issues the team will face in a sprint. So, for production support, pure Scrum is a big NO.

Kanban suits production support teams better, but it is not perfect either. The best option is a mix of Scrum and Kanban, commonly called Scrumban.

24/7 Monitoring and Alerting

Which would you prefer: a customer calling to report a payment node outage, or you, the on-call engineer, getting an alert the moment the node goes down? The latter, clearly. A monitoring system that reports an outage as it happens, or warns of a possible outage caused by load or memory issues, is more valuable than a fire drill after a customer reports the problem.
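As a rough illustration of the difference between an outage alert and an early warning, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `NodeStatus` and `Alert` types, the `evaluate` and `notify_on_call` functions, and the threshold values are hypothetical, and a real team would normally rely on its monitoring product rather than hand-rolled checks.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class NodeStatus:
    """Hypothetical snapshot of one node's health metrics."""
    name: str
    is_up: bool
    memory_used_pct: float
    cpu_load_pct: float


@dataclass
class Alert:
    node: str
    severity: str  # "critical" for an outage, "warning" for a possible one
    message: str


# Assumed thresholds, chosen only for illustration.
MEMORY_WARN_PCT = 85.0
CPU_WARN_PCT = 90.0


def evaluate(status: NodeStatus) -> List[Alert]:
    """Turn one health snapshot into zero or more alerts."""
    if not status.is_up:
        # The node is already down: raise a critical alert right away.
        return [Alert(status.name, "critical", "node is down")]

    alerts: List[Alert] = []
    # The node is up, so look for early signs of a possible outage.
    if status.memory_used_pct >= MEMORY_WARN_PCT:
        alerts.append(Alert(status.name, "warning",
                            f"memory at {status.memory_used_pct:.0f}%"))
    if status.cpu_load_pct >= CPU_WARN_PCT:
        alerts.append(Alert(status.name, "warning",
                            f"CPU load at {status.cpu_load_pct:.0f}%"))
    return alerts


def notify_on_call(alerts: List[Alert],
                   send: Callable[[str], None] = print) -> None:
    """Deliver each alert through `send`, a stand-in for a pager or chat hook."""
    for alert in alerts:
        send(f"[{alert.severity.upper()}] {alert.node}: {alert.message}")
```

Calling `notify_on_call(evaluate(NodeStatus("payment-node", False, 40.0, 20.0)))` would print a critical message for the on-call person, while a node that is up but running high on memory would produce a warning before any customer notices.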

Prioritize and Permanently Fix Issues

Applying a permanent fix is the goal the team needs to adopt when fixing production issues. Temporary fixes leave the system like a car running on a spare tire. An experienced team proactively prioritizes the issues it has encountered and fixes them as a preventive measure before they occur again.
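One simple way to decide which issues deserve a permanent fix first is to count how often each root cause comes back. The sketch below assumes incidents are already tagged with a root-cause label; the `rank_recurring_issues` function and the sample labels are hypothetical and only illustrate the idea.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_recurring_issues(incident_causes: Iterable[str],
                          min_occurrences: int = 2) -> List[Tuple[str, int]]:
    """Return root causes seen at least `min_occurrences` times, most frequent first."""
    counts = Counter(incident_causes)
    recurring = [(cause, n) for cause, n in counts.items() if n >= min_occurrences]
    # Sort by count (descending), then by name so ties come out in a stable order.
    return sorted(recurring, key=lambda item: (-item[1], item[0]))


# Hypothetical incident log for one month.
incidents = ["disk full", "expired certificate", "disk full",
             "timeout", "disk full", "timeout"]
print(rank_recurring_issues(incidents))
```

With this sample log, "disk full" ranks first because it occurred three times, which makes it the strongest candidate for a permanent fix rather than another temporary one.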

Modernizing the Legacy System

A glass of aged wine tastes the best. The problem is, unlike wine, software does not get better with age. Just think about why your computer is running Windows 10 and not Windows XP: it is faster, with a more robust UI and architecture. Similarly, as a legacy system gets older, its compatibility with the new technology ecosystem decreases, and it gets less support from the vendor.

Documentation is More Precious than Gold

For a production support team, the application support document is a gold mine. Be it a long-time employee or someone who has recently joined, the wealth of information that support documents provide when handling a production issue is a treasure.

Testing Is the Most Bittersweet Labor-Intensive Fruit

Go to a fruit garden and try tasting different fruits from different trees. You will realize that a day here or there, or small changes in temperature or handling, make as much of a difference to fresh fruit as small changes do to software, except that in software it all happens under the surface, with no way of judging the quality by eye alone. This is where automated testing comes into the picture. With automated testing, you can check test results in real time and find bugs much faster. The key is to deliver the production fixes that are needed without introducing any new, unwanted issues.
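To make this concrete, here is a small sketch of a regression test written with Python's standard `unittest` module. The `parse_amount` function and the scenario (a production fix so that amounts with thousands separators are accepted) are invented for the example. The point is the shape of the tests: one pins the fix, and the others guard behavior that already worked.

```python
import unittest
from decimal import Decimal, InvalidOperation


def parse_amount(raw: str) -> Decimal:
    """Parse a payment amount typed by a user (hypothetical function under test)."""
    # Assumed production fix: tolerate surrounding spaces and "," separators.
    cleaned = raw.strip().replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        raise ValueError(f"not a valid amount: {raw!r}") from None


class ParseAmountRegressionTests(unittest.TestCase):
    def test_thousands_separator_is_accepted(self):
        # Pins the production fix so the bug cannot quietly return.
        self.assertEqual(parse_amount("1,250.50"), Decimal("1250.50"))

    def test_plain_amount_still_works(self):
        # Guards behavior that worked before the fix.
        self.assertEqual(parse_amount("19.99"), Decimal("19.99"))

    def test_surrounding_whitespace_is_ignored(self):
        self.assertEqual(parse_amount(" 42 "), Decimal("42"))

    def test_garbage_is_rejected(self):
        with self.assertRaises(ValueError):
            parse_amount("abc")
```

Saved in a test file, these can be run with `python -m unittest` on every change, so a fix that breaks existing behavior is caught before it reaches production.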

With that, I rest my case. At the end of the day, it’s about focusing on what’s important.
Being proactive benefits the organization and the team, and brings more value: problems are counteracted before they are even noticed, and when they are noticed, they are rectified immediately.

Also, it will keep the animal from the legend in the cage for a long time.