News cycles, work days, online shopping: we now expect pretty much every aspect of our lives to be on and available 24/7. If your core platform services have experienced down-time to the point where applications were impacted and users or other stakeholders noticed or were inconvenienced, you know the crushing sense of frustration this can bring.
No system is completely infallible. Down-time events will happen. When they do, what you can and must do is put strategies in place to minimise the likelihood, and mitigate the impact, of future events.
Was it only a matter of time?
Any number of specific technical or human-error issues could have led to the down-time, and of course you need to identify them. That will help you understand what happened. But in the bigger picture, the main question you need to answer is: did the original requirements adequately cater for the needs of the business?
Consider: in the early planning stage, did you:
Take into account the right number of users, and detail both the functionality required to support the various applications and the priority level of each application?
Prioritise these in order of importance and impact?
Align the goals of your technology platform with these needs and factors?
Review your original documentation against these questions and consider what might have changed in the business since the platform was set up. It may be that the planning process accurately identified the requirements at the time, but the business has since evolved and the platform hasn’t kept up. Or perhaps there was always a gap.
You may also discover that there are no documents which capture the original plans. Best to learn that now and address it immediately, creating a new and current set of requirement and planning documents.
Prepare now to avoid future events
With a clearer picture of what happened and any gaps in the initial planning process, the next step is to review how well the existing systems and processes support the business needs. In doing so, consider the following:
1. Does your design support your requirements?
Issues to cover in the design review include user numbers, critical applications and shifting business needs. If, for example, the platform was originally planned to cater for 10,000 users but the design can only support 1,000, there is an obvious gap. If your plans identify a critical application with a zero down-time tolerance, can your design deliver this? Or have you identified single points of failure that need to be fixed?
A design review may take a few hours or a few days, depending on the complexity of your system and the expertise of your platform management team. The resulting list of remediation actions can take days or months to implement, depending on the depth of the problem and on whether significant re-design is required.
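As a rough illustration, the comparison at the heart of a design review can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the `Requirement` class and `find_gaps` function are hypothetical names, and apart from the 10,000 versus 1,000 user example above, the figures are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    name: str
    required: float   # what the business plan asks for (must be > 0)
    designed: float   # what the current design can actually deliver


def find_gaps(requirements):
    """Return the requirements the design cannot meet, worst shortfall first."""
    gaps = [r for r in requirements if r.designed < r.required]
    # A lower designed/required ratio means a larger shortfall.
    return sorted(gaps, key=lambda r: r.designed / r.required)


# Hypothetical review inputs; only the user figures come from the example above.
reqs = [
    Requirement("concurrent users", required=10_000, designed=1_000),
    Requirement("redundant nodes for critical app", required=2, designed=1),
    Requirement("storage (TB)", required=5, designed=8),
]

for gap in find_gaps(reqs):
    print(f"{gap.name}: need {gap.required}, design supports {gap.designed}")
```

Listing the worst shortfall first gives you a starting order for the remediation list, although in practice business priority would also weigh in.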
2. How effective is your monitoring?
Your biggest clue in answering this question may lie in who detected the platform down-time issue in the first place. If it was detected internally, that’s a good sign. If it was detected and reported by an end-user, then you need to review and improve how your systems are monitored.
It is critical to implement proactive, as opposed to reactive, monitoring. As a rule, if monitoring is managed by a generalist support team, they will look for symptoms and so only respond once something goes wrong. A specialist, on the other hand, will be well versed in what could go wrong and will be actively monitoring for the first signs of potential trouble. Read more here about using generalists versus specialists for your enterprise platform management needs.
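To make the distinction concrete, here is a minimal Python sketch of one way proactive monitoring might look: raise a warning well before the failure point, and estimate how long you have left if the trend continues. The function names, thresholds and sample readings are all assumptions made for illustration, not a description of any particular monitoring tool.

```python
def classify(value, warn_at, fail_at):
    """Reactive monitoring only sees 'critical'; proactive adds an early 'warning'."""
    if value >= fail_at:
        return "critical"
    if value >= warn_at:
        return "warning"
    return "ok"


def hours_until(samples, limit):
    """Project hourly samples forward in a straight line.

    Returns the estimated hours until `limit` is reached,
    or None if the metric is flat or falling.
    """
    if len(samples) < 2:
        return None
    rate = (samples[-1] - samples[0]) / (len(samples) - 1)  # change per hour
    if rate <= 0:
        return None
    return max(0.0, (limit - samples[-1]) / rate)


# Hypothetical hourly disk usage readings, in percent.
disk_pct = [70, 72, 74, 76]

print(classify(disk_pct[-1], warn_at=75, fail_at=95))
print(hours_until(disk_pct, limit=95))
```

A straight-line projection is deliberately crude. The point is that the warning arrives while there is still time to act, rather than after an end-user reports the problem.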
3. Are your SLAs in place and aligned?
Your Service Level Agreements must ensure alignment between business needs, end user expectations and the sliding scale of priorities between these two elements.
Following a major down-time event, difficult conversations with your internal customers inevitably follow. But the ‘blame game’ serves no-one: the important thing is to analyse what is needed and how the needs can be met, not who failed to meet expectations or to establish them in the first place. Priorities and pragmatism are two useful things to keep in mind.
For example, it might be a priority that every platform and application is a ‘no down-time’ scenario. But how pragmatic is this? Is it overloading your system and inevitably contributing to down-time events?
Take time to review, revise and document your SLAs across different departments and applications, ensuring expectations, understanding and actions are all aligned.
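When weighing priorities against pragmatism, it can help to translate an availability target into the down-time it actually permits. The Python sketch below does that arithmetic; the function name, the 30-day period and the application tiers are assumptions chosen for illustration only.

```python
def allowed_downtime_minutes(availability_pct, period_days=30):
    """Minutes of down-time an availability target permits over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Hypothetical tiers; real targets should come out of your own SLA review.
tiers = {
    "customer-facing ordering": 99.99,
    "internal reporting": 99.9,
    "development sandbox": 99.0,
}

for app, target in tiers.items():
    minutes = allowed_downtime_minutes(target)
    print(f"{app}: {target}% allows {minutes:.1f} minutes per 30 days")
```

A 99.9% target over 30 days leaves roughly 43 minutes of permitted down-time, while a true ‘no down-time’ target leaves none at all, which is why it is worth asking which applications genuinely need it.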
Responding to a significant platform down-time event is stressful. Addressing what happened is important, but don’t let it give you tunnel vision. Use the opportunity to review the original set-up, align it with what your business needs now, and make sure you plan adequately for the future. If you’re not sure how to proceed, give us a call. Often an outside perspective can help clear the air after a difficult IT situation and provide options for the best way forward.