Stratford Data Centre Incident Report
Below is the incident report provided by TATA outlining the incident and the fix that has been put into place.
| Date | Time in GMT | Details |
|---|---|---|
| 26-Nov-09 | 16:45 | Localized power failure from the grid at Stratford IDC. The generator was started to supply power to the load. |
| 26-Nov-09 | 17:00 | Commercial power was restored and, after a run-down period, the generators dropped out. The changeover mechanism attempted to restore commercial power to the building. |
| 26-Nov-09 | 17:05 | When the generators dropped out, the main low-voltage board was unable to restore commercial power to the building load. During this time, the building was supported by the UPSs and DC battery. |
| 26-Nov-09 | 17:20 | Specialists were called to site, incident bridges were opened, and troubleshooting began. In parallel, senior management was informed and briefed. |
| 26-Nov-09 | 17:30–18:00 | The UPSs supporting the colo rooms within the facility shut down once their autonomy time was exhausted. At this point all colo and managed hosting customers lost power to their equipment. The core IP node and IP connectivity were also lost due to the loss of power. |
| 26-Nov-09 | 18:00 | Troubleshooting showed a fault on the low-voltage (LV) panel: both its mains supply and its backup batteries had failed. This panel must be powered for the breakers to operate and switch commercial power throughout the building. Work on a temporary solution began immediately. |
| 26-Nov-09 | 18:00–22:00 | The IDC team worked on a more robust N+1 solution and prepared to implement it. |
| 26-Nov-09 | 19:25 | Power was restored to the LV panel by means of a temporary external domestic 230 V power source. This enabled the breakers in the LV panel to operate, and power was restored to the building at this point. |
| 26-Nov-09 | 19:45 | The IDC team tested all input, output and battery supplies to the UPSs before attempting to restore them to full load. All UPSs were found to be in normal condition, with the exception of one UPS unit serving managed hosting customers in Colo 5. |
| 26-Nov-09 | 22:20 | All UPSs were synchronized and the load was restored with the assistance of UPS specialists. |
| 26-Nov-09 | 23:00 | The final UPS was tested and restored. |
| 27-Nov-09 | 23:00 | The IDC team sent the details of the final solution to Managed Services and IPNOC. The plan was to issue an emergency maintenance notification for implementing the solution. |
| 27-Nov-09 | 00:05 | The final solution involved a replacement battery string and a more permanent and resilient 230 V supply. The IDC team executed this plan without interruption to power. |
Action Items

| Temporary Fix | Permanent Fix |
|---|---|
| Primary power supply (240 V AC) has been arranged from an alternative, protected supply. | Our UPS and M&E plant specialists have been on site to conduct a detailed analysis of the cause of the fault, extracting logs, interrogating systems and running non-intrusive tests, to help us prevent any such outages in future. Based on this analysis, the necessary preventive measures have been taken. |
This entry was posted on Tuesday, December 1st, 2009 at 1:44 pm and is filed under System Incidents.
It was a big issue but given the speed at which it happened I don’t think there is anything you could have done to make it any quicker or easier – well done Coreix Team!
Dec 01, 2009 @ 4:06 pm
Great job!
Dec 01, 2009 @ 4:57 pm
Stuff happens, all the time. Though it’s a shame that something as simple as providing power to the LV panel so that it can operate had not been foreseen, thus trashing your 100% uptime record. Well done though; you took care of it in just a couple of hours.
Dec 01, 2009 @ 11:24 pm
Three observations from the incident report. First, thank you for your transparency in reporting this. Second, I understand from the IR that the generators were stopped before the load was transferred to utility power? Would it not have been better to leave the generators running until the switch to utility power could be completed? Finally, when it became apparent that the switch back to utility power had failed, could you not restart the generators?
Dec 09, 2009 @ 3:41 pm
Hi Mike,
Thank you for your comments. I will send you an email to answer the questions you raised.
Dec 09, 2009 @ 4:34 pm