Power Maintenance – Sunday 20th Dec 2009
Coreix :: Power Maintenance – Sunday 20th Dec 2009
Location: STA.LON2
---------------------
The facility engineering team will be conducting power maintenance from 02:00 to 06:00 GMT on Sunday 20th Dec 2009.
Scope of work: Functional test of the new enhanced solution.
The commercial supply will be disconnected in a controlled manner by power engineers. This work will confirm the correct operation of the breakers in the Low Voltage equipment when switching the power source over to the generators. A full risk assessment has been carried out and, as an additional precaution, the manufacturers will be onsite along with our data-centre engineers.
Outage: This is a non-service-impacting activity; we do not anticipate any downtime during this testing.
Thank you for your understanding as we continue to improve our services.
_____________________________
STATUS UPDATE
We have successfully completed the scheduled maintenance.
Power Maintenance – Sunday 6th Dec 2009
Coreix :: Power Maintenance – Sunday 6th Dec 2009
Location: STA.LON2
---------------------
The facility engineering team will be conducting power maintenance from 02:00 to 06:00 GMT on Sunday 6th Dec 2009.
Scope of work: Functional test of the new equipment installed during the maintenance on the 4th Dec 2009.
The commercial supply will be disconnected in a controlled manner by power engineers. This work will confirm the correct operation of the breakers in the Low Voltage equipment when switching the power source over to the generators. A full risk assessment has been carried out and, as an additional precaution, the manufacturers will be onsite along with our data-centre engineers.
Outage: We do not anticipate any downtime during this testing.
Thank you for your understanding as we continue to improve our services.
_____________________________
STATUS UPDATE - 6th Dec 2009 04:19
The above-mentioned maintenance activity planned for 6th Dec 2009, 02:00-06:00 GMT, has now been postponed until further notice.
During our pre-change checks we identified a fault which could lead to a degraded service on the cooling system. Further investigation with the relevant engineering staff established that, given a certain sequence of events, full cooling to the data halls could be affected during the maintenance.
After a full risk assessment it was agreed that the increased risk was unacceptable, and the decision was made that the change should not go ahead.
We apologise for the cancellation of this change and will attempt to rearrange it as soon as the risk profile can be reduced to an acceptable level. We will endeavour to give as much notice as possible of the future scheduling of this change.
Power Maintenance – Friday 4th Dec 2009
Coreix :: Power Maintenance – Friday 4th Dec 2009
Location: STA.LON2
---------------------
The facility engineering team will be conducting power maintenance beginning at 03:00 GMT on Friday 4th Dec 2009.
Scope of work: Testing LV switch gear and failover from Primary to Backup power.
Outage: We do not anticipate any downtime during this testing.
Thank you for your understanding as we continue to improve our services.
_____________________________
STATUS UPDATE - 4th Dec 2009 08:00
We have successfully completed the scheduled maintenance. There will be further non-intrusive maintenance, which is currently in the planning stage; this will fully test new components installed last night. We are planning to get another change window this weekend.
Stratford Data Centre Update
Our customers in the TATA Stratford, London facility were directly affected by the power outage on Thursday 26th November 2009. During this period customers did not receive the normal level of service Coreix provides, and we are working with TATA to ensure an incident of this magnitude does not occur again. We have now completed our initial review of this incident and received a preliminary report as to its cause. To ensure transparency and accuracy in this matter we have provided an excerpt from this report below.
Why did this incident occur?
This issue occurred due to a failure of the switch gear that allows us to move between the power grid and backup power sources. This problem prevented the site from returning to grid power once it became available. You can view an excerpt from TATA’s preliminary report at http://status.coreix.net/?p=15, which describes how the facility handled this incident and the solutions currently proposed to ensure this incident is not repeated.
What is Coreix’s major incident procedure?
At the current time our major incident procedure for a power outage is as follows:
- Staffing procedures:
  - Contact all staff not currently on shift to arrange for them to immediately head to the datacentre.
  - All technically capable staff located at other sites to be diverted to the datacentre.
  - One staff member to be designated to deal with ticket management, external monitoring of progress and updating of status page.
- Deploy an automated telephone message on all phone systems pointing clients to http://status.coreix.net.
- Once service has been restored:
  - Network engineers to ensure that all core services come on line while the technical staff ensure that all racks are powered up.
  - Staff to be split half and half between responding to all technical issues within tickets and bringing servers online based on monitoring alerts.
- Once all monitoring alerts and support tickets have been dealt with:
  - Restore telephone service.
  - Start investigation into root cause of issue.
While we understand that customers would like to be able to contact us by telephone in this situation, we feel this is not viable as it would double, if not quadruple, the time taken for us to restore service to all clients.
What happened on the day?
On the day we were able to contact the majority of our staff and had a double complement of staff on site before power was restored to the site. The phased start-up proceeded according to plan, with core services being returned within 10 minutes of power on and all racks powered up within 15 minutes of power on.
Due to a DNS failover error our status page was not available and we were unable to resolve this issue until power was returned to the site; at this point we manually pointed the DNS to our offsite backup.
Once power was restored, all mail sent during this period spooled into our support desk. A mass e-mail was sent to all open tickets to confirm power had been restored, which filtered out clients that had returned to full service. This procedure allowed us to concentrate solely on clients with continuing issues.
Our first priority thereafter was getting customers back online whose servers had not powered up cleanly after the hard reboot. As customer uptime is a core principle of our service, we worked non-stop until we were happy that all clients were back online. At this time we re-enabled our phone system and moved back to providing our standard service.
What has Coreix learnt from this incident?
While we feel that the technical element of our response ran as efficiently as possible, there are several areas where we can see room for improvement, mainly in relation to customer contact during this period:
- Status Page
This we feel was our major failing; our status page was set up to failover from an onsite location to an offsite location in the event of a power or network outage. This process did not work as expected and the issue was compounded by the fact that we could not update the DNS to manually point to the offsite location until our primary DNS came on line. This was due to the fact that all records replicate out from this server. We have already resolved this for future incidents by permanently running this service offsite.
- Ticket Management
While we were able to rapidly deal with all technical issues, we found limitations in utilising the ticket system to prioritise issues when we had a couple of hundred issues open at once. This resulted in there being no response to tickets until a reply saying the issue had been resolved. We understand that this is not ideal for our clients and we are currently trialling alternative systems to allow us to queue tickets in a manner that will allow us to provide more feedback.
The future
With regard to the site itself, while we are concerned about this incident and will be investigating it further to ensure that this issue does not occur again, we feel that we must point out that this site has been extremely reliable in the past and that this is the first power incident in our five years here. The site held load perfectly through at least 6 or 7 other local power issues in this time and is in fact one of the few sites in London that had not had any power issues until now.
Closing Statement
On behalf of Coreix, I sincerely apologise for the disruptions to your business and for the lack of information provided during this trying time. We fully understand that these failures negatively impact the businesses of our customers. Coreix has been working hard since this incident occurred to ensure that in any subsequent event we will be able to provide improved communication channels to allow all clients to receive regular updates without slowing the technical resolution down.
If you would like to talk to us further about this incident please contact info@coreix.net and we will be happy to assist you. If you would additionally like us to keep you informed regarding the changes to procedures made by both TATA and Coreix in relation to this incident please contact the Operations Director, Shazad Boota via s.boota@coreix.net.
As always, your feedback is welcome. Please be honest with us about your expectations and how we can improve our service. We will do our best to restore your trust in us and I want to thank you on behalf of Coreix for standing by us during this difficult period. I’d also like to extend our thanks to our technical team for their tireless efforts to deliver our service and support our clients over this period.
Shazad Boota
Operations Director, Coreix Limited
Stratford Data Centre Incident Report
Below is the incident report provided by TATA outlining the incident and the fix that has been put into place.
| Date | Time in GMT | Details |
|---|---|---|
| 26-Nov-09 | 16:45 | Localized Power failure from Grid at Stratford IDC. The generator was kicked off to supply the power to the load. |
| 26-Nov-09 | 17:00 | The commercial power was restored and after a period of run down time, the generators dropped out. The changeover mechanism attempted to restore the commercial power to the building. |
| 26-Nov-09 | 17:05 | When the generators dropped out, the main low voltage board was unable to restore the commercial power to the building load. During this time, the building was being supported by UPS’s and DC battery. |
| 26-Nov-09 | 17:20 | Specialists were called to site and incident bridges were opened and troubleshooting began. In parallel, senior management was made aware and was briefed. |
| 26-Nov-09 | 17:30 – 18:00 | The UPS’s supporting the numerous colo rooms within the facility, having supported the load, switched off once autonomy time had been exhausted. At this time all colo and managed hosting customers had lost power supply to the respective equipments. Also the Core IP node and IP connectivity was lost due to loss of power. |
| 26-Nov-09 | 18:00 | Troubleshooting showed that a problem existed on the Low Voltage panel whereby the mains supply and backup batteries had failed. This panel must be powered to enable breaker operation to switch commercial power throughout the building. The design of a temporary solution was immediately worked upon. |
| 26-Nov-09 | 18:00 – 22:00 | The IDC team had worked on a more robust N+1 solution and they were preparing to implement this solution. |
| 26-Nov-09 | 19:25 | Power was restored to the LV panel by means of a temporary domestic external 230v power source. This enabled operation of the breakers in the LV panel and we were able to restore power to the building at this point. |
| 26-Nov-09 | 19:45 | IDC team tested all input, output and battery supplies to the UPS’s before attempting to restore them to full load. All UPS’s were found to be in a normal condition with the exception of one UPS unit serving Colo 5 to managed hosting customers. |
| 26-Nov-09 | 22:20 | All UPS’s were synchronized and the load was restored with the assistance of UPS specialists. |
| 26-Nov-09 | 23:00 | The final UPS was tested and restored. |
| 27-Nov-09 | 23:00 | IDC team had sent the information regarding the final solution to Managed Services & IPNOC. The plan was to invoke the emergency maintenance notification for implementing the solution. |
| 27-Nov-09 | 00:05 | The final solution involved a replacement battery string and a more permanent and resilient 230v supply. This plan was executed by IDC team without interruption to power. |
Action Items

| Temporary Fix | Permanent Fix |
|---|---|
| Primary power supply (240 V AC) has been arranged from an alternative, protected supply. | To conduct a detailed analysis, our UPS and M&E plant specialists have been on site to investigate the cause of the fault, extracting logs, interrogating systems and carrying out non-intrusive tests to help us prevent any such outages in future. Based on the above analysis, the necessary preventive measures have been taken. |
Power Outage at TATA Stratford Facility
At approximately 16:48 GMT the Stratford, London facility lost mains power from the power grid. The time-line of events is as follows:
16:48 - Power to site lost - running on UPS.
17:15 - UPS systems depleted, generators failed to start.
17:30 - Generator failovers failed to function despite multiple attempts. The power engineers were dispatched.
18:32 - The power engineers arrived on site.
18:54 - The power engineer estimates 30-45 minutes to return power to site.
19:16 - The power was returned to site and the process of booting up each rack commenced.
19:35 - All racks powered up and brought on-line.
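For readers who want the elapsed times rather than clock times, the timeline above can be turned into minutes since the loss of grid power. The following is only an illustrative sketch: the event labels are paraphrased from the list above, the helper names (`minutes_between`, `elapsed_table`) are made up for this example, and it assumes all times fall on the same day.

```python
from datetime import datetime

# Timeline entries taken from the list above (GMT, all on the same day).
# The labels are paraphrased for brevity.
TIMELINE = [
    ("16:48", "Power to site lost, running on UPS"),
    ("17:15", "UPS systems depleted, generators failed to start"),
    ("17:30", "Generator failovers failed, power engineers dispatched"),
    ("18:32", "Power engineers arrived on site"),
    ("18:54", "Engineer estimates 30-45 minutes to return power"),
    ("19:16", "Power returned to site, rack boot-up commenced"),
    ("19:35", "All racks powered up and brought on-line"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both 'HH:MM' on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def elapsed_table(timeline):
    """Return (time, minutes since the first event, label) for each entry."""
    first = timeline[0][0]
    return [(t, minutes_between(first, t), label) for t, label in timeline]


if __name__ == "__main__":
    for t, mins, label in elapsed_table(TIMELINE):
        print(f"{t}  +{mins:3d} min  {label}")
```

On these figures, 167 minutes (2 hours 47 minutes) passed between the loss of grid power at 16:48 and all racks being back on-line at 19:35, of which 121 minutes fell between the UPS systems depleting at 17:15 and power returning at 19:16.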
The last twenty-four hours have seen numerous power grid blips, but the system as designed has taken the load, with the UPS battery backup and generators keeping the facility live and ensuring an uninterrupted service.
Tonight, however, at 16:48, a failure in the control boards prevented the generators from powering the facility once the power from the grid failed. A manual bypass was installed to get the facility on-line.
The facility power engineers are continuing to monitor the facility and extra staff were drafted in to help.
If you are still experiencing any technical issues please contact the support desk via https://support.coreix.net/. If you have any further questions or would like any clarification, please do not hesitate to contact the Operations Director, Shazad Boota, directly at s.boota@coreix.net.
As a valued customer, we thank you for your understanding during this period.