Outages
02/11/2015
Affected: All Services
Disruption length: 5 hours
Service: Severely degraded
Fault status page: http://status.ovh.com/?do=details&id=11304
Lost transactions: Minimal. Our critical processes were maintained.
At approximately 9:09 AM Second Life time on the 2nd of November, 2015, our hosting partners suffered a fiber cut in a tunnel a few kilometers from the datacentre. This severed all of the datacentre's routes, with the exception of one very slow backup line.
We were aware of the problem immediately, and were in touch with the datacentre within minutes. They provisioned a special route for us which we could use for absolutely essential traffic only, flowing over the one remaining backup line.
As a result, all of our services were severely degraded. However, we prioritised transaction and payment traffic over the backup route, so thankfully we were still processing transactions (around 10,000 transactions were processed during the outage, with roughly a 95% success rate).
The datacentre has four main trunks over two distinct physical paths: two heading north towards Montreal, and two heading south towards Newark. When a similar outage affected us in May, the two southern links had not yet been established (mostly due to a long, four-year bureaucratic process involved in laying cable across the international border between the US and Canada). OVH (the hosts) assured us then that they would have the new link to Newark in place by September.
Unfortunately, these new redundant southern routes were still not in place when the fiber was cut, although by pure coincidence they were only a couple of days away from being installed. Since the new routes were so close to being finalised, OVH was able to rush them into service. At 2pm SL time, the uplink was established and CasperTech services were restored.
We declared "ALL CLEAR" at 5:19pm SL time, once the datacentre had confirmed that the new routes were stable.
The links which had been severed were finally spliced together and restored at 9:10pm SL time.
We now have two physically redundant routes to the datacentre, so any future fiber cut is very unlikely to cause a disruption in service.
28/05/2015
Affected: All Services
Disruption length: 1 hour 30 minutes
Service: Degraded
Fault status page: http://status.ovh.com/?do=details&id=9603
Lost transactions: Minimal. Our critical processes were maintained.
At approximately 6pm SL time, a car collided with a telegraph pole belonging to our hosting partner, which severed connectivity to CasperTech services. The pole was located between the datacentre and Montreal.
We were aware of the problem immediately, and were on the phone right away with the datacentre to discuss the problem.
While the datacentre does have several redundant links, they all currently run along a single physical route. This is clearly unacceptable, and we have contacted the host to establish what their redundancy plan is for the future.
We have been assured that this situation will improve as early as July, and will be fully resolved by September, when their southern route to Newark is installed.
Service was restored at approximately 7:30pm SL time.