Outages
28/12/2015
Affected: CasperVend, CasperLet
Disruption length: 31 minutes
Service: Severely degraded
Fault status page: N/A - Internal
Lost transactions: CasperVend: None. CasperLet: None, providing in-world scripts are up-to-date.
We were performing maintenance on a database table, in order to increase the size of the "inventory name" field used for associating items with CasperVend. This is a huge operation that takes hours (it's a 35 GB table).
With operations like this, we normally direct all database traffic to a single server while we work on another. Unfortunately, on this occasion we forgot to stop replication, so the operation automatically propagated to the next server (the one selected as "production" and handling all the traffic). That locked the table there, connections built up, and eventually the server stopped accepting new connections.
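For illustration, here is a minimal sketch of the intended procedure, assuming a MySQL-style replicated setup driven from Python with the PyMySQL library; the host, credentials, database, table and column names are hypothetical, not our real configuration. The key step is keeping the schema change out of replication so it only touches the server that has been taken out of rotation:

 import pymysql
 
 # Hypothetical connection to the server taken out of rotation for maintenance.
 # Host, credentials, database, table and column names are placeholders.
 conn = pymysql.connect(
     host="db2.example.internal",
     user="maint",
     password="secret",
     database="caspervend",
     autocommit=True,
 )
 
 with conn.cursor() as cur:
     # Keep this session's statements out of the binary log, so the ALTER
     # cannot replicate onward to the production server. (Skipping this step
     # is what allowed the operation to reach production and lock the table.)
     cur.execute("SET SESSION sql_log_bin = 0")
 
     # Widen the "inventory name" column. On a ~35 GB table this rebuilds the
     # table and can hold locks for hours, which is why the server is taken
     # out of rotation first.
     cur.execute(
         "ALTER TABLE vend_inventory "
         "MODIFY COLUMN inventory_name VARCHAR(255) NOT NULL"
     )
 
 conn.close()

In a setup like this, the same change would then be repeated on each remaining server once traffic has been moved off it.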
The outage began at 4:52 PM SLT. Norsk found the issue and reported it to Casper at 5:21 PM, and it was fixed at 5:23 PM.
Due to our failovers, no CasperVend transactions were lost. Those using the very latest rental script (v1.32) should have also seen no issues. Those using older scripts may have experienced lost sales.
We apologise for this incident (which was our fault) and we will be reviewing our processes to ensure that this doesn't occur again.
02/11/2015
Affected: All Services
Disruption length: 5 hours
Service: Severely degraded
Fault status page: http://status.ovh.com/?do=details&id=11304
Lost transactions: Minimal. Our critical processes were maintained.
At approximately 9:09 AM Second Life time on the 2nd of November, 2015, our hosting partners suffered a fiber cut in a tunnel a few kilometres from the datacentre. This severed all of the datacentre's routes, with the exception of one very slow backup line.
We were aware of the problem immediately, and were in touch with the datacentre within minutes. They provisioned a special route for us which we could use for absolutely essential traffic only, flowing over the one remaining backup line.
The result of this was that all of our services were severely degraded. However, we were prioritising transaction and payment traffic over the backup route, so thankfully we were still processing transactions (around 10,000 transactions were processed during the outage, with roughly a 95% success rate).
The datacentre has four main trunks over two distinct physical paths: two heading north towards Montreal, and two heading south towards Newark. Following a similar outage that affected us in May, the two southern links were planned but had not yet been established (mostly due to a lengthy four-year bureaucratic process involved in laying cable across the international border between the US and Canada). At the time, OVH (the hosts) assured us that they would have the new link to Newark in place by September.
Unfortunately, these redundant southern routes were still not in service, but by pure coincidence they were only a couple of days away from being installed. Because the new routes were so close to being finalised, OVH were able to rush them into service. At 2pm SL time, the uplink was established and CasperTech services were restored.
We declared "ALL CLEAR" at 5:19pm SL time, once the datacentre had confirmed that the new routes were stable.
The links which had been severed were finally spliced together and restored at 9:10pm SL time.
This means we now have two physically redundant routes to the datacentre, so any future fiber cut is very unlikely to cause a disruption in service.
28/05/2015
Affected: All Services
Disruption length: 1 hour 30 minutes
Service: Degraded
Fault status page: http://status.ovh.com/?do=details&id=9603
Lost transactions: Minimal. Our critical processes were maintained.
At approximately 6pm SL time, a car collided with a telegraph pole belonging to our hosting partner, severing connectivity to CasperTech services. The pole was located between the datacentre and Montreal.
We were aware of the problem immediately, and got on the phone with the datacentre right away to discuss it.
While the datacentre does have several redundant pipes, they all currently run along a single physical route. This is clearly unacceptable, and we have contacted the host to establish their redundancy plan for the future.
We have been assured that this situation will be improved as early as July, and will be completely solved by September, when their southern route from Newark will be installed.
Service was restored at approximately 7:30pm SL time.