Outages: Difference between revisions

From CasperTech Wiki

Revision as of 02:23, 29 December 2015

28/12/2015

Affected: CasperVend, CasperLet

Disruption length: 28 minutes

Service: Severely degraded

Fault status page: N/A - Internal

Lost transactions: None, provided in-world scripts are up to date.

We were performing maintenance on a database table to increase the size of the "inventory name" field used for associating items with CasperVend. This is a huge operation which takes hours (it's a 35 GB table).

With operations like this, we normally direct all database traffic to a single server while we work on another. Unfortunately, on this occasion we forgot to stop replication, so the operation was automatically replicated to the next server, which was the selected "production" server handling all the traffic. That locked the table on the production server, connections built up behind the lock, and eventually the server stopped accepting connections.
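
The safeguard described above (move traffic away, stop replication, then start the long table change) can be expressed as a pre-flight check. The following is a minimal sketch only, not our actual tooling: the `ServerState` class, the `preflight_problems` function and the server names `db-a` and `db-b` are all hypothetical, and the state is supplied by hand rather than read from a real database.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServerState:
    """Hypothetical snapshot of one database server before maintenance."""
    name: str
    serves_production: bool    # is live traffic currently routed here?
    replication_running: bool  # would changes made here be replicated onward?


def preflight_problems(target: ServerState, others: List[ServerState]) -> List[str]:
    """Return the reasons it is unsafe to start a long table change on `target`."""
    problems = []
    if target.serves_production:
        problems.append(f"{target.name} is still serving production traffic")
    if target.replication_running:
        problems.append(f"replication from {target.name} is still running")
    if not any(s.serves_production for s in others):
        problems.append("no other server is serving production traffic")
    return problems


# Illustrative state resembling this incident: traffic had been moved away,
# but replication was left running on the server under maintenance.
maintenance = ServerState("db-a", serves_production=False, replication_running=True)
production = ServerState("db-b", serves_production=True, replication_running=False)

problems = preflight_problems(maintenance, [production])
for problem in problems:
    print("BLOCKED:", problem)
```

An empty list means the table change may proceed. Any entry in the list should stop the operation before it starts.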

The outage began at 4:56PM SLT. Norsk found the issue and reported it to Casper at 5:21PM, and it was fixed at 5:23PM.

Due to our failovers, no CasperVend transactions were lost. Those using the very latest rental script (v1.32) should have also seen no issues. Those using older scripts may have experienced lost sales.

We apologise for this incident (which was our fault) and we will be reviewing our processes to ensure that this doesn't occur again.

02/11/2015

Affected: All Services

Disruption length: 5 hours

Service: Severely degraded

Fault status page: http://status.ovh.com/?do=details&id=11304

Lost transactions: Minimal. Our critical processes were maintained.

At approximately 9:09 AM Second Life time on the 2nd of November, 2015, our hosting partners suffered a fiber cut in a tunnel a few kilometers from the datacentre. This severed all of the datacentre's routes, with the exception of one very slow backup line.

We were aware of the problem immediately, and were in touch with the datacentre within minutes. They provisioned a special route for us which we could use for absolutely essential traffic only, flowing over the one remaining backup line.

The result of this was that all of our services were severely degraded. However, we were prioritising transaction and payment traffic over the backup route, so thankfully we were still processing transactions (around 10,000 transactions were processed during the outage, with roughly a 95% success rate).
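
To illustrate the idea of prioritising essential traffic over a constrained link, here is a minimal sketch. It is not how the prioritisation was actually implemented (that was done at the network level with the datacentre): the traffic class names, the `PRIORITY` table and the `schedule` function are hypothetical and only show the principle of sending payment and transaction traffic first when capacity is limited.

```python
import heapq

# Hypothetical traffic classes; a lower number means a higher priority.
PRIORITY = {"payment": 0, "transaction": 0, "web": 1, "stats": 2}


def schedule(requests, capacity):
    """Pick up to `capacity` requests, highest priority first.

    Requests of equal priority keep their arrival order.
    """
    heap = [(PRIORITY[kind], order, kind) for order, kind in enumerate(requests)]
    heapq.heapify(heap)
    sent = []
    while heap and len(sent) < capacity:
        _, _, kind = heapq.heappop(heap)
        sent.append(kind)
    return sent


queue = ["web", "payment", "stats", "transaction", "web"]
sent = schedule(queue, capacity=3)
print(sent)  # ['payment', 'transaction', 'web']
```

With room for only three requests, both critical requests are sent and non-essential traffic takes whatever capacity is left. At the figures quoted above, a 95% success rate on around 10,000 transactions corresponds to roughly 9,500 completed transactions.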

The datacentre has four main trunks over two distinct physical paths: two heading north towards Montreal, and two heading south towards Newark. When a similar outage affected us in May, the two southern links had not yet been established (mostly due to a long, four-year bureaucratic process involved in laying cable across the international border between the US and Canada). At that time, OVH (the hosts) assured us that they would have the new link to Newark in place by September.

Unfortunately, these new redundant southern routes were not yet in place when the fiber was cut. By pure coincidence, however, they were only a couple of days away from being installed, so the datacentre was able to rush them into service. At 2pm SL time, the uplink was established and CasperTech services were restored.

We declared "ALL CLEAR" at 5:19pm SL time, once the datacentre had confirmed that the new routes were stable.

The links which had been severed were finally spliced together and restored at 9:10pm SL time.

We now have two physically redundant routes to the datacentre, so any future fiber cut is very unlikely to cause a disruption in service.

28/05/2015

Affected: All Services

Disruption length: 1 hour 30 minutes

Service: Degraded

Fault status page: http://status.ovh.com/?do=details&id=9603

Lost transactions: Minimal. Our critical processes were maintained.

At approximately 6pm SL time, a car collided with a telegraph pole belonging to our hosting partner, which severed connectivity to CasperTech services. The pole was located between the datacentre and Montreal.

We were aware of the problem immediately, and were on the phone right away with the datacentre to discuss the problem.

While the datacentre does have several redundant pipes, they are all currently running along a single physical route. This is clearly unacceptable and we have contacted the host in order to establish what their redundancy plan is for the future.

We have been assured that this situation will be improved as early as July, and will be completely solved by September, when their southern route to Newark is installed.

Service was restored at approximately 7:30pm SL time.