Dear Customers,
As many of you are aware, our hosting service provider 1-grid experienced a large-scale outage on the evening of Monday, 31 January 2022, affecting the vast majority of our customers. They have provided a summary of what happened, why the issue took so long to resolve, and what they are doing to reduce the risk of a recurrence.
From 1-grid:
Firstly, I would like to personally apologise to all affected customers. This is unacceptable and we let you down. This shouldn’t have happened and whilst I would like to say it was outside our control, our choice of suppliers and the extent to which we audit their setup is something we should have done better on. Saying that it affected hundreds of other companies doesn’t make it better. All we can do at this point is be brutally honest about the causes with our customers.
Current State
We were in the process of migrating from our current datacentre (owned by Old Mutual and run by Africa Datacentres) to a new site at the Africa Datacentres Diep River facility. The incident last night was not related to the migration; however, its impact affected both sites, and this requires a bit of explanation.
We have now moved all physical servers to our new site in Diep River, and they are in the process of being brought back onto our network.
What happened last night?
Last night, we saw every single link from Pinelands to our other datacentres go down at the same time: our transit (Internet connectivity), two links to Teraco and the link to Diep River. This is a bit like a plane with four engines having all of them stop mid-flight at the same time. This caused an outage for most customers who were routed in Pinelands, even if their server had been moved to Diep River. The root cause was a Liquid Networks issue with a major failure in Pinelands.
We proceeded to carry out an emergency migration of routing from Pinelands to Diep River (something we were not scheduled to do quite yet). Because we are moving vendors for the equipment doing this routing, the process is a bit more complicated than it would otherwise be. Nonetheless we are doing this as we believe it will help get some customers back before Liquid address their own equipment failure. We have also made some temporary changes to bring customers up quicker in some cases. We’re working hard to get everyone up by the morning.
Tonight we have also been physically moving all the remaining servers in Pinelands to Diep River, as Liquid have indicated to us that they cannot fix the issue overnight. This is clearly not acceptable; however, we would much rather take control of the situation and look after our customers. This is a significant undertaking, as the move was supposed to be spread over several nights across the next couple of weeks. Staff from across the company, from our directors to our technical team, are on hand to help with the move.
The technically minded among you will wonder ‘why didn’t you just move routing as you went along?’ – that’s a very good question. We have a lot of legacy setup from years of acquisitions and we’ve made iterative improvements to increase capacity, improve resilience and remove some legacy issues like large broadcast domains. Nevertheless, servers on the same VLAN may be in different sites during the migration, so this process would have been impossible to do perfectly, and we believed our three separate links between the sites would have been sufficient protection against any incidents.
We also found that our out-of-band access wasn’t working as expected; we have a setup that allows us to get into our routers even if our network is down. We have used this over the past few weeks; however, the setup had a glitch at the same time as the outage. This was neither the cause of the problem nor a result of it, but it delayed us starting the diagnostic process. We will learn from that too and improve our monitoring.
The Future
We don’t propose to explain all the changes we will make in this post. We need to do a full incident analysis for that. However, in the meantime we would like to add a few comments.
Firstly, I would once again like to personally apologise to all affected customers.
Secondly, as part of the datacentre move we already have plans to add another transit link in Teraco with a third party to work around Internet routing issues that sometimes arise (outside our control but which still affect our customers, and which we can ‘work around’). In addition to this, we will be re-examining the links Liquid provide to us to fully understand their setup and see whether we need another carrier for another link.
Smaller companies always suffer the most in the event of an incident, whilst larger ones (multinationals especially so) are assumed to simply suffer bad luck when things go badly wrong. We want to be open with our customers about why this was so unexpected: whilst we fully accept responsibility, it was the result of a failure that was difficult to anticipate. Nonetheless, we should have questioned our suppliers’ assurances more. For that, we apologise unreservedly.
Yours sincerely,
Thomas Vollrath, Morne Patterson and the entire 1-grid team (who have worked tirelessly overnight to resolve this issue).