It was a beautiful evening on Saturday, the 9th of December 2017. Back then I was still working at Bigpoint, a games company based in Hamburg, Germany.
A few friends and I were watching Star Wars (in Machete Order - to whom it may concern) at another friend's house.
It was my first weekend on-call and I was unsuspecting. And so the story begins.
At around 10 pm that evening I received some strange calls.
“We are getting reports about players not being able to play Farmerama”.
“Do you know anything about why DarkOrbit instance XZY is down?” “Are you performing maintenance on a weekend?!”.
And other - similar sounding - messages.
So I went on and tried to reach the websites of our games. Nothing. Sh*t.
That is when I felt that something was fishy: radio silence from the monitoring. It should really have alerted by now.
I went to log in to the VPN in order to check on the monitoring. No success: the connection was established fine, but the VPN authentication timed out.
That is the point where I made the call that would ruin my whole weekend.
“We are completely down”, I told both my shadow and our superior on the phone.
There was no point of entry into the infrastructure. Or so it seemed at first.
I remembered that I had bootstrapped a machine with a public IP a few days earlier while building some automation. It was still running - and I was in.
Kinda. I was inside the Bigpoint network, but that quite honestly didn’t do sh*t. DNS was out, and all critical machines, such as our main monitoring, were running on Kerberos authentication (for which you need VPN or Office access).
The back-door to the amusement park was open, but the cashier guarding the entrance wanted to see my ticket. (Quite literally: a Kerberos ticket.) This is when we hit rock bottom for a while.
A thing I learned to love at Bigpoint was our Server DB. It is a service (plus CLI tool) that contains pretty much all the info about all our servers: DNS records, reverse-DNS records, hosted games, network information, IP allocation. Anything infrastructure-related was in that thing. Awesome! SSH did not work, though, and the Server DB’s ‘API’ was (and is) a hot mess - but I had just developed a platform-independent interface tool for exactly that mess of an API.
So after some SSH tunneling I finally got access and could query my way to the IPs of machines that I deemed important. But that, too, turned out to not be very helpful. I needed to get my notebook onto that network somehow.
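For the curious, the tunneling step was roughly of this shape - a plain SSH local port forward through the one box that was still reachable. All host names and ports below are hypothetical placeholders, not the real ones:

```python
import subprocess

def tunnel_cmd(local_port, target_host, target_port, bastion):
    """Build an ssh local-forward command: traffic sent to
    localhost:local_port is relayed through the bastion to the target."""
    return [
        "ssh", "-N",  # -N: forward ports only, don't open a remote shell
        "-L", f"{local_port}:{target_host}:{target_port}",
        bastion,
    ]

# Hypothetical example: reach the Server DB HTTPS API via the bastion.
cmd = tunnel_cmd(8443, "serverdb.internal", 443, "bastion.bigpoint.example")
print(" ".join(cmd))
# To actually open the tunnel you would run: subprocess.run(cmd)
```

With a tunnel like this up, the Server DB API is reachable on `localhost:8443` even though the VPN is dead.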
It is around midnight now. I abandon my backdoor and pack my things to make my way to the Office. Why? Well, the Office is hardwired to our datacenter in Nuremberg (via some VPNs that actually work most of the time), plus there is working Kerberos there :)
Arriving after about half an hour, I found out that I had made that plan without consulting our Office security guy. (Poor guy, btw. Has to guard and do tours…) The entrance was locked while he did his tour, and the emergency phone just vibrated on the desk in the lobby. Not so hot. (Pun intended; it was cold outside.) Thirty-somewhat minutes later he finally showed up and I could badge into the Office.
It’s now 1 am; three hours of being completely knocked off the face of the internet. Equipped with the power of a direct connection, I was finally able to check on our monitoring, databases etc. Everything - and I mean everything - was red.
Okay, it’s on! Conference call engaged.
We (my shadow, my superior and me) split up to cover as much ground as possible. The first thing we found out is that the machine handling in-line 2FA for our VPN had been knocked offline in the same way all the other machines were.
Now, first off: VMware was technically not at fault; it was mainly our legacy-driven, wonky setup of it. But still, it ultimately sealed the deal of knocking us out.
A little detour on Bigpoint Infrastructure. (As detailed as I can vaguely express.)
Imagine you have multiple availability zones, a big VMware cluster (at one point the biggest install for a company in northern Germany) and a huge, all-flash NetApp storage (plus an old EMC one that was about to be decommissioned). Previously the cluster was divided into multiple smaller clusters that each got their own share of disk. This methodology was kept for the new NetApp storage: it was ‘partitioned’ (for lack of a better word) into smaller chunks to be distributed to the clusters - only now each cluster could also access multiple ‘partitions’.
And we had snapshots enabled. <– This was our biggest mistake.
Due to snapshots being taken by the NetApp, any space freed inside the VMs and any compactions done by VMware did not reduce the utilized size on the storage. So VMware started doing storage vMotions for the VMs - which also did not release any space on the originating partition, but did fill up the target partition so much that VMware then did a storage vMotion to yet another datastore. Rinse and repeat until all storage was exhausted.
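The feedback loop is easier to see in a toy model (datastore names and sizes below are invented for illustration, not our real numbers): a storage vMotion puts a full copy of the VM on the target, while snapshots keep the old blocks pinned on the source - so the evacuation frees nothing and only spreads the problem.

```python
# Toy model of the snapshot + storage-vMotion feedback loop.
CAP = 100  # nominal capacity of each datastore in GB (made up)

live = {"ds1": 90, "ds2": 40, "ds3": 40}   # GB used by live VM data
pinned = {"ds1": 0, "ds2": 0, "ds3": 0}    # GB held only by array snapshots

def svmotion(size, src, dst):
    """Move `size` GB of VM data from src to dst. The target gains the
    full copy, but snapshots keep the old blocks allocated on the source."""
    live[src] -= size
    pinned[src] += size   # "freed" blocks stay pinned by snapshots
    live[dst] += size

# ds1 runs hot, so 50 GB of VMs get evacuated to ds2:
svmotion(50, "ds1", "ds2")
# ds1 is still at 90 GB used (40 live + 50 pinned), and now ds2 is at 90
# of CAP=100 too - so VMware evacuates ds2 to ds3, and the cascade rolls on:
svmotion(50, "ds2", "ds3")

print({ds: live[ds] + pinned[ds] for ds in live})
# → {'ds1': 90, 'ds2': 90, 'ds3': 90}  # every datastore ends up nearly full
```

Until the snapshots are deleted (or stop pinning blocks), no amount of moving or deleting VMs actually returns space to the array.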
At this point pretty much all VMs are in limbo; storage vMotions on 2.5k VMs are creating so much I/O that the I/O inside the VMs effectively stalls and latency → ∞. This caused the (Linux) VMs to remount their disks as read-only, making pretty much everything horrible. And it’s not getting any better.
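The read-only remount is standard ext4 behavior with the default `errors=remount-ro` mount option: on an I/O error the kernel flips the filesystem to read-only. A quick (hypothetical) triage helper for spotting such victims from `/proc/mounts` output might look like this:

```python
def readonly_mounts(proc_mounts_text):
    """Return (device, mountpoint) pairs mounted read-only.
    Expects /proc/mounts format: dev mountpoint fstype options dump pass."""
    hits = []
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        dev, mnt, _fstype, opts = fields[:4]
        if "ro" in opts.split(","):  # 'ro' flag among the mount options
            hits.append((dev, mnt))
    return hits

# Sample input; on a live box you would read open("/proc/mounts").read()
sample = """\
/dev/sda1 / ext4 ro,relatime,errors=remount-ro 0 0
/dev/sdb1 /data ext4 rw,relatime 0 0
"""
print(readonly_mounts(sample))
# → [('/dev/sda1', '/')]
```

A filesystem in that state usually needs a clean reboot (or fsck plus `mount -o remount,rw`) to come back - which is exactly what the cleanup later looked like.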
Anyways, we now decided that we needed to free up space. At that time we did not yet know that the snapshots were the problem, nor did we know about the storage vMotions. So we deleted some VMs that were easy targets and not critical for business, to regain some space.
My first targets were database replications of some Drakensang instances (again, sorry to you guys ❤️), freeing up around half a terabyte in a few minutes. It helped a bit, but storage utilization was still rising. We did not really care about that at first. But then we found out that VMware was doing a lot of storage vMotions. Like, all the time. Continuously, for - every - freaking - VM. And we had 2.5k of those.
After deleting lots and lots of redundant systems, we finally got VMware stable again.
The thing was that most of the core infrastructure in our main datacenter was still down - most notably the box that handles MFA for the VPN. That core infra was the first priority to restore (well, obviously…).
After the network and infra were up again, it was time for damage assessment. There were lots of machines that had to be gently kicked into rebooting in order to get them to operate as usual.
Luckily there was no permanent damage to any filesystems, except for some f*ing NFS shares. But a reboot took care of those as well.
It was about 8 am. The last machine that was still in a not-so-nice state refused to get a Kerberos ticket. I just could not get it to obtain one. It was not a very important machine, so I just went home.
Once my colleague arrived at the office, he did the one thing I had not tried again: reboot the node. And of course that fixed it. Come on!?
While I was sleeping, the rest of the IT team was busy running a root cause analysis. This is where we found out that the snapshots on the NetApp storage were indeed the root cause for all of this mess.