Kubernetes Failure Stories

A compiled list of Kubernetes failure stories. Not how Kubernetes itself fails, but war stories about how companies experienced Kubernetes failures and the postmortems of those events. A great resource and a bit of transparency into what happens in the real world.

Website - https://k8s.af/
Henning Jacobs, Head of Dev Productivity, Zalando - https://twitter.com/try_except_
  1. Nordstrom - 101 ways to crash your cluster
  2. Target Cascading Failures
  3. Moonlight
  4. Skyscanner
  5. Zalando
  6. Monzo public postmortem

Episode Transcription

Welcome back to The Byte. In this episode, we're going to be learning about Kubernetes failure stories, and I'm not talking about Kubernetes as a product failing, but about actual clusters failing in production and the lessons learned, so postmortems.

A gentleman by the name of Henning Jacobs, Head of Development Productivity at Zalando, and his Twitter is @try_except_, which is pretty cool, made it his mission to learn from other companies: how other people actually dealt with issues, how they documented them, and what their next steps were in finding the way forward. I guess his motivation came from KubeCon 2017, where Nordstrom, a giant retailer in the US, talked about 101 ways to crash a cluster, which is quite interesting. He took it upon himself to start a public repo on GitHub listing all the failures he could find, publicly telling people what happened and how to resolve these issues.

Companies such as Zalando, Skyscanner, Target, Nordstrom, Toyota, Monzo, and AirMap are on it. There's a list of about 20 to 25 stories, very, very good. Of the stories that I took a look at, Target was one of the big ones for me. Target's an American retailer, and they talk about how a cascading failure from logging sidecars caused so many elections in Consul that it actually took Consul offline. The logging was congesting the mesh traffic so much that Consul couldn't even register a new node. This is a major deal. This is a production system at a major, major retailer, and it just goes offline. It has a cascading effect because as more services kept trying to elect, it just kept going further and further down the rabbit hole of chaos, and the fix was actually to enable encryption on Consul, which then threw out all the bad messages. It's a really cool story.

Daniel Woods, the gentleman in charge of architecture at Target, details exactly what happened, their steps to remedy it, and what their plan is for getting away from these types of problems in the future.

The other story is Moonlight, and Moonlight is like moonlighting for jobs. It's like a job portal, quite interesting. This was another failure story, where the scheduler was assigning multiple pods with high CPU to the same node. That node went into kernel panic and crashed, and the scheduler then took these same pods and tried to reschedule them on another node, and just crashed the next node. You can see it just kept cascading and getting worse and worse as more and more nodes went offline. It became a huge issue at this production level. What fixed it was setting some anti-affinity rules, and then the nodes started repairing themselves. Again, a great postmortem on this, as well.
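To make the anti-affinity idea concrete, here is a minimal sketch in Python that builds the relevant part of a pod spec as a plain dictionary. The episode only says that "some anti-affinity rules" were set, so the exact rule is an assumption: this one uses a hard `podAntiAffinity` rule keyed on `kubernetes.io/hostname`, which asks the scheduler not to place two pods with the same `app` label on one node. The label value and image name are hypothetical.

```python
import json


def anti_affinity_pod_spec(app_label: str, image: str) -> dict:
    """Build a pod spec whose pods refuse to share a node with same-labelled pods."""
    return {
        "affinity": {
            "podAntiAffinity": {
                # Hard rule: the scheduler must honour it when placing the pod.
                "requiredDuringSchedulingIgnoredDuringExecution": [
                    {
                        # Match other pods carrying the same app label...
                        "labelSelector": {"matchLabels": {"app": app_label}},
                        # ...and treat each node (hostname) as its own domain.
                        "topologyKey": "kubernetes.io/hostname",
                    }
                ]
            }
        },
        "containers": [{"name": app_label, "image": image}],
    }


# Hypothetical names, for illustration only.
spec = anti_affinity_pod_spec("cpu-heavy-worker", "example.invalid/worker:1.0")
print(json.dumps(spec, indent=2))
```

With a rule like this, the CPU-heavy pods get spread across nodes instead of being stacked on one, which is the behaviour the story describes as the fix. A softer `preferredDuringSchedulingIgnoredDuringExecution` rule would be the alternative if there might be fewer nodes than pods.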

Skyscanner, I don't know if you know Skyscanner or not. It's one of these apps where you can track airplanes and see what's going on, quite a cool app. Well, in this one, a new flag was available in the templating. They implemented it, and it just rescheduled all their pods. This is quite major. One flag change basically took down their production environment, pretty bad. They didn't catch it in testing. They did all their research on it to see what this flag was actually doing, and it slipped through the cracks. They fixed it. They actually opened up a GitHub issue and pull request, and they fixed it, they documented it, and they detailed it out.

As I was looking through these issues and I was researching everything, GitHub actually went offline. I don't know if it's a coincidence or not, but quite interesting.

Nonetheless, we can learn a lot from these postmortems. I'm a big fan of postmortems. It comes from the SRE world, Site Reliability Engineering from Google, and how we learn from these, and it's from others. When people share their stories, we understand. Hey, when this happens, this is what we should look for to resolve these issues.

Zalando, actually Henning Jacobs is from Zalando. He actually wrote up his own failure story, which was pretty recent, where CoreDNS went offline. A single application starts querying DNS, and every DNS lookup resulted in 10 queries, so you can see it cascaded overnight. It just kept on going further and further down the rabbit hole, once again, because it just exponentially starts taking DNS services offline. One application was basically doing a denial of service on their entire cluster. Their fix was basically better monitoring and more isolation of the services.
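One way a single lookup can turn into 10 queries is resolver search-list expansion, sketched below in Python. The episode does not say what caused the multiplier, so treat this as an illustration under assumptions: an `ndots` setting of 5, four search domains (the names here are hypothetical), and one A plus one AAAA query per candidate name. The function `expand_queries` is made up for this sketch and only lists the queries a resolver could send; it does no networking.

```python
def expand_queries(name, search_domains, ndots=5, record_types=("A", "AAAA")):
    """List the (query name, record type) pairs one lookup can produce."""
    if name.endswith("."):
        # Fully qualified name: the search list is skipped entirely.
        candidates = [name]
    else:
        absolute = name + "."
        searched = [f"{name}.{domain}." for domain in search_domains]
        if name.count(".") >= ndots:
            # Enough dots: try the name as given first, then the search list.
            candidates = [absolute] + searched
        else:
            # Too few dots: walk the search list first, the bare name last.
            candidates = searched + [absolute]
    return [(qname, rtype) for qname in candidates for rtype in record_types]


# Hypothetical search list for a pod in the "default" namespace.
SEARCH = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "eu-central-1.compute.internal",
]

queries = expand_queries("example.com", SEARCH)
for qname, rtype in queries:
    print(rtype, qname)
print(len(queries))  # 5 candidate names x 2 record types = 10
```

Under these assumptions an external name only resolves on the last candidate, so all 10 queries go out for every lookup. Writing the name with a trailing dot, as in `expand_queries("example.com.", SEARCH)`, cuts that to 2.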

I highly recommend you take a look at these Kubernetes failure stories even if you don't use Kubernetes. It's great how they write these postmortems, how they document everything and keep everything public. Some companies even actually make a GitHub issue and publish the issue and how it's actually running. That's quite interesting, as well. Give some feedback. Look at it, and learn from others' failures and how we can better work as a community, learning from our failures and communicating to the rest of the community how to do better.

That's all for this episode. We'll see you next time.
© 2020 TheByte