How to Love Kubernetes and Not Wreck The Planet Part II: Revenge of the Zombies

Nov 3, 2020

What are the climate impacts of our technology habits? Part I of this blog discusses the importance of multi-tenancy, utilisation, and elasticity. So what happens if we establish good multi-tenancy patterns, create highly elastic pods and clusters, and keep our utilisation high? Is this a guaranteed climate win?

It all depends on what’s running in those pods. No matter how well-packed pods are, no matter how clean the multi-tenancy, no matter how admirable the utilisation metrics, if the workload isn’t useful, the whole thing is waste.

But surely no one would run useless applications? In fact, probably almost all of us do. Our industry has a horrible problem with zombie workloads: workloads whose value has ended, but which continue to shuffle digitally along, consuming resources. How bad is the problem? Anecdotally, every time I talk to people about zombie workloads, a couple of them come back to tell me they went away and turned off a few unneeded servers. More formally, one 2017 report surveyed 16,000 physical and virtual machines and concluded a quarter of them were doing no useful work.

Why are there servers hanging around if they aren’t doing useful work? People forget to turn them off.

Zombies are invisible, but they’re not innocuous. For one thing, they cost money. Even if you don’t care about the money (or you’re coasting along on a cloud provider’s free tier), those zombies are using electricity. 80% of the US’s energy comes from burning fossil fuels. Data centres are more likely to be renewable, but some data centre hotspots, like Virginia, are tied to fossil fuel supplies. Offsetting that carbon helps, but it’s not as good as not wasting energy in the first place.

The cloud makes almost everything better, and Kubernetes is great, so is Kubernetes zombie-proof? I don’t have hard evidence either way, but I suspect it might actually be extra-vulnerable to zombies. The cloud makes it delightfully easy to provision compute capability, but it doesn’t help us remember to turn off the things we provisioned.

Finding the zombies

To mitigate this frictionless server creation, some organisations are putting up barriers to provisioning, which basically means making the cloud less cloud-y. A workload that never got created can’t turn into a zombie. These heavy governance systems make me sad; they often have the effect of suppressing learning, experimentation, and innovation. In my experience the best governance models are ones that make the right thing to do the easiest thing to do.

The zombie committee meeting

To manage the zombies which did sneak through provisioning gates, many organisations resort to manual zombie-hunting. One of my least-favourite techniques is the long meeting. I spent a tedious three hours with the CIO of a UK bank as his team combed through their cloud estate, trying to figure out what was actually being used and what was waste. (In case he’s reading this, I’m sure the CIO was really lovely, it was just the subject of the meeting which was boring.)

In hindsight, I was lucky that it was only a three-hour meeting; some organisations can work for years sifting through their estate.

Tags

Tagging workloads is slightly more effective, but only slightly. It requires several things to happen, and most of them are manual: people need to remember to tag their workloads on creation, people need to regularly go through the estate inspecting tags and deleting unneeded workloads, and everyone needs to agree on a scheme for the tags which is rich enough to allow waste workloads to be unambiguously identified.
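
To make the "inspecting tags" step a little less manual, here is a minimal sketch of an audit that reports which workloads are missing required tags. The tag scheme (`owner`, `purpose`, `review-by`) and the workload names are hypothetical, chosen only for illustration; a real estate would need whatever scheme the organisation has agreed on, and the tag data would come from the cloud provider or cluster rather than a hard-coded dictionary.

```python
# Hypothetical tag scheme: these keys are an assumption, not a standard.
REQUIRED_TAGS = ("owner", "purpose", "review-by")


def missing_tags(tags: dict) -> list:
    """Return the required tag keys that are absent or blank."""
    return [key for key in REQUIRED_TAGS if not tags.get(key, "").strip()]


def untagged_workloads(workloads: dict) -> dict:
    """Map each non-compliant workload name to the tags it is missing."""
    report = {}
    for name, tags in workloads.items():
        gaps = missing_tags(tags)
        if gaps:
            report[name] = gaps
    return report


if __name__ == "__main__":
    # Illustrative data only.
    estate = {
        "billing-api": {"owner": "payments", "purpose": "prod", "review-by": "2021-01-01"},
        "test-vm-7": {"owner": ""},
    }
    for name, gaps in untagged_workloads(estate).items():
        print(f"{name} is missing tags: {', '.join(gaps)}")
```

A report like this only identifies workloads nobody has claimed; someone still has to decide what to do with them, which is why tagging on its own helps only slightly.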

It turns out some light interventions can have a big impact on zombie waste. For example, a colleague of mine implemented a simple lease system at a bank. Things could be provisioned easily but they were automatically removed after two weeks. Nothing had an opportunity to become a zombie, because it would get shut down first. I think of this as the Blade Runner model for managing systems. Unlike in Blade Runner, there were lots of advance notices of deletion, and plenty of opportunities to extend leases if servers were still wanted. Usually, though, they weren’t wanted. Just shifting the burden of deletion from people to machines reduced the bank’s CPU usage by 50%.
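
The core logic of a lease system like this can be sketched in a few lines. This is not the bank's implementation, which I am not reproducing here; it is an illustration under stated assumptions. The two-week term comes from the story above, while the three-day warning window, the `Lease` class, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

LEASE_LENGTH = timedelta(days=14)    # two weeks, as in the anecdote
WARNING_WINDOW = timedelta(days=3)   # assumption: how early notices start


@dataclass
class Lease:
    name: str
    expires_at: datetime

    def extend(self, now: datetime) -> None:
        """Renew the lease for another full term, starting from now."""
        self.expires_at = now + LEASE_LENGTH


def new_lease(name: str, now: datetime) -> Lease:
    """Every newly provisioned resource gets a lease by default."""
    return Lease(name, now + LEASE_LENGTH)


def lease_action(lease: Lease, now: datetime) -> str:
    """Decide what the automation should do with a resource right now."""
    if now >= lease.expires_at:
        return "delete"
    if lease.expires_at - now <= WARNING_WINDOW:
        return "warn"   # notify the owner so they can extend if needed
    return "keep"
```

The important design choice is the default: doing nothing leads to deletion, and keeping a server requires a deliberate call to `extend`.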

Another helpful technique is the “switch the lights off at night” one. I used to semi-jokingly predict that in the future, turning infrastructure off would be the new ‘lights off’. Then several people pointed out they were actually doing it, right now (with the help of automation). One team reduced their compute costs by 37% by auto-shutting their servers down out of working hours.
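
The decision at the heart of "lights off" automation is simple enough to sketch. The working hours below (08:00 to 18:00, Monday to Friday) and the function names are assumptions for illustration; the actual scaling or shutdown would be done by whatever scheduler or cloud API a team already uses, which is not shown here.

```python
from datetime import datetime, time

# Assumed working hours; adjust to the team's real schedule and time zone.
WORK_START = time(8, 0)
WORK_END = time(18, 0)


def should_be_running(now: datetime) -> bool:
    """True during working hours on weekdays, False otherwise."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and WORK_START <= now.time() < WORK_END


def desired_replicas(now: datetime, normal_replicas: int) -> int:
    """Replica count a scheduler could apply: normal by day, zero at night."""
    return normal_replicas if should_be_running(now) else 0
```

A scheduled job could call `desired_replicas` every few minutes and scale the workload to match, so nobody has to remember to flip the switch.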

Many of us (myself included!) instinctively avoid throwing our servers away. One root cause is a cognitive bias which causes us to value things more highly if we helped create them. We made it, it was effort, we want to hold onto our creation. Even if we’re not unduly attached to our servers, there is always the worry that we might need a server again later (digital packrat-ing).

Even shutting things down temporarily makes us uneasy. What if we turn something off, and then we discover we still need it, and we can’t get it back to the same state? Who hasn’t been burnt trying to reconstitute a system which was supposed to be completely documented and source controlled, but which never worked properly again after being brought down and up again?

This is where Kubernetes does start to help with de-zombification. The Kubernetes-native style of declarative infrastructure lends itself nicely to infrastructure-as-code and GitOps. In a GitOps model, our ability to reconstitute an application from source control is regularly exercised, so we know it works. Once we trust the git backup and have automation to apply it, our infrastructure is disposable. We can turn systems off overnight without being nervous about what will happen in the morning.

As well as being energy efficient, this is a great way to validate disaster recovery plans. I worked with a client who had a DR requirement and was planning to stand up clusters in two different regions and fail over between them. The problem was that the client’s traffic was light, so two clusters would have been seriously wasteful. Although there was a requirement for DR, there was no ‘always-on’ requirement. Since a 20-minute downtime was acceptable, we were able to meet the DR requirement by showing that all the infrastructure was stored as code and regularly applied to the cluster using a GitOps tool, so a cluster could be recreated from source control within that window.

Traffic monitoring has great potential to flag servers with no inbound or outbound traffic, particularly when combined with modern monitoring platforms. Almost always, if it’s not talking to anyone, it’s a zombie and we can shut it down. This kind of traffic monitoring isn’t a new capability, but it’s not yet widely used. I expect we’ll see more uptake of it in the coming years as monitoring becomes increasingly sophisticated. Multicloud management platforms are also getting more and more capable, and some are starting to include features for managing carbon footprint and costs.
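
As a rough illustration of the idea, the sketch below flags servers that showed no inbound or outbound traffic over an observation period. The `TrafficSample` shape and the server names are hypothetical; real byte counts would come from a monitoring platform, and the observation window needs to be long enough to catch workloads that only run occasionally, such as monthly batch jobs.

```python
from dataclasses import dataclass


@dataclass
class TrafficSample:
    """One monitoring reading for a server (hypothetical shape)."""
    server: str
    bytes_in: int
    bytes_out: int


def find_silent_servers(samples, known_servers):
    """Return known servers with zero total traffic across all samples.

    Servers with no samples at all are also counted as silent.
    """
    totals = {name: 0 for name in known_servers}
    for sample in samples:
        totals[sample.server] = (
            totals.get(sample.server, 0) + sample.bytes_in + sample.bytes_out
        )
    return sorted(name for name, total in totals.items() if total == 0)


if __name__ == "__main__":
    # Illustrative data only.
    readings = [
        TrafficSample("web-1", 5_000, 12_000),
        TrafficSample("old-demo", 0, 0),
    ]
    print(find_silent_servers(readings, ["web-1", "old-demo", "forgotten-vm"]))
```

Since silence means a zombie "almost always" rather than always, the output is best treated as a list of candidates to review or to put on a short lease, not as an automatic kill list.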

More generally, adopting FinOps practices may help with zombie-taming. FinOps is about effectively managing cloud spend and ensuring costs flow to the right place, in real time, but I like to think of it as “figuring out who in your organisation forgot to turn off their cloud.” Fine-grained visibility and accountability are a good cure for forgetfulness. (By supporting easy application-level cost allocation, FinOps may also help relieve some of the Conway financial pressures which lead to the cluster fragmentation I discussed in Part I.)

So what should we be doing now? Audit your servers, clusters, and containers and dispose of ones that shouldn’t even be there. Implement automation to keep zombies from creeping in again; maybe a lease system to auto-delete servers, or timed power cycling would work for your use case. Invest in infrastructure-as-code so you can confidently pause systems, and invest in monitoring so you know what should be paused.

Although we have some technologies to help with zombie-hunting, this is an emerging area, and there’s lots still to do. If you’re contributing to the Kubernetes ecosystem, try to support workload visibility and disposability. Climate change is a hard problem, but avoiding waste is a first step we all should be able to take.

Written by Holly K Cummins