Recently, a few of my colleagues ran into OOMKilled issues with their application PODs when migrating from an OpenShift cluster with smaller worker nodes (in terms of CPUs and memory) to another cluster with bigger worker nodes (more CPUs and memory).
The application worked absolutely fine on the smaller cluster, yet many of the application PODs failed to start and threw OOMKilled errors when migrated to the bigger cluster. This is when some of us got pulled in to help analyse the issue with respect to OpenShift and the container runtime.
This article is a summary of the work with the hope that it might be useful for some of you.
POD OOMKilled Basics
Kubernetes provides a way to specify minimum and maximum resource requirements for containers. Here is an example of a container spec with a minimum (requests: 50Mi) and maximum (limits: 200Mi) memory requirement:

containers:
- name: memory-demo
  resources:
    requests:
      memory: "50Mi"
    limits:
      memory: "200Mi"
Note that you can specify only the minimum (requests), only the maximum (limits), or both.
When the POD has a memory ‘limit’ (maximum) defined and the POD memory usage crosses that limit, the POD gets killed and its status is reported as OOMKilled. Note that this happens even when the node has enough free memory. More details available in the following article.
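When a container has been OOM-killed, the reason is recorded in the container's last terminated state. One way to check, sketched below with `kubectl` (the pod name `memory-demo` is illustrative; the commands assume access to the cluster):

```shell
# Show the pod's status column (reports OOMKilled for the failed container)
kubectl get pod memory-demo

# Inspect the last terminated state of the first container directly
kubectl get pod memory-demo \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# 'describe' also shows the reason and exit code (137) under Last State
kubectl describe pod memory-demo
```

Exit code 137 corresponds to the container being killed with SIGKILL, which is what the kernel OOM killer sends.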
Now, in our case, the container memory limit settings worked on the smaller cluster but failed on the bigger one. Other than the size of the cluster nodes, nothing changed, yet the containers failed to even start when moved to the new cluster with bigger nodes. So what was going wrong?
Container Memory Limits and CPUs
We started with a basic experiment: the same container was run on different nodes, each with a different number of CPUs. We found that the memory usage of the same container increased with the number of CPUs on the node, and the increase was more significant on nodes with a very large number of CPUs (100+).
So what contributes to this increase? Remember, the container (workload) is the same and there is no change to the application parameters.
Let’s dig deeper and understand what contributes to container memory usage. There are three aspects that contribute to the memory usage of a container:
- per-cpu object caches used by the memory cgroup controller
- per-cpu kernel data structures
- process memory allocation
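On a node using cgroup v2, these kernel-side contributions are visible in the per-cgroup memory statistics: the `slab` and `percpu` entries in `memory.stat` account for kernel object caches and per-CPU data structures charged to the cgroup, alongside the `anon` application memory. A sketch, assuming a Linux node with cgroup v2 (the cgroup path is illustrative and varies with the container runtime and cgroup driver):

```shell
# Illustrative cgroup path; the real path depends on runtime/cgroup driver
CG=/sys/fs/cgroup/kubepods.slice

# Kernel-side charges: slab caches and per-cpu allocations
grep -E '^(slab|slab_reclaimable|slab_unreclaimable|percpu) ' "$CG/memory.stat"

# Application (anonymous) memory, for comparison
grep '^anon ' "$CG/memory.stat"
```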
In summary, container memory usage includes both application memory usage and kernel memory usage. The application memory usage is largely independent of the number of CPUs; the kernel memory usage, however, has a direct relationship to the number of CPUs. Consequently, if a container’s memory limit was sized on a node with fewer CPUs, it won’t be sufficient when the same container runs on a node with more CPUs, because of the additional kernel memory that gets allocated for the extra CPUs.
This effect is even stronger on CPU architectures with large page sizes. For example, the Power CPU architecture uses a 64K page size compared to the 4K page size used by Intel, so the increase in container memory usage with a large number of CPUs is more pronounced on Power.
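The page size a node uses can be checked quickly with `getconf`; 4096 is typical on Intel/x86_64 nodes, while Power (ppc64le) nodes report 65536:

```shell
# Print the kernel page size in bytes
getconf PAGESIZE
```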
Following are the available solutions to handle the OOMKilled issues:
- You’ll need to size the container workload for different node configurations when using memory limits. Unfortunately, there is no formula that can be applied to calculate the rate of increase in container memory usage with an increasing number of CPUs on the node.
- One of the kernel tunables that can help reduce the memory usage of containers is slub_max_order. A value of 0 (the default is 3) can help bring down the overall memory usage of the container but can have negative performance implications for certain workloads. It’s advisable to benchmark the container workload with this tunable.
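On OpenShift, a kernel boot parameter such as slub_max_order can be rolled out to worker nodes with a MachineConfig. A minimal sketch (the object name is illustrative; applying it triggers a rolling reboot of the matching nodes):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-slub-max-order   # illustrative name
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
    - slub_max_order=0
```

Once the nodes come back up, the setting can be verified by checking /proc/cmdline on a node.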
There is also a new memory cgroup slab controller in the works which shows a lot of potential: https://www.phoronix.com/scan.php?page=news_item&px=Slab-Controller-Improvements-V6