
How to manage page cache resources when running Kafka in Kubernetes

I've been running Kafka on Kubernetes without any major issue for a while now; however, I recently introduced a cluster of Cassandra pods and started having performance problems with Kafka.

Even though Cassandra doesn't use the page cache the way Kafka does, it does make frequent writes to disk, which presumably affects the kernel's underlying cache.

I understand that Kubernetes pods manage memory resources through cgroups, which can be configured by setting memory requests and limits, but I've noticed that Cassandra's use of the page cache can increase the number of page faults in my Kafka pods even when they don't seem to be competing for resources (i.e., there's memory available on their nodes).

In Kafka, more page faults lead to more writes to disk, which hampers the benefits of sequential IO and compromises disk performance. If you use something like AWS's EBS volumes, this will eventually deplete your burst balance and cause catastrophic failures across your cluster.

My question is: is it possible to isolate page cache resources in Kubernetes, or somehow let the kernel know that pages owned by my Kafka pods should be kept in the cache longer than those in my Cassandra pods?

I thought this was an interesting question, so here are some findings from a bit of digging.

Best guess: there is no way to do this with k8s out of the box, but enough tooling is available that it could be a fruitful area for research and development of a tuning and policy application that could be deployed as a DaemonSet.

Findings:

Applications can use the fadvise() system call to provide guidance to the kernel regarding which file-backed pages are needed by the application and which are not and can be reclaimed.

http://man7.org/linux/man-pages/man2/posix_fadvise.2.html
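As a minimal sketch of the idea (assuming Go with the golang.org/x/sys/unix package; the Cassandra file path is a hypothetical example, and error handling is kept deliberately simple):

    // Sketch: advise the kernel that a data file does not need to stay
    // in the page cache. FADV_DONTNEED only drops clean pages, so dirty
    // pages would need to be written back (fsync) first.
    package main

    import (
        "log"

        "golang.org/x/sys/unix"
    )

    func main() {
        // Hypothetical path to an SSTable-style data file.
        fd, err := unix.Open("/var/lib/cassandra/data/example-Data.db", unix.O_RDONLY, 0)
        if err != nil {
            log.Fatal(err)
        }
        defer unix.Close(fd)

        // Offset 0 and length 0 cover the whole file; FADV_DONTNEED asks
        // the kernel to drop the corresponding pages from the page cache.
        if err := unix.Fadvise(fd, 0, 0, unix.FADV_DONTNEED); err != nil {
            log.Fatal(err)
        }
    }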

Applications can also use O_DIRECT to attempt to avoid the use of page cache when doing IO:

https://lwn.net/Articles/457667/
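A minimal sketch of what O_DIRECT looks like in practice (again assuming Go with golang.org/x/sys/unix; the file path and the 4 KiB alignment are assumptions, since the required alignment depends on the device's logical block size):

    // Sketch: open a file with O_DIRECT so the write bypasses the page cache.
    // O_DIRECT requires the buffer address, file offset, and length to be
    // aligned; a 4 KiB alignment is assumed here.
    package main

    import (
        "log"
        "os"
        "unsafe"

        "golang.org/x/sys/unix"
    )

    const alignment = 4096

    // alignedBlock returns a size-byte slice whose start address is aligned.
    func alignedBlock(size int) []byte {
        buf := make([]byte, size+alignment)
        off := alignment - int(uintptr(unsafe.Pointer(&buf[0]))%alignment)
        return buf[off : off+size]
    }

    func main() {
        f, err := os.OpenFile("/tmp/direct-io-example", os.O_CREATE|os.O_WRONLY|unix.O_DIRECT, 0o644)
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        block := alignedBlock(alignment) // one aligned 4 KiB block of zeros
        if _, err := f.Write(block); err != nil {
            log.Fatal(err)
        }
    }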

There is some indication that Cassandra already uses fadvise in a way that attempts to optimize for reducing its page cache footprint:

http://grokbase.com/t/cassandra/commits/122qha309v/jira-created-cassandra-3948-sequentialwriter-doesnt-fsync-before-posix-fadvise

There is also some recent (Jan 2017) research from Samsung on patching Cassandra and fadvise in the kernel to better utilize multi-stream SSDs:

http://www.samsung.com/us/labs/pdfs/collateral/Multi-stream_Cassandra_Whitepaper_Final.pdf

Kafka is page cache architecture aware, though it doesn't appear to use fadvise directly. The knobs available from the kernel are sufficient for tuning Kafka on a dedicated host (a minimal sketch of applying them follows the list below):

  • vm.dirty* for guidance on when to get written-to (dirty) pages back onto disk
  • vm.vfs_cache_pressure for guidance on how aggressively to reclaim the memory used for caching filesystem objects (dentries and inodes) relative to the page cache
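For reference, a minimal sketch of how a privileged DaemonSet-style tuner might apply such knobs by writing to /proc/sys (the specific values below are illustrative assumptions, not recommendations, and the process needs /proc/sys writable on the host):

    package main

    import (
        "log"
        "os"
        "path/filepath"
        "strings"
    )

    // setSysctl maps a knob name like "vm.dirty_background_ratio" to its
    // /proc/sys path and writes the desired value.
    func setSysctl(name, value string) error {
        path := filepath.Join("/proc/sys", strings.ReplaceAll(name, ".", "/"))
        return os.WriteFile(path, []byte(value), 0o644)
    }

    func main() {
        // Illustrative values only; real values depend on the workload and disks.
        knobs := map[string]string{
            "vm.dirty_background_ratio": "5",   // start background writeback earlier
            "vm.dirty_ratio":            "20",  // cap dirty pages before writers block
            "vm.vfs_cache_pressure":     "100", // default reclaim pressure for fs object caches
        }
        for name, value := range knobs {
            if err := setSysctl(name, value); err != nil {
                log.Fatalf("setting %s: %v", name, err)
            }
        }
    }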

Support in the kernel for device-specific writeback threads goes way back to the 2.6 days:

https://www.thomas-krenn.com/en/wiki/Linux_Page_Cache_Basics

Cgroups v1 and v2 focus on pid-based IO throttling, not file-based cache tuning:

https://andrestc.com/post/cgroups-io/
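To illustrate what that throttling looks like, here is a minimal sketch of capping write bandwidth for a cgroup v2 group via its io.max file (assuming cgroup v2 is mounted at /sys/fs/cgroup, a hypothetical cgroup named kafka-test already exists, and 8:0 is the block device of interest):

    // Sketch: limit the example cgroup to ~10 MiB/s of writes on device 8:0.
    // This throttles IO for the processes in the cgroup; it does not tune
    // which files stay in the page cache.
    package main

    import (
        "log"
        "os"
    )

    func main() {
        limit := []byte("8:0 wbps=10485760\n")
        if err := os.WriteFile("/sys/fs/cgroup/kafka-test/io.max", limit, 0o644); err != nil {
            log.Fatal(err)
        }
    }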

That said, the old linux-ftools set of utilities has a simple example of a command-line knob for use of fadvise on specific files:

https://github.com/david415/linux-ftools
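Those utilities also include a fincore-style check of which pages of a file are resident in the page cache; the same idea can be sketched with mmap + mincore (assuming Go with golang.org/x/sys/unix; the Kafka segment path is a hypothetical example):

    // Sketch: report how many pages of a file are currently resident in the
    // page cache, via mmap + mincore.
    package main

    import (
        "fmt"
        "log"
        "os"

        "golang.org/x/sys/unix"
    )

    func main() {
        path := "/var/lib/kafka/data/topic-0/00000000000000000000.log"
        f, err := os.Open(path)
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        info, err := f.Stat()
        if err != nil {
            log.Fatal(err)
        }
        size := int(info.Size())
        if size == 0 {
            fmt.Println("empty file")
            return
        }

        // Map the file read-only; mincore then reports residency per page.
        data, err := unix.Mmap(int(f.Fd()), 0, size, unix.PROT_READ, unix.MAP_SHARED)
        if err != nil {
            log.Fatal(err)
        }
        defer unix.Munmap(data)

        pageSize := os.Getpagesize()
        vec := make([]byte, (size+pageSize-1)/pageSize)
        if err := unix.Mincore(data, vec); err != nil {
            log.Fatal(err)
        }

        resident := 0
        for _, b := range vec {
            if b&1 == 1 { // low bit set means the page is in core
                resident++
            }
        }
        fmt.Printf("%d of %d pages resident\n", resident, len(vec))
    }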

So there's enough there. Given specific Kafka and Cassandra workloads (e.g. read-heavy vs write-heavy), specific prioritizations (Kafka over Cassandra or vice versa), and specific IO configurations (dedicated vs shared devices), one could arrive at a specific tuning model, and those models could be generalized into a policy model.
