
Quickly find last N occurrences of string in huge file in Linux

I'm working with an application that generates gigantic log files (2.5GB per day). Occasionally, I need to gather info about the state of the app by searching through the log for specific strings.

This is running on a small CentOS Linux system, and since it's a production environment I want to minimize the CPU load of this type of search.

What is the most efficient way to find the last 50 occurrences of a string in a large file? The fastest I was able to come up with is this:

tac file.log | grep 'some string' -m50 | tac

Is that as fast as I'm going to get, or are there better options?

Also, WHY is this fast? I expected tac to reverse the whole file, resulting in slower performance, but that does not appear to be the case.

Update:

An example scenario: say the application logs statistics about its memory utilization every 5 minutes. If I wanted to see the trends over the past hour, I would currently do something like this:

tac file.log | grep 'Memory' -m12 | tac

Answer:

What you have is good. The reason tac isn't slow is that it doesn't need to read the whole file and then reverse it. Instead, it can seek to the last byte of the file and read backward from there. And once your grep finds enough matching lines, it will stop, SIGPIPE will be raised in the first tac, and the remainder of the input file need not be read at all.
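A quick way to check this behavior yourself (my own sketch, not part of the original answer; file.log and 'some string' are the placeholders from the question):

time grep 'some string' file.log | tail -n50        # naive forward scan: reads the entire 2.5GB file
time tac file.log | grep -m50 'some string' | tac   # stops as soon as 50 matches are found near the end

yes | head -n1                                      # yes would print forever, but exits via SIGPIPE once head closes the pipe

When the matches cluster near the end of the log, the second pipeline should finish much sooner, because tac reads the file backward in blocks and grep's -m50 limit shuts the whole pipeline down early. The yes | head one-liner demonstrates the same SIGPIPE mechanism in isolation.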
