
Best way to search multiple files for keywords efficiently in python 3.x?

Sorry if this has been asked before, but I didn't seem to find a solution to my problem.

I have around 500 text files, each around 5-6 kB in size. I need to search every file and check if a particular keyword is present in it, and print the details of every file in which the keyword is present.

I can do this using

for path in glob.glob("*"):
    with open(path) as f:          # and then search for the keyword inside the file
        if keyword in f.read():
            print(path)

I'm sure this isn't the most efficient way to do this. What better way is there?

If you want all *.c files in your directory which include the stdio.h file, you could do

grep "stdio\.h" *.c

(note - edited to respond to @Wooble's comment.)

The result might look like this

myfile.c: #include <stdio.h>
thatFile.c: #include <stdio.h>

etc.

If you want to see "context" (e.g. the line before and after the match), use the -C flag:

grep -C1 "(void)" *.c

result:

scanline.c-
scanline.c:int main(void){
scanline.c-  double sum=0;
--
tour.c-
tour.c:int main(void) {
tour.c-int *bitMap;

etc.

I think this should work well for you.

Again, addressing @Wooble's other point: if you really want to do this with Python, you could use

import subprocess

p = subprocess.Popen('grep stdio *.c', shell=True,
                     stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
for line in p.stdout:
    # p.stdout yields bytes in Python 3, so decode before printing
    print(line.decode(), end='')
retval = p.wait()

Now you have access to the output "in Python" and can do clever things with the lines as you see fit.
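For example (a minimal sketch, assuming grep is on the PATH and that filenames contain no colons; the stdio pattern is just illustrative), you could split grep's filename:match output into pairs:

import subprocess

result = subprocess.run('grep stdio *.c', shell=True,
                        capture_output=True, text=True)
for line in result.stdout.splitlines():
    filename, _, match = line.partition(':')  # grep prints "file:matched line"
    print(f"{filename} -> {match.strip()}")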

grep isn't always an option. If you're writing a Python script to be used in a work environment, and that environment happens to be primarily Windows, then you're biting off dependency management for your team when you tell them they need to have grep installed. That's no good.

I haven't found anything faster than glob for searching the filesystem, but there are ways to speed up searching through your files. For example, if you know your files are going to have a lot of short lines (like json or xml files for example) you could skip looking at any lines that are shorter than your smallest keyword, as in the sketch below.
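A rough sketch of that idea (hedged: the keyword, the *.c pattern, and the error handling are placeholder choices, not part of the original answer):

import glob

keyword = "stdio"        # placeholder keyword
min_len = len(keyword)   # a line shorter than this cannot contain it

for path in glob.glob("*.c"):
    with open(path, errors="ignore") as f:
        for line in f:
            if len(line) < min_len:
                continue              # skip lines too short to match
            if keyword in line:
                print(path, line, end="")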

The regex library in Python is pretty slow, as well. It can be much faster, by an order of magnitude or more, to scan each line one character at a time, checking line[i : i + len(str_to_search_for)] == str_to_search_for at each offset, than to run a regex on each line. (The built-in str_to_search_for in line test performs the same scan in C and is faster still.)
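A quick way to check this on your own data (a hedged sketch using the standard timeit module; the sample line and keyword are made up):

import re
import timeit

line = "some fairly typical line of text with stdio mentioned in it"
keyword = "stdio"
pattern = re.compile(re.escape(keyword))

# time the plain-substring test against a precompiled regex search
t_plain = timeit.timeit(lambda: keyword in line, number=1_000_000)
t_regex = timeit.timeit(lambda: pattern.search(line), number=1_000_000)
print(f"plain 'in': {t_plain:.3f}s  regex: {t_regex:.3f}s")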

I've been doing quite a bit of searching on the filesystem lately, and for a data set of 500GB my searches started at about 8 hours and I managed to get them down to 3 using simple techniques like these. It takes some time because you are tailoring your strategy to your use case, but you can squeeze a lot of speed out of python if you do so.
