
Python - How to search a string in a large file

I have a large file that can contain strings like file_+0.txt, file_[]1.txt, file_~8.txt, etc.

I want to find which file_*.txt numbers are missing, up to a certain number.

For example, if I give it the file below and the number 5, it should report that the missing ones are 1 and 4.

asdffile_[0.txtsadfe
asqwffile_~2.txtsafwe
awedffile_[]2.txtsdfwe
qwefile_*0.txtsade
zsffile_+3.txtsadwe

I wrote a Python script that takes a file path and a number and reports all file numbers that are missing below that number.

My program works for small files. But when I give it a large file (12MB) that can have file numbers up to 10000, it just hangs.

Here is my current Python code:

#!/usr/bin/env python
import mmap
import re

def main():
    filePath = input("Enter file path: ")
    endFileNum = input("Enter end file number: ")
    print(filePath)
    print(endFileNum)
    filesMissing = []
    filesPresent = []
    f = open(filePath, 'rb', 0)
    s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for x in range(int(endFileNum)):
        myRegex = r'(.*)file(.*)' + re.escape(str(x)) + r'\.txt'
        myRegex = bytes(myRegex, 'utf-8')
        if re.search(myRegex, s):
            filesPresent.append(x)
        else:
            filesMissing.append(x)
    #print(filesPresent)
    print(filesMissing)

if __name__ == "__main__":
    main()

The output hangs when I give it a 12MB file that can have file numbers from 0 to 9999:

$python findFileNumbers.py
Enter file path: abc.log
Enter end file number: 10000

Output for a small file (the same as the example above):

$python findFileNumbers.py
Enter file path: sample.log
Enter end file number: 5
[0, 2, 3]
[1, 4]
  1. How can I make this work for big files?
  2. Is there a better way to get these results than a Python script?

Thanks in advance!

First collect the existing numbers in a set, and then look for the missing ones.

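import re

filepath = 'sample.log'   # path to the input file (example value)
end_file_num = 5          # exclusive upper bound (example value)

# Non-greedy .*? so that (\d+) captures the whole number, not just its last digit
my_regex = re.compile(r'file.*?(\d+)\.txt')
present_ones = set()
with open(filepath) as f:
    for line in f:
        match = my_regex.search(line)
        if match:
            present_ones.add(int(match.group(1)))
for num in range(end_file_num):
    if num not in present_ones:
        print("Missing", num)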

Yours hangs because you are going through the entire file once for each number, i.e. 12MB * 10000 = 120GB. The script is scanning up to 120GB of data, so it hangs even though the file is in an mmap.
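To illustrate that the repeated scanning, rather than mmap itself, is the cost, here is a minimal sketch that keeps the memory map from the question but scans it only once. The helper name find_missing and its signature are hypothetical, not from the original post, and the pattern assumes each file name sits on a single line.

```python
import mmap
import re

# Hypothetical helper; the name and signature are assumptions for illustration.
def find_missing(path, end_file_num):
    # Bytes pattern, because an mmap object exposes bytes rather than str.
    pattern = re.compile(rb'file.*?(\d+)\.txt')
    present = set()
    with open(path, 'rb') as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as s:
        # A single pass over the mapping collects every number that occurs.
        for match in pattern.finditer(s):
            present.add(int(match.group(1)))
    # Numbers below end_file_num that never appeared in the file.
    return [n for n in range(end_file_num) if n not in present]
```

Called as find_missing('sample.log', 5) on the sample from the question, this should report the same missing numbers as the original script while reading the data only once.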

I would suggest that you simply read through the input file line by line and parse each line for its file number. Then use that file number as an index into a boolean array that is initially set to False.

You don't do any processing that requires the whole file to be in memory. This approach will work for very large files.

#~ import mmap
import re
import numpy as np

def main():
    #~ filePath = input("Enter file path: ")
    filePath = 'filenames.txt'
    #~ endFileNum = input("Enter end file number: ")
    endFileNum = 5
    print(filePath)
    print(endFileNum)
    found = np.zeros(1+endFileNum, dtype=bool)
    patt = re.compile(r'[^\d]+(\d+)')
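    with open(filePath) as f:
        for line in f:  # iterate lazily instead of loading every line with readlines()
            m = patt.search(line)
            # skip lines without a number and numbers outside the array
            if m and int(m.group(1)) <= endFileNum:
                found[int(m.group(1))] = True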
    print (found)

    #~ filesMissing = []
    #~ filesPresent = []
    #~ files = np.zeros[endFileNum, dtype=bool]
    #~ f = open(filePath, 'rb', 0)
    #~ s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    #~ for x in range(int(endFileNum)):
        #~ myRegex = r'(.*)file(.*)' + re.escape(str(x)) + r'\.txt'
        #~ myRegex = bytes(myRegex, 'utf-8')
        #~ if re.search(myRegex, s):
            #~ filesPresent.append(x)
        #~ else:
            #~ filesMissing.append(x)
    #print(filesPresent)
    #~ print(filesMissing)

if __name__ == "__main__":
    main()

This produces the following result, from which your filesPresent and filesMissing are easily recovered.

filenames.txt
5
[ True False  True  True False False]
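As a small sketch of that recovery step, assuming the printed array is available as a plain list of booleans (the names files_present and files_missing are illustrative): the array above has endFileNum + 1 entries, so the range below stops at endFileNum to match the range used in the question.

```python
end_file_num = 5
# The boolean array printed above, written out as a plain list.
found = [True, False, True, True, False, False]

# Indices flagged True were seen in the file; the rest are missing.
files_present = [n for n in range(end_file_num) if found[n]]
files_missing = [n for n in range(end_file_num) if not found[n]]

print(files_present)  # [0, 2, 3]
print(files_missing)  # [1, 4]
```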

Let's take a look at what you are actually doing here:

  1. Memory map the file.
  2. For each number:

    a. Compile a regular expression for that number.
    b. Search for the regular expression in the entire file.

This is very inefficient for large numbers. While memory mapping gives you a string-like interface to the file, it is not magic. You still have to load chunks of the file to move around within it. At the same time, you are making a pass, potentially over the entire file, for each regex. And regex matching is expensive as well.

The solution here is to make a single pass through the file, line by line. Pre-compile one regular expression that captures any number, instead of compiling a new one for each number you search for. To get all the numbers in a single pass, you could make a set of all the numbers below the one you want, called "missing", and an empty set called "found". Whenever you encounter a line with a number, you move that number from "missing" to "found".

Here is a sample implementation:

import re

filePath = input("Enter file path: ")
endFileNum = int(input("Enter end file number: "))
missing = set(range(endFileNum))
found = set()
# Compile once; the reluctant .*? lets (\d+) capture the whole number
regex = re.compile(r'file_.*?(\d+)\.txt')
with open(filePath) as file:
    for line in file:
        for match in regex.finditer(line):
            num = int(match.group(1))  # group(1) is the captured digits
            if num < endFileNum:
                found.add(num)
missing -= found
print(sorted(missing))

Notice that the regular expression uses the reluctant quantifier .*? after file_. This matches as few characters as possible before looking for a digit. If you used the default greedy quantifier .* instead, multiple numbers on one line would match only the last one.
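The difference can be checked with a short experiment. The input strings below are made up for illustration; only the two patterns come from the discussion above.

```python
import re

greedy = re.compile(r'file_.*(\d+)\.txt')   # default greedy quantifier
lazy = re.compile(r'file_.*?(\d+)\.txt')    # reluctant quantifier

# Two file names on one line (made-up example input).
line = "xfile_+12.txt yfile_~4.txt"
greedy_hits = greedy.findall(line)
lazy_hits = lazy.findall(line)
print(greedy_hits)  # ['4']
print(lazy_hits)    # ['12', '4']

# Even with a single name, greedy .* swallows leading digits of the number.
single_greedy = greedy.search("file_+12.txt").group(1)
single_lazy = lazy.search("file_+12.txt").group(1)
print(single_greedy)  # '2'
print(single_lazy)    # '12'
```

So the greedy form not only drops earlier names on the same line, it can also truncate a multi-digit number to its last digit.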
