Return the exact lines of a huge file after pattern matching without using FOR in Python 3
I am new to Python. My problem is this: I want to match a pattern against a large file and return the matching lines (not just the matched string) from it. I do NOT want a FOR loop for this, as my file is huge. I am using mmap for reading the file.
In the above file, if I search for bhuvi, I should get 2 rows: bhuvi and bhuvi Kumar.
I used re.findall() for this, but it just returns the substrings, not the whole lines.
Can someone please suggest what I can do here?
If your input file is huge, you cannot use readlines, but nothing prevents you from reading one line at a time in a loop.
As the file object is iterable, you can write the loop as:
for line in fh:
and process the content of the input line inside the loop.
The file size does not matter, as you never attempt to read all the lines at once.
To check for the presence of your string (bhuvi) in a line, use re.search, not re.findall. You don't actually need a list of matches; finding a single match is enough (and it is quicker).
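A small illustration of the difference, as a sketch on an in-memory sample rather than a real file (the sample text is made up for this example):

```python
import re

# Hypothetical sample data standing in for the file contents.
text = "bhuvi\nkumar\nbhuvi Kumar\nravi\n"

# re.findall returns only the matched substrings, not the lines.
substrings = re.findall('bhuvi', text)

# re.search answers "does this line match?", so the whole line can be kept.
matching_lines = [line for line in text.splitlines() if re.search('bhuvi', line)]

print(substrings)
print(matching_lines)
```

Because re.search stops at the first hit in a line, it does no more work than needed to decide whether the line should be kept.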
Below is an example program (Python 3.7) that prints the lines containing your string, along with their line numbers:
import re
cnt = 0
with open('input.txt') as fh:
    for line in fh:
        line = line.rstrip()
        cnt += 1
        if re.search('bhuvi', line):
            print(f'{cnt}: {line}')
Note that I used rstrip() to remove the trailing newline, if any (it also strips any other trailing whitespace).
You wrote that the file to check is huge. So there is a risk that if you try to read it whole into memory, the program will run out of memory.
In such a case you would have to read the file chunk by chunk and search each chunk separately.
There is also a risk that a row containing the text you are looking for will be split, with part of it read in one chunk and the rest in the next, so your program has to guard against this.
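One way to handle the split-line problem is to carry the unfinished tail of each chunk over to the next one. The following is only a sketch under assumptions: the function name matching_lines, the chunk_size default, and the choice of a text-mode file are illustrative and not part of the original answer.

```python
import re

def matching_lines(path, pattern, chunk_size=1 << 20):
    """Yield lines of the file at `path` that contain `pattern`, reading in chunks."""
    regex = re.compile(pattern)
    leftover = ''  # incomplete line carried over from the previous chunk
    with open(path) as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            lines = (leftover + chunk).split('\n')
            # The last element is either '' (the chunk ended on a newline)
            # or a partial line that the next chunk will complete.
            leftover = lines.pop()
            for line in lines:
                if regex.search(line):
                    yield line
    # The file may not end with a newline, so check what is left.
    if leftover and regex.search(leftover):
        yield leftover
```

Called as, for example, list(matching_lines('input.txt', 'bhuvi')), it keeps at most one chunk plus one partial line in memory at a time.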
On the other hand, if there is no other way but to use mmap, try something like re.finditer(rb'[^\n]*bhuvi[^\n]*', map), where map is your mmap object (the pattern has to be a bytes pattern, because mmap exposes bytes). This creates an iterator looking for a possibly empty run of characters other than a newline, then your string, then another such run.
This way the match object returned by the iterator will match the whole line, not your string alone.
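A minimal sketch of the mmap approach follows. The function name matching_lines_mmap and the UTF-8 encoding are assumptions made for the example, and the variable is called mm to avoid shadowing the built-in map.

```python
import mmap
import re

def matching_lines_mmap(path, word):
    """Return whole lines of the file at `path` that contain `word`, via mmap."""
    # mmap exposes bytes, so the pattern has to be a bytes pattern too.
    # Assumption: the file is UTF-8 encoded (the default for encode/decode).
    pattern = re.compile(rb'[^\n]*' + re.escape(word.encode()) + rb'[^\n]*')
    with open(path, 'rb') as fh:
        # Length 0 maps the whole file; note that an empty file cannot be mapped.
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Decode each match while the mapping is still open.
            return [m.group().decode() for m in pattern.finditer(mm)]
```

If the file uses Windows line endings, each returned line keeps its trailing '\r', since only '\n' is excluded by the pattern.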