
re.findall regex hangs or is very slow

My input file is a large text file of concatenated texts that I got from an open text library. I am trying to extract only the content of each book and filter out everything else, such as disclaimers. The file holds around 100 documents (around 50 MB in total).

I have identified the start and end markers of the contents themselves, and decided to use a Python regex to find everything between the start and end marker. To sum it up, the regex should look for the start marker, match everything after it, stop once the end marker is reached, and then repeat these steps until the end of the file is reached.

The following code works flawlessly when I feed a small, 100 KB file into it:

import codecs
import re

outfile = codecs.open("outfile.txt", "w", "utf-8-sig")
inputfile = codecs.open("infile.txt", "r", "utf-8-sig")
filecontents = inputfile.read()
for result in re.findall(r'START\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK.*?\n(.*?)END\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK', filecontents, re.DOTALL):
    outfile.write(result)
outfile.close()

When I use this regex operation on my larger file however, it will not do anything, the program just hangs. I tested it overnight to see if it was just slow and even after around 8 hours the program was still stuck.

I am very sure that the source of the problem is the (.*?) part of the regex, in combination with re.DOTALL. When I use a similar regex on smaller distances, the script runs fine and fast. My question now is: why does this freeze everything? I know the texts between the delimiters are not small, but a 50 MB file shouldn't be too much to handle, right? Am I maybe missing a more efficient solution?

Thanks in advance.

You are correct in thinking that the sequence .*, which appears more than once, is causing the problem. The regex engine tries many possible ways of dividing the text between the .* groups, which leads to what is known as catastrophic backtracking.
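To make the cost visible on a small scale, here is a rough timing sketch. It uses a shortened stand-in for the question's pattern (`START` and `END` instead of the full marker text) and a hypothetical helper, `time_failed_search`, that builds a text with a start marker but no end marker. With `re.DOTALL`, every newline after the start marker is a candidate end for the first `.*?`, and for each candidate the second `.*?` scans the rest of the text looking for `END`. The time would therefore be expected to grow much faster than the input size, although exact figures depend on the machine and Python version.

```python
import re
import time

# Shortened stand-in for the question's pattern; DOTALL lets "." match newlines.
PATTERN = re.compile(r"START.*?\n(.*?)END", re.DOTALL)


def time_failed_search(n_lines):
    """Time one findall over a text that has a start marker but no end marker."""
    text = "START\n" + "some text\n" * n_lines
    began = time.perf_counter()
    result = PATTERN.findall(text)
    return result, time.perf_counter() - began


if __name__ == "__main__":
    # Doubling the number of lines should much more than double the time.
    for n_lines in (500, 1000, 2000):
        result, seconds = time_failed_search(n_lines)
        print(n_lines, result, round(seconds, 4))
```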

The usual solution is to replace the . with a much more specific character class, usually one that excludes whatever is supposed to terminate the first .* in the pattern. Something like:

`[^\n]*(.*)`

so that the capturing group can only match from the first newline to the end. Another option is to recognize that a regular expression may not be the best approach, and either to use a parsing library (such as pyparsing) or to first break the input into smaller, easier-to-digest chunks (for example, with corpus.split('\n')).
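Both suggestions can be sketched roughly as follows. The marker strings are taken from the question; the names `extract_with_regex` and `extract_by_lines` are hypothetical, and the code assumes each marker sits on a line of its own, which may not hold for every file.

```python
import re

START = "START OF THE PROJECT GUTENBERG EBOOK"
END = "END OF THE PROJECT GUTENBERG EBOOK"

# Tightened pattern: [^\n]* can only consume the rest of the start-marker line,
# so there is exactly one place where the capturing group can begin.
BODY_RE = re.compile(
    re.escape(START) + r"[^\n]*\n(.*?)" + re.escape(END),
    re.DOTALL,
)


def extract_with_regex(text):
    """Return the text between each start-marker line and the next end marker."""
    return [m.group(1) for m in BODY_RE.finditer(text)]


def extract_by_lines(lines):
    """Yield book bodies by scanning the lines once, without any regex."""
    body = None  # None means we are currently outside a book body
    for line in lines:
        if body is None:
            if START in line:
                body = []  # start collecting after the start-marker line
        elif END in line:
            yield "".join(body)
            body = None
        else:
            body.append(line)
```

The two functions differ slightly: the regex version keeps whatever precedes the end marker on its line (for example a leading `*** `), while the line scanner drops the marker lines entirely. The line scanner also accepts an open file object, so the 50 MB input never has to be read into memory at once.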

Another workaround to this issue is adding a sane limit to the number of matched characters.

So instead of something like this:

[abc]*.*[def]*

You can limit it to 1-100 instances per character group.

[abc]{1,100}.{1,100}[def]{1,100}

This won't work for every situation, but in some cases it's an acceptable quick fix.
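A small sketch of the trade-off, using made-up sample strings: the bounded pattern still matches when the pieces are within the limit, but returns no match once the middle section exceeds 100 characters. Note also that `{1,100}` requires at least one character where `*` allowed zero; `{0,100}` would keep that part of the original meaning.

```python
import re

unbounded = re.compile(r"[abc]*.*[def]*")
bounded = re.compile(r"[abc]{1,100}.{1,100}[def]{1,100}")

short_gap = "abc" + "x" * 20 + "def"   # middle section well under the limit
long_gap = "abc" + "x" * 500 + "def"   # middle section far over the limit

# Within the limit, the bounded pattern matches the whole string.
print(bounded.search(short_gap) is not None)

# Over the limit, the bounded pattern gives up instead of matching.
print(bounded.search(long_gap) is None)

# The unbounded pattern has no such restriction.
print(unbounded.search(long_gap).group() == long_gap)
```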
