Fast multiple search and replace in Python

For a single large text (~4GB), I need to search for ~1 million phrases and replace them with complementary phrases. Both the raw text and the replacements easily fit in memory. The naive solution would literally take years to finish, as a single replacement takes about a minute.

Naive solution:

# One full pass over the text per phrase; items() works on Python 2 and 3
for search, replace in replacements.items():
    text = text.replace(search, replace)

The regex method using re.sub is 10x slower:

for search, replace in replacements.items():
    text = re.sub(search, replace, text)

At any rate, this seems like a great place to use Boyer-Moore string search or Aho-Corasick, but these methods, as they are generally implemented, only work for searching the string and not also for replacing within it.

Alternatively, any tool (outside of Python) that can do this quickly would also be appreciated.

Thanks!

Outside of Python, sed is usually used for this sort of thing.

For example (taken from here), to replace the word ugly with beautiful in the file sue.txt:

sed -i 's/ugly/beautiful/g' /home/bruno/old-friends/sue.txt

You haven't posted any profiling of your code. You should try some timings before you attempt any optimization. Searching and replacing text in a 4GB file is a computationally intensive operation.
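As a rough illustration of the kind of timing meant here, the sketch below measures one str.replace pass on a sample and extrapolates. The sample text, the function name time_single_replace, and the assumption that cost scales linearly with text size and phrase count are all illustrative and not taken from the question.

```python
import timeit

def time_single_replace(text, search, replace, repeats=3):
    """Return the best wall-clock time in seconds for one str.replace pass."""
    return min(timeit.repeat(lambda: text.replace(search, replace),
                             number=1, repeat=repeats))

# Hypothetical sample; in practice, slice a few MB out of the real corpus.
sample = "the ugly duckling met another ugly duck. " * 100000
per_pass = time_single_replace(sample, "ugly", "beautiful")

# Crude extrapolation to 4GB and 1 million phrases, assuming linear scaling.
estimated_seconds = per_pass * (4e9 / len(sample)) * 1000000
print("%.4fs per pass on the sample, roughly %.0f days for the full job"
      % (per_pass, estimated_seconds / 86400))
```

Timing a sample this way shows whether the per-pass cost or the number of passes dominates before any algorithm is changed.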

ALTERNATIVE: Ask whether you should be doing this at all.

You discuss below doing an entire search and replace of the Wikipedia corpus in under 10ms. This rings some alarm bells, as it doesn't sound like great design. Unless there's an obvious reason not to, you should modify whatever code you use to present and/or load that data so that the search and replace happens as each subset of the data is loaded or viewed. It's unlikely you'll be doing many operations on the entire 4GB of data, so restrict your search and replace operations to what you're actually working on. Additionally, your timing is still very imprecise because you don't know how big the file you're working on is.
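One way to read this suggestion is to apply replacements lazily, only to the lines that are actually consumed. The sketch below assumes that phrases never span a line boundary. The names iter_replaced and fix and the in-memory stand-in for the corpus are hypothetical.

```python
import io
from itertools import islice

def iter_replaced(lines, replace_fn):
    """Lazily yield each line with replacements applied, one line at a time."""
    for line in lines:
        yield replace_fn(line)

# Hypothetical stand-ins for the real corpus file and replacement function.
corpus = io.StringIO(u"an ugly day\nnothing to change\nugly again\n")
fix = lambda line: line.replace("ugly", "beautiful")

# Only the first two lines are read and rewritten; the rest is never touched.
first_two = list(islice(iter_replaced(corpus, fix), 2))
print(first_two)
```

Any faster replacement function can be passed in as replace_fn; the generator only controls when the work happens.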

On a final point, you note that:

the speedup has to be algorithmic not chaining millions of sed calls

But you indicated that the data you're working with was a "single large text (~4GB)", so there shouldn't be any chaining involved, if I understand what you mean by that correctly.

UPDATE: Below you indicate that the operation takes 90s on a ~4KB file (I'm assuming). This seems very strange to me; sed operations don't normally take anywhere close to that. If the file is actually 4MB (I'm hoping), then it should take 24 hours to evaluate (not ideal, but probably acceptable?).

There's probably a better way than this:

re.sub('|'.join(map(re.escape, replacements)), lambda match: replacements[match.group()], text)

This does one search pass, but it's not a very efficient search. The re2 module may speed this up dramatically.
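A slightly fuller sketch of the single-pass idea follows. It escapes each phrase so it is matched literally and tries longer phrases first, since regex alternation takes the first alternative that matches. The function name replace_all and the sample phrases are illustrative only.

```python
import re

def replace_all(text, replacements):
    """Replace every key of `replacements` in one pass over `text`."""
    if not replacements:
        return text
    # Longest phrases first, so "New York City" wins over "New York".
    keys = sorted(replacements, key=len, reverse=True)
    # re.escape makes characters such as "." match literally.
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: replacements[m.group()], text)

replacements = {"New York": "NY", "New York City": "NYC", "a.b": "dot"}
result = replace_all("New York City and New York, a.b not axb", replacements)
print(result)  # NYC and NY, dot not axb
```

With around a million alternatives, compiling and running such a pattern in the standard re module may still be slow, which is where an automaton-based engine could help.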

I had this use case as well, where I needed to do ~100,000 search and replace operations on the full text of Wikipedia. Using sed, awk, or perl would take years. I wasn't able to find any implementation of Aho-Corasick that did search-and-replace, so I wrote my own: fsed. This tool happens to be written in Python (so you can hack into the code if you like), but it's packaged up as a command line utility that runs like sed.

You can get it with:

pip install fsed
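fsed's internals are not shown in this thread, so the following is an independent, minimal sketch of how Aho-Corasick search can be extended to search-and-replace, not fsed's actual code. It assumes non-empty phrases, picks the leftmost match and breaks ties by taking the longest phrase, and collects all matches in memory, which a real tool for a 4GB text would avoid. All function names are hypothetical.

```python
from collections import deque

def build_automaton(phrases):
    """Build trie transitions, failure links and outputs (phrase lengths)."""
    goto, fail, out = [{}], [0], [[]]
    for phrase in phrases:
        node = 0
        for ch in phrase:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(len(phrase))
    # Breadth-first pass: a child's failure link follows its parent's links.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit phrases that end at the failure target as well.
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out

def find_matches(text, automaton):
    """Yield (start, end) for every phrase occurrence, in one pass."""
    goto, fail, out = automaton
    node = 0
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for length in out[node]:
            yield i + 1 - length, i + 1

def ac_replace(text, replacements):
    """Replace non-overlapping matches: leftmost first, longest on ties."""
    automaton = build_automaton(replacements)
    matches = sorted(find_matches(text, automaton),
                     key=lambda m: (m[0], -m[1]))
    pieces, pos = [], 0
    for start, end in matches:
        if start < pos:
            continue  # overlaps a match that was already replaced
        pieces.append(text[pos:start])
        pieces.append(replacements[text[start:end]])
        pos = end
    pieces.append(text[pos:])
    return "".join(pieces)

table = {"he": "HE", "she": "SHE", "hers": "HERS", "his": "HIS"}
replaced = ac_replace("ushers and his", table)
print(replaced)  # uSHErs and HIS
```

The search cost is proportional to the text length plus the number of matches, independent of how many phrases are in the table.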

they are generally implemented only work for searching the string and not also replacing it

Perfect, that's exactly what you need. Searching a 4GB text with an inefficient algorithm is bad enough, but doing many in-place replacements is probably even worse: each one potentially has to move gigabytes of text to make room for the expansion or shrinkage caused by the size difference between the source and target text.

Just find the locations, then join the pieces with the replacements parts.

So a dumb analogy would be "_".join("abc".split(" ")), but of course you don't want to create copies the way split does.
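A minimal sketch of that idea: given the match locations, the output is assembled once from the untouched stretches and the replacement strings. Here the locations come from re.finditer purely for illustration; in practice they would come from whatever multi-pattern search is used. The function name splice is hypothetical, and Python slices still copy each stretch once, so this only avoids the repeated shifting, not all copying.

```python
import re

def splice(text, spans):
    """Build the output from sorted, non-overlapping (start, end, new) spans."""
    pieces, pos = [], 0
    for start, end, new in spans:
        pieces.append(text[pos:start])  # untouched text before this match
        pieces.append(new)              # replacement for text[start:end]
        pos = end
    pieces.append(text[pos:])           # tail after the last match
    return "".join(pieces)

text = "an ugly duck and an ugly swan"
# Locations found by a search step; a single phrase is used for brevity.
spans = [(m.start(), m.end(), "beautiful") for m in re.finditer("ugly", text)]
spliced = splice(text, spans)
print(spliced)  # an beautiful duck and an beautiful swan
```

Each character of the original text is copied a constant number of times, instead of once per replacement as with repeated str.replace calls.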

Note: is there any reason to do this in Python?
