
Fast multiple search and replace in Python

For a single large text (~4GB) I need to search for ~1 million phrases and replace each with a complementary phrase. Both the raw text and the replacements easily fit in memory. The naive solution would literally take years to finish, since a single replacement takes about a minute.

Naive solution:

for search, replace in replacements.items():  # .iteritems() in Python 2
    text = text.replace(search, replace)

The regex method using re.sub is ~10x slower:

for search, replace in replacements.items():
    text = re.sub(search, replace, text)

At any rate, this seems like a great place to use Boyer-Moore string search or Aho-Corasick; but as generally implemented, these methods only support searching the string, not also replacing it.

Alternatively, any tool (outside of Python) that can do this quickly would also be appreciated.

Thanks!

Outside of Python, sed is usually used for this sort of thing.

For example (taken from here), to replace the word ugly with beautiful in the file sue.txt:

sed -i 's/ugly/beautiful/g' /home/bruno/old-friends/sue.txt

You haven't posted any profiling of your code; you should try some timings before doing any premature optimization. Searching and replacing text in a 4GB file is a computationally intensive operation.
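For instance, a rough timing harness along these lines (the sample text and function name are illustrative, not from the question) shows the per-replacement cost on a small slice, which can then be extrapolated to the full 4GB input:

```python
import time

def time_replace(text, replacements):
    # Time the naive replacement loop so its cost can be extrapolated.
    start = time.perf_counter()
    for search, replace in replacements.items():
        text = text.replace(search, replace)
    return text, time.perf_counter() - start

sample = "lorem ipsum " * 10_000  # small, representative slice
result, elapsed = time_replace(sample, {"lorem": "foo", "ipsum": "bar"})
print(f"{elapsed:.4f}s for {len(sample)} chars")
```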

ALTERNATIVE ask: should I be doing this at all?

You discuss below doing an entire search and replace of the Wikipedia corpus in under 10ms. This rings some alarm bells, as it doesn't sound like great design. Unless there's an obvious reason not to, you should modify whatever code you use to present and/or load the data so that the search and replace happens as each subset of the data is loaded or viewed. It's unlikely you'll be doing many operations on the entire 4GB of data, so restrict your search and replace operations to what you're actually working on. Additionally, your timing is still very imprecise because you don't know how big the file you're working on is.

On a final point, you note that:

the speedup has to be algorithmic, not chaining millions of sed calls

But you indicated that the data you're working with is a "single large text (~4GB)", so there shouldn't be any chaining involved, if I understand you correctly.

UPDATE: Below you indicate that the operation on a ~4KB file (I'm assuming) takes 90s, which seems very strange to me - sed operations don't normally take anywhere close to that. If the file is actually 4MB (I'm hoping), then it should take about 24 hours to evaluate (not ideal, but probably acceptable?).

There's probably a better way than this:

re.sub('|'.join(replacements), lambda match: replacements[match.group()], text)

This does one search pass, but it's not a very efficient search. The re2 module may speed this up dramatically.
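A fuller sketch of this single-pass idea (the function name is mine; the re.escape call and the longest-first ordering are assumptions the one-liner above glosses over - without escaping, phrases containing regex metacharacters would be misinterpreted):

```python
import re

def replace_all(text, replacements):
    # Sort keys longest-first so overlapping phrases prefer the longest match,
    # and escape them so they are matched as literal strings, not regex syntax.
    pattern = re.compile("|".join(
        re.escape(k) for k in sorted(replacements, key=len, reverse=True)))
    # One pass over the text; each match is looked up in the dict.
    return pattern.sub(lambda m: replacements[m.group(0)], text)

print(replace_all("the quick brown fox",
                  {"quick": "slow", "brown fox": "grey wolf"}))
```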

I had this use case as well, where I needed to do ~100,000 search and replace operations on the full text of Wikipedia. Using sed, awk, or perl would take years. I wasn't able to find any implementation of Aho-Corasick that did search-and-replace, so I wrote my own: fsed. This tool happens to be written in Python (so you can hack into the code if you like), but it's packaged up as a command line utility that runs like sed.

You can get it with:

pip install fsed

they are generally implemented only work for searching the string and not also replacing it

Perfect, that's exactly what you need. Searching a 4GB text with an inefficient algorithm is bad enough, but doing several in-place replacements is probably even worse... you potentially have to move gigabytes of text to make room for the expansion/shrinking caused by the size difference between the source and target text.

Just find the locations, then join the pieces with the replacement parts.

So a dumb analogy would be "_".join( "abc".split(" ") ), but of course you don't want to create copies the way split does.
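A minimal sketch of that locate-then-join idea, simplified to a single search phrase (the function name is mine; a real implementation would locate all million phrases in one pass, e.g. with Aho-Corasick, and join slices of the original without intermediate copies):

```python
def splice_replace(text, search, replace):
    # Pass 1: collect the start position of every non-overlapping occurrence.
    positions, start = [], 0
    while (i := text.find(search, start)) != -1:
        positions.append(i)
        start = i + len(search)
    # Pass 2: rebuild the string once, joining the untouched slices
    # with the replacement - no repeated shifting of the tail.
    pieces, prev = [], 0
    for i in positions:
        pieces.append(text[prev:i])
        prev = i + len(search)
    pieces.append(text[prev:])
    return replace.join(pieces)

print(splice_replace("one two one two", "two", "2"))
```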

Note: is there any reason to do this in Python?
