
Efficient way to replace strings from a vocabulary - Python

I have a vocabulary of phrases, and I want to replace every occurrence of those phrases in another file. For example, I have the following vocabulary:

United States, New York

and I want to transform the following file:

"I work for New York but I don't even live at the United States"

into this:

"I work for New_York but I don't even live at the United_States"

Currently I'm doing it this way:

import os

def _check_files_and_write_phrases(docs, worker_num):
    print("worker ", worker_num," started!")
    for i, file in enumerate(docs):
        file_path = DOCS_FOLDER + file
        with open(file_path) as f:
            text = f.read()
            for phrase in phrases:
                text = text.replace(phrase, phrase.replace(' ','_'))
            new_doc = PHRASES_DOCS_FOLDER + file[:-4] + '_phrases.txt'
            with open(new_doc, 'w') as nf:
                nf.write(text)

    print("job done on worker ", worker_num)


docs = os.listdir(DOCS_FOLDER)

import threading

threads = []
for i in range(1, 11):
    print(i)
    start = int((len(docs)/10) * (i - 1))
    end = int((len(docs)/10) * (i))
    print(start,end)
    if i != 10:
        t = threading.Thread(target=_check_files_and_write_phrases, args=(docs[start:end], i, ))
    else:
        t = threading.Thread(target=_check_files_and_write_phrases, args=(docs[start:], i, ))
    threads.append(t)
    t.start()

for t in threads:
    t.join()

print("all workers finished!")

But it's way too slow! I thought that threads would do the job, but I was wrong...

Is there a more efficient way of doing this?

All of the phrases can be replaced using a single re.sub() call, with a pattern that is pre-compiled to speed things up a bit further:

import re

phrases = {"United States":"United_States", "New York":"New_York"}
re_replace = re.compile(r'\b({})\b'.format('|'.join(re.escape(phrase) for phrase in phrases.keys())))

def _check_files_and_write_phrases(docs, worker_num):
    print("worker {} started!".format(worker_num))

    for filename in docs:
        file_path = DOCS_FOLDER + filename

        # Read the whole document, then replace every phrase in one pass.
        with open(file_path) as f:
            text = f.read()
        text = re_replace.sub(lambda x: phrases[x.group(1)], text)

        # Drop the 4-character extension (e.g. ".txt") and write the result.
        new_doc = PHRASES_DOCS_FOLDER + filename[:-4] + '_phrases.txt'
        with open(new_doc, 'w') as nf:
            nf.write(text)

    print("job done on worker {}".format(worker_num))

This first builds a regular expression from the dictionary of phrases, which looks as follows:

\b(United\ States|New\ York)\b

The sub() method of the compiled pattern then uses the phrases dictionary to look up the required replacement. It takes two parameters, the replacement and the original text. The replacement can either be a fixed string or, as in this case, a function. The function takes a single argument, the match object, and returns the replacement text. A lambda function is used here; it simply looks up the matched text (group 1 of the match object) in the phrases dictionary.
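To make the function-based replacement concrete, here is a minimal sketch that runs the same pattern over the sentence from the question. The helper name `lookup` is only illustrative; it does what the lambda in the answer does.

```python
import re

# Vocabulary from the question, mapped to the underscore-joined replacement.
phrases = {"United States": "United_States", "New York": "New_York"}

# One alternation pattern; re.escape guards against regex metacharacters.
pattern = re.compile(r"\b({})\b".format("|".join(re.escape(p) for p in phrases)))


def lookup(match):
    # match.group(1) is the matched phrase text, which is a key in `phrases`.
    return phrases[match.group(1)]


text = "I work for New York but I don't even live at the United States"
result = pattern.sub(lookup, text)
print(result)
# I work for New_York but I don't even live at the United_States
```

The text is scanned once, however many phrases the vocabulary holds, whereas the original loop calls str.replace() once per phrase.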

Instead of doing a dictionary lookup, the function could just call replace() on the matched text, but the pre-calculated replacement text should be faster. The \b is added so that replacements are only made on word boundaries, so for example MYNew York would be skipped. Adding flags=re.I to the re.compile() call makes the search case insensitive if needed; in that case the function should build the replacement from the matched text (for example with replace()), because the dictionary keys are case sensitive and a match such as new york would not be found in them.
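The word-boundary and case-insensitive behaviour described above can be sketched as follows. The names `pattern_ci` and `join_words` are illustrative. Sorting the phrases longest first is an extra precaution, not part of the original answer: it assumes a larger vocabulary might contain one phrase that is a prefix of another.

```python
import re

phrases = ["United States", "New York"]

# Longest phrases first, so a longer phrase is tried before a shorter one
# that shares its prefix (an assumption about larger vocabularies).
ordered = sorted(phrases, key=len, reverse=True)

pattern_ci = re.compile(
    r"\b(?:{})\b".format("|".join(re.escape(p) for p in ordered)),
    flags=re.I,
)


def join_words(match):
    # Build the replacement from the matched text itself, so the original
    # casing is kept and no case-sensitive dictionary lookup is needed.
    return match.group(0).replace(" ", "_")


sample = "MYNew York, new york and NEW YORK"
converted = pattern_ci.sub(join_words, sample)
print(converted)
# MYNew York, new_york and NEW_YORK
```

"MYNew York" is left alone because there is no word boundary between "MY" and "New", while the lowercase and uppercase variants are both replaced.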

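Try changing the for loop so that it only replaces phrases that exist in the text:

for phrase in set(phrases).intersection(text.split()):
...

Note that text.split() yields single words, so this filter only finds single-word vocabulary entries; a multi-word phrase such as New York never appears in the split result. Try it with and without threading.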
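On the threading question: in CPython the global interpreter lock generally keeps threads from speeding up CPU-bound string processing, which may be why the ten threads did not help. The sketch below shows one possible alternative using worker processes from the standard library. It is not taken from either answer; `convert_file`, `convert_all`, and the `workers` default are hypothetical names and values, and the input files are assumed to end in ".txt".

```python
import re
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

PHRASES = {"United States": "United_States", "New York": "New_York"}
PATTERN = re.compile(r"\b({})\b".format("|".join(re.escape(p) for p in PHRASES)))


def convert_file(src, dst_dir):
    """Rewrite one document and return the path of the new file."""
    src = Path(src)
    text = src.read_text()
    out = Path(dst_dir) / (src.stem + "_phrases.txt")
    out.write_text(PATTERN.sub(lambda m: PHRASES[m.group(1)], text))
    return str(out)


def convert_all(src_dir, dst_dir, workers=4):
    """Convert every .txt file in src_dir using a pool of processes."""
    files = sorted(str(p) for p in Path(src_dir).glob("*.txt"))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() pairs each file with the same destination folder.
        return list(pool.map(convert_file, files, [dst_dir] * len(files)))
```

When used from a script, the call to convert_all() should sit under an `if __name__ == "__main__":` guard, since some platforms start worker processes by re-importing the main module. Whether this is actually faster depends on file sizes and disk speed, so it is worth timing against the single-process version.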
