简体   繁体   English

替换长列表Python中的元素

[英]Replacing elements in long list Python

I am trying to replace a number of elements (around 40k) from a much larger list (3 million elements) with a tag.我试图用标签替换更大的列表(300 万个元素)中的多个元素(大约 40k)。 The code below works however it is extremely slow.下面的代码有效,但速度非常慢。

def UNKWords(words):
    words = Counter(words)
    wordsToBeReplaced = []
    for key, value in words.items():
        if(value == 1):
            wordsToBeReplaced.append(key)
    return wordsToBeReplaced

wordsToBeReplaced = UNKWords(trainingWords)

replacedWordsList = ["<UNK>" if word in wordsToBeReplaced else word for word in trainingWords]

Is there a more efficient way of replacing elements in such a large list?有没有更有效的方法来替换如此大列表中的元素?

You can make wordsToBeReplaced a set instead so that lookups can be done in constant time on average rather than linear time:您可以将wordsToBeReplaced一个集合,以便可以在平均恒定时间内而不是线性时间内完成查找:

def UNKWords(words):
    return {word for word, count in Counter(words).items() if count == 1}

You can squeeze out a little more performance by avoiding the Counter and just keeping track of the words you've seen more than once.您可以通过避免使用 Counter 并跟踪您不止一次看到的单词来提高性能。 This effectively removes a loop from your function and lets you collect the words you've seen more than once in a single loop:这有效地从您的函数中删除了一个循环,并允许您在单个循环中收集多次看到的单词:

def UNKWords(words):
    seen = set()
    kept = set()
    for word in words:
        if word in seen:
            kept.add(word)
        else:
            seen.add(word)
    return kept

def replaceWords(words):   
    wordsToKeep = UNKWords(trainingWords)
    return ["<UNK>" if word not in wordsToKeep else word for word in trainingWords]

…and then use sets instead of lists to test for membership, as others have mentioned, since these allow constant time membership tests. …然后使用集合而不是列表来测试成员资格,正如其他人提到的那样,因为这些允许恒定时间成员资格测试。

In addition to @blhsing's answer I suggest using a frozenset;除了@blhsing 的回答,我还建议使用frozenset; also doing the replacement in-place (unless, of course, you need to keep the original list for other purposes):还进行就地替换(当然,除非您需要为其他目的保留原始列表):

def UNKWords(words):
    return frozenset(word for word, count in Counter(words).items() if count == 1)

wordsToBeReplaced = UNKWords(trainingWords)
for i, word in enumerate(trainingWords):
    if word in wordsToBeReplaced:
        trainingWords[i] = '<UNK>'

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM