Replacing elements in a long list in Python
I am trying to replace a number of elements (around 40k) in a much larger list (3 million elements) with a tag. The code below works, but it is extremely slow.
from collections import Counter

def UNKWords(words):
    # Collect every word that occurs exactly once.
    words = Counter(words)
    wordsToBeReplaced = []
    for key, value in words.items():
        if value == 1:
            wordsToBeReplaced.append(key)
    return wordsToBeReplaced

wordsToBeReplaced = UNKWords(trainingWords)
replacedWordsList = ["<UNK>" if word in wordsToBeReplaced else word for word in trainingWords]
Is there a more efficient way of replacing elements in such a large list?
You can make wordsToBeReplaced a set instead of a list, so that each lookup takes constant time on average rather than time linear in the number of words to be replaced:
from collections import Counter

def UNKWords(words):
    # A set comprehension: every word that occurs exactly once.
    return {word for word, count in Counter(words).items() if count == 1}
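To see the difference on a small scale, here is a minimal sketch that runs the same comprehension once against a list and once against a set. The data in sample_words and the helper name replace_rare are made up for illustration, and the timings it prints will vary with the machine and the data.

```python
from collections import Counter
import timeit

# Hypothetical stand-in for trainingWords: 2,500 words repeated 4 times each,
# plus 1,000 words that occur exactly once.
sample_words = [f"w{i % 2500}" for i in range(10000)] + [f"rare{i}" for i in range(1000)]

counts = Counter(sample_words)
rare_list = [word for word, count in counts.items() if count == 1]
rare_set = set(rare_list)

def replace_rare(words, rare):
    # `word in rare` scans the whole list when `rare` is a list,
    # but is a hash lookup when `rare` is a set.
    return ["<UNK>" if word in rare else word for word in words]

list_time = timeit.timeit(lambda: replace_rare(sample_words, rare_list), number=1)
set_time = timeit.timeit(lambda: replace_rare(sample_words, rare_set), number=1)

with_list = replace_rare(sample_words, rare_list)
with_set = replace_rare(sample_words, rare_set)

print(f"list membership: {list_time:.4f} s")
print(f"set membership:  {set_time:.4f} s")
print("same result:", with_list == with_set)
```

Both calls produce the same output list; only the cost of each membership test changes.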
You can squeeze out a little more performance by avoiding the Counter and just keeping track of the words you've seen more than once. This effectively removes a loop from your function and lets you collect the words you've seen more than once in a single loop:
def UNKWords(words):
    # Despite the name, this returns the words to keep:
    # those that occur more than once.
    seen = set()
    kept = set()
    for word in words:
        if word in seen:
            kept.add(word)
        else:
            seen.add(word)
    return kept

def replaceWords(words):
    # Use the argument rather than the global trainingWords.
    wordsToKeep = UNKWords(words)
    return ["<UNK>" if word not in wordsToKeep else word for word in words]
…and then use sets instead of lists to test for membership, as others have mentioned, since sets allow constant-time membership tests.
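As a quick sanity check, the sketch below compares the single-pass idea with the Counter-based one on a tiny list. The function names and the sample list are hypothetical and chosen only for this illustration.

```python
from collections import Counter

def words_seen_twice(words):
    # Single pass: a word moves into `kept` the second time it is encountered.
    seen = set()
    kept = set()
    for word in words:
        if word in seen:
            kept.add(word)
        else:
            seen.add(word)
    return kept

def replace_single_pass(words):
    keep = words_seen_twice(words)
    return [word if word in keep else "<UNK>" for word in words]

def replace_with_counter(words):
    rare = {word for word, count in Counter(words).items() if count == 1}
    return ["<UNK>" if word in rare else word for word in words]

sample = ["a", "b", "a", "c", "b", "d"]
single_pass_result = replace_single_pass(sample)
counter_result = replace_with_counter(sample)

print(single_pass_result)  # ['a', 'b', 'a', '<UNK>', 'b', '<UNK>']
print(single_pass_result == counter_result)  # True
```

The two approaches agree because a word is absent from the "seen more than once" set exactly when its count is 1.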
In addition to @blhsing's answer, I suggest using a frozenset, and also doing the replacement in place (unless, of course, you need to keep the original list for other purposes):
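from collections import Counter

def UNKWords(words):
    # An immutable set of every word that occurs exactly once.
    return frozenset(word for word, count in Counter(words).items() if count == 1)

wordsToBeReplaced = UNKWords(trainingWords)
# Overwrite the rare words directly in the original list.
for i, word in enumerate(trainingWords):
    if word in wordsToBeReplaced:
        trainingWords[i] = '<UNK>'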
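The caveat about keeping the original list can be illustrated with a short sketch. It assumes a hypothetical helper, replace_rare_in_place, and a four-word sample list; it shows that in-place replacement changes the list for every reference to it, so a copy has to be made beforehand if the original is still needed.

```python
from collections import Counter

def replace_rare_in_place(words, tag="<UNK>"):
    """Overwrite words that occur exactly once with `tag`; return how many were replaced."""
    rare = frozenset(word for word, count in Counter(words).items() if count == 1)
    replaced = 0
    for i, word in enumerate(words):
        if word in rare:
            words[i] = tag
            replaced += 1
    return replaced

training = ["the", "cat", "the", "dog"]
alias = training          # another name for the same list object
backup = list(training)   # an independent copy taken before the replacement

replaced_count = replace_rare_in_place(training)

print(training)        # ['the', '<UNK>', 'the', '<UNK>']
print(alias)           # ['the', '<UNK>', 'the', '<UNK>'] (same object as training)
print(backup)          # ['the', 'cat', 'the', 'dog']
print(replaced_count)  # 2
```

Replacing in place avoids allocating a second multi-million-element list, at the cost of losing the original contents unless a copy is kept.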