Remove tokens of a list if they are in another list (improve speed)
I have a list of lists of tokenized texts. The list contains around 1,200,000 texts. An example of this list is shown below:
texts = [
    ['hi', 'how', 'are', 'you'],
    ['i', 'am', 'fine', 'thank', 'you'],
    ...
]
I'm trying to remove, from each list, the words that appear in another list. That list contains around 90,000 words and looks like this:

removing_words = ['ok', 'bye', 'hi', ...]
My code to do this is:
texts = [[token for token in text if token not in removing_words] for text in texts]
It works fine, but it is very, very slow. Any idea of how I can improve this? Thank you so much!
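A minimal sketch of the usual fix for this pattern: `token not in removing_words` scans the 90,000-element list on every check (O(n) per lookup), while converting `removing_words` to a `set` makes each membership test O(1) on average. The sample data here is hypothetical, standing in for the real lists:

```python
# Hypothetical stand-ins for the real 1,200,000 texts and 90,000 words.
texts = [
    ['hi', 'how', 'are', 'you'],
    ['i', 'am', 'fine', 'thank', 'you'],
]
removing_words = ['ok', 'bye', 'hi', 'thank']

# Build the set once; membership tests against it are O(1) on average
# instead of scanning the whole list each time.
removing_set = set(removing_words)

filtered = [[token for token in text if token not in removing_set]
            for text in texts]
print(filtered)  # → [['how', 'are', 'you'], ['i', 'am', 'fine', 'you']]
```

The comprehension itself is unchanged; only the lookup structure differs, which is typically where nearly all the time goes at this scale.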
I would look at how the tokens are generated. Try creating a dictionary of all tokens and their frequencies: the dict keeps a count of how many times each token appears, and its keys are the unique tokens.
#### PASS 1 - create frequency dictionary
from collections import defaultdict

FreqDict = defaultdict(int)
for tList in texts:
    for token in tList:
        FreqDict[token] += 1
print(FreqDict)

#### PASS 2 - keep only tokens that appear exactly once
newtexts = [[token for token in tList if FreqDict[token] == 1] for tList in texts]
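The same two-pass idea can be written more compactly with `collections.Counter`, which does the counting of pass 1 in one call. The sample data below is hypothetical, chosen so that 'hi' appears twice and everything else once:

```python
from collections import Counter

# Hypothetical sample data: 'hi' occurs in both texts.
texts = [['hi', 'how', 'are', 'you'],
         ['hi', 'there']]

# Pass 1: count every token across all texts.
freq = Counter(token for tList in texts for token in tList)

# Pass 2: keep only tokens that occur exactly once overall.
newtexts = [[token for token in tList if freq[token] == 1]
            for tList in texts]
print(newtexts)  # → [['how', 'are', 'you'], ['there']]
```

Note that this filters by corpus-wide frequency rather than by membership in `removing_words`, so it only matches the question's goal if the words to remove are exactly the repeated ones.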