How to efficiently remove the elements of one array from another array

Question

I'm doing an analysis of a text corpus with about 135k documents (several pages per document), and a vocabulary of about 800k words. I noticed that something like half of the vocabulary is words with a frequency of 1 or 2, so I want to remove them.

So I'm running something like this:

remove_indices = np.array(index_df[index_df['frequency'] <= 2]['index']).astype(int)

for file_name in tqdm(corpus):
    content = corpus[file_name].astype(int)
    content = [index for index in content if index not in remove_indices]
    corpus[file_name] = np.array(content).astype(np.uint32)

Where corpus looks something like:

{
    'filename1.txt': np.array([43, 177718, 3817, ...., 28181]).astype(np.uint32),
    'filename2.txt': ....
}

and each word was previously encoded to a positive integer index.

The problem lies in content = [index for index in content if index not in remove_indices], which needs to go through len(remove_indices) * len(content) checks on each iteration. This would take forever (tqdm is telling me 100h+). Any tips on how to speed this up?

What I've tried so far

Taking advantage of the fact that a word with frequency 1 or 2 only occurs once or twice in the whole corpus, I removed each index from remove_indices once it had been removed from the corpus. It still takes forever...

Answer 1

You could use the numpy.isin() function (https://numpy.org/devdocs/reference/generated/numpy.isin.html) instead of this list comprehension.
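As a rough sketch of what that could look like (the toy corpus, the values in remove_indices, and the helper name filter_corpus are made up for illustration, not taken from the question's data), numpy.isin with invert=True gives a boolean mask of the entries to keep:

```python
import numpy as np

# Hypothetical toy data standing in for the real corpus and index_df.
corpus = {
    "filename1.txt": np.array([43, 7, 3817, 7, 9], dtype=np.uint32),
    "filename2.txt": np.array([9, 9, 43], dtype=np.uint32),
}
remove_indices = np.array([7, 9], dtype=np.uint32)


def filter_corpus(corpus, remove_indices):
    """Return a new dict with every index found in remove_indices dropped."""
    filtered = {}
    for file_name, content in corpus.items():
        # invert=True marks the elements that are NOT in remove_indices.
        keep_mask = np.isin(content, remove_indices, invert=True)
        # Boolean indexing keeps the original order and dtype.
        filtered[file_name] = content[keep_mask]
    return filtered


filtered = filter_corpus(corpus, remove_indices)
print(filtered)
```

The membership test then runs inside NumPy rather than in a Python-level loop over every element, which is where most of the time in the original comprehension goes.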

Alternatively, you could create a set from the indices to remove. Then the in operation will be O(1) on average instead of O(n) (where n is the length of the array).
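A minimal sketch of the set-based variant, keeping the shape of the loop from the question (the toy corpus and the name remove_set are assumptions for illustration):

```python
import numpy as np

# Hypothetical toy data standing in for the real corpus and index_df.
corpus = {
    "filename1.txt": np.array([43, 7, 3817, 7, 9], dtype=np.uint32),
    "filename2.txt": np.array([9, 9, 43], dtype=np.uint32),
}
remove_indices = np.array([7, 9])

# Build the set once, outside the loop. tolist() yields plain Python ints,
# which are cheaper to hash and compare than NumPy scalars.
remove_set = set(remove_indices.tolist())

for file_name in corpus:
    content = corpus[file_name].tolist()
    # Each membership test is now an average O(1) hash lookup.
    kept = [index for index in content if index not in remove_set]
    corpus[file_name] = np.array(kept, dtype=np.uint32)

print(corpus)
```

The total work per file drops from roughly len(remove_indices) * len(content) comparisons to about len(content) hash lookups.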

Answer 1 (3, accepted, 2020-08-14 09:08:57)
