How to efficiently remove elements of one array from another
I'm doing an analysis of a text corpus with about 135k documents (several pages per document) and a vocabulary of about 800k words. I noticed that roughly half of the vocabulary consists of words with a frequency of 1 or 2, so I want to remove them.
So I'm running something like this:
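import numpy as np
from tqdm import tqdm

# Indices of words that occur at most twice in the whole corpus
remove_indices = np.array(index_df[index_df['frequency'] <= 2]['index']).astype(int)
for file_name in tqdm(corpus):
    content = corpus[file_name].astype(int)
    # Slow step: every membership test scans all of remove_indices
    content = [index for index in content if index not in remove_indices]
    corpus[file_name] = np.array(content).astype(np.uint32)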
Where corpus looks something like:
{
    'filename1.txt': np.array([43, 177718, 3817, ...., 28181]).astype(np.uint32),
    'filename2.txt': ....
}
and each word was previously encoded to a positive integer index.
The problem lies in
content = [index for index in content if index not in remove_indices]
which needs to go through up to len(remove_indices) * len(content) checks on each iteration. This would take forever (tqdm is telling me 100h+). Any tips on how to speed this up?
What I've tried so far:
Removing each index from remove_indices after it has been removed from the corpus. Still taking forever...

You could use the numpy.isin() function (https://numpy.org/devdocs/reference/generated/numpy.isin.html) instead of this list comprehension.
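As a rough illustration of the numpy.isin() suggestion, here is a minimal sketch. The toy corpus, the remove_indices values, and the helper name filter_corpus are made up for the example; only the dictionary-of-arrays layout comes from the question.

```python
import numpy as np

# Hypothetical toy data standing in for the question's corpus and remove_indices.
corpus = {
    "filename1.txt": np.array([43, 7, 3817, 7, 28181], dtype=np.uint32),
    "filename2.txt": np.array([5, 43, 5, 99], dtype=np.uint32),
}
remove_indices = np.array([7, 5], dtype=np.uint32)


def filter_corpus(corpus, remove_indices):
    """Return a new dict in which every index found in remove_indices is dropped."""
    filtered = {}
    for file_name, content in corpus.items():
        # invert=True marks the elements that are NOT in remove_indices
        keep_mask = np.isin(content, remove_indices, invert=True)
        # Boolean indexing keeps the original order and dtype
        filtered[file_name] = content[keep_mask]
    return filtered


filtered = filter_corpus(corpus, remove_indices)
```

The membership test for a whole document runs as one vectorized call instead of a Python-level loop over every word.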
Alternatively, you could create a set of the indices to remove. Then the in operation will be O(1) on average instead of O(n) (where n is the length of the array).
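The set-based variant could look something like the sketch below. It keeps the question's list comprehension but tests membership against a set built once up front. The toy data and the helper name drop_indices are assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy data standing in for the question's corpus and remove_indices.
corpus = {
    "filename1.txt": np.array([43, 7, 3817, 7, 28181], dtype=np.uint32),
    "filename2.txt": np.array([5, 43, 5, 99], dtype=np.uint32),
}
remove_indices = np.array([7, 5])

# Build the set once; tolist() converts NumPy scalars to plain Python ints.
remove_set = set(remove_indices.tolist())


def drop_indices(content, remove_set):
    """Keep only the indices of one document that are not in remove_set."""
    kept = [index for index in content.tolist() if index not in remove_set]
    return np.array(kept, dtype=np.uint32)


cleaned = {name: drop_indices(content, remove_set) for name, content in corpus.items()}
```

Each membership check is now a hash lookup, so the cost per document grows with len(content) rather than with len(remove_indices) * len(content).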