
How to efficiently remove elements of one array from another

I'm analyzing a text corpus of about 135k documents (several pages per document) with a vocabulary of about 800k words. I noticed that roughly half of the vocabulary consists of words with a frequency of 1 or 2, so I want to remove them.

So I'm running something like this:

remove_indices = np.array(index_df[index_df['frequency'] <= 2]['index']).astype(int)

for file_name in tqdm(corpus):
    content = corpus[file_name].astype(int)
    content = [index for index in content if index not in remove_indices]
    corpus[file_name] = np.array(content).astype(np.uint32)

where corpus looks something like:

{
    'filename1.txt': np.array([43, 177718, 3817, ...., 28181]).astype(np.uint32),
    'filename2.txt': ....
}

and each word was previously encoded as a positive integer index.

The problem lies in content = [index for index in content if index not in remove_indices], which needs up to len(remove_indices) * len(content) comparisons for each file, because every membership test scans the remove_indices array. This would take forever (tqdm is telling me 100h+). Any tips on how to speed this up?

What I've tried so far

  • Taking advantage of the fact that each of these words occurs only once or twice, so an index can be dropped from remove_indices once it has been removed from the corpus. Still taking forever...

You could use numpy.isin() (https://numpy.org/devdocs/reference/generated/numpy.isin.html) instead of the list comprehension.
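As a rough sketch of that suggestion, the loop body could be replaced by a boolean mask built with numpy.isin(). The toy corpus and remove_indices below are made-up stand-ins for the real data in the question, not the asker's actual values.

```python
import numpy as np

# Hypothetical toy data standing in for the real corpus and remove_indices.
corpus = {
    "filename1.txt": np.array([43, 7, 3817, 7, 28181], dtype=np.uint32),
    "filename2.txt": np.array([5, 43, 5, 99], dtype=np.uint32),
}
remove_indices = np.array([7, 99], dtype=np.uint32)

for file_name, content in corpus.items():
    # True where the word index is NOT one of the indices to remove.
    keep = np.isin(content, remove_indices, invert=True)
    # Boolean indexing keeps order, duplicates, and the uint32 dtype.
    corpus[file_name] = content[keep]

print(corpus["filename1.txt"])
```

The membership test runs inside NumPy for the whole array at once, so there is no per-element Python loop.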

Alternatively, you could create a set from the words/indices you are testing against (here, remove_indices). The in operation then takes O(1) on average instead of O(n), where n is the length of the array.
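A minimal sketch of the set-based variant, keeping the original list comprehension but testing membership against a Python set. The helper name drop_indices and the toy data are hypothetical, introduced only for illustration.

```python
import numpy as np

# Hypothetical toy data standing in for the real corpus and remove_indices.
corpus = {
    "filename1.txt": np.array([43, 7, 3817, 7, 28181], dtype=np.uint32),
    "filename2.txt": np.array([5, 43, 5, 99], dtype=np.uint32),
}
remove_indices = np.array([7, 99])

# Build the set once, outside the loop; tolist() yields plain Python ints.
remove_set = set(remove_indices.tolist())


def drop_indices(content, remove_set):
    """Return a uint32 array without the entries found in remove_set."""
    # Each `in` test against a set is a hash lookup, O(1) on average.
    kept = [index for index in content.tolist() if index not in remove_set]
    return np.array(kept, dtype=np.uint32)


for file_name in corpus:
    corpus[file_name] = drop_indices(corpus[file_name], remove_set)

print(corpus["filename2.txt"])
```

With a set, the work per file is proportional to len(content) rather than len(remove_indices) * len(content).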

