
How to efficiently remove elements of one array from another

I'm doing an analysis of a text corpus with about 135k documents (several pages per document) and a vocabulary of about 800k words. I noticed that roughly half of the vocabulary consists of words with a frequency of 1 or 2, so I want to remove them.

So I'm running something like this:

remove_indices = np.array(index_df[index_df['frequency'] <= 2]['index']).astype(int)

for file_name in tqdm(corpus):
    content = corpus[file_name].astype(int)
    content = [index for index in content if index not in remove_indices]
    corpus[file_name] = np.array(content).astype(np.uint32)

Where corpus looks something like:

{
    'filename1.txt': np.array([43, 177718, 3817, ...., 28181]).astype(np.uint32),
    'filename2.txt': ....
}

and each word was previously encoded to a positive integer index.

The problem lies in content = [index for index in content if index not in remove_indices], which needs roughly len(remove_indices) * len(content) membership checks per file. This would take forever (tqdm is estimating 100h+). Any tips on how to speed this up?

What I've tried so far

  • Taking advantage of the fact that if a word has frequency 1 or 2 only, we can remove it from remove_indices after it has been removed from the corpus. Still taking forever...

You could use the numpy.isin() function (https://numpy.org/devdocs/reference/generated/numpy.isin.html) instead of this list comprehension. It builds a boolean mask in vectorized C code rather than looping in Python.
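A minimal sketch of that suggestion, using a toy corpus in the same {filename: uint32 array} shape as the question (the values here are made up for illustration):

```python
import numpy as np

# Toy corpus in the same shape as the question's dict of uint32 index arrays.
corpus = {
    'filename1.txt': np.array([43, 7, 3817, 7, 28181], dtype=np.uint32),
}
remove_indices = np.array([7, 28181], dtype=np.uint32)

for file_name in corpus:
    content = corpus[file_name]
    # np.isin computes the whole membership mask at once;
    # invert=True keeps the elements NOT present in remove_indices.
    mask = np.isin(content, remove_indices, invert=True)
    corpus[file_name] = content[mask]

print(corpus['filename1.txt'].tolist())  # [43, 3817]
```

Internally np.isin sorts the test values, so the per-file cost is closer to (len(content) + len(remove_indices)) * log(len(remove_indices)) than to the quadratic scan of the list comprehension.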

Alternatively, you could create a set of the words/indices to remove. Then each in check is O(1) instead of O(n) (where n is the length of remove_indices).
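The same toy example with the set-based approach, keeping the question's list-comprehension style but swapping the array for a set (toy data, not the real corpus):

```python
import numpy as np

corpus = {
    'filename1.txt': np.array([43, 7, 3817, 7, 28181], dtype=np.uint32),
}
# Hashing the indices into a set makes each membership test O(1) on average.
remove_set = {7, 28181}

for file_name in corpus:
    # .tolist() converts numpy scalars to plain Python ints before hashing.
    kept = [i for i in corpus[file_name].tolist() if i not in remove_set]
    corpus[file_name] = np.array(kept, dtype=np.uint32)

print(corpus['filename1.txt'].tolist())  # [43, 3817]
```

This keeps the loop in Python, so it is usually slower than the fully vectorized np.isin version, but it already drops the cost from quadratic to linear per file.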
