
How to check if a list item within a nested list exists in a set?

I have a nested list containing every tokenized sentence from a corpus, and a set of all the words that occur more than once. How would I check whether each word in the nested list is in that set? I then need to replace every word that occurs more than once with the string '<UNK>'.

I tried:

for sent in tokenized_sents:
    for word in sent:
        if word in set:
            word = '<UNK>'

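Assigning to word inside the loop only rebinds the loop variable; the list itself is left unchanged. Instead, you can count the occurrences of each word in your corpus with collections.Counter, then replace the repeated words by index: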

from collections import Counter

corpus = [['Hello', ',', 'my', 'name', 'is', 'Walter'], ['I', 'like', 'my', 'cats']]

# Flatten the nested list so every word can be counted in one pass.
corpus_unnested = []
for sentence in corpus:
    corpus_unnested += sentence
my_dict = Counter(corpus_unnested)

# Assign by index so the change is written back into the nested list.
for i, sentence in enumerate(corpus):
    for j, word in enumerate(sentence):
        if my_dict[word] > 1:
            corpus[i][j] = '<UNK>'

print(corpus)
# [['Hello', ',', '<UNK>', 'name', 'is', 'Walter'], ['I', 'like', '<UNK>', 'cats']]
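
If you would rather keep the set-membership test from the question, here is a minimal sketch that builds the set of repeated words from a Counter and returns a new nested list instead of modifying the original. The function name replace_repeated and its unk parameter are made up for illustration; they are not part of any library.

```python
from collections import Counter


def replace_repeated(sentences, unk="<UNK>"):
    """Return a copy of `sentences` with every word that occurs more than once replaced by `unk`."""
    # Count every word across all sentences.
    counts = Counter(word for sent in sentences for word in sent)
    # Set of words that occur more than once; membership tests on a set are fast.
    repeated = {word for word, n in counts.items() if n > 1}
    # Build a new nested list rather than rebinding the loop variable.
    return [[unk if word in repeated else word for word in sent] for sent in sentences]


corpus = [['Hello', ',', 'my', 'name', 'is', 'Walter'], ['I', 'like', 'my', 'cats']]
result = replace_repeated(corpus)
print(result)
```

Because the function returns a new list, the original corpus stays untouched, which can be handy if you still need the raw tokens later.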

