I have a nested list containing every sentence from a corpus, and a set of all the words that occur more than once. How would I check whether each word in the nested list is in that set? I then need to replace all words that occur more than once with the string '<UNK>'.
I tried:
for sent in tokenized_sents:
    for word in sent:
        if word in set:
            word = '<UNK>'
You can use collections.Counter to create a dictionary that keeps track of the number of occurrences of each word in your corpus, then replace every word whose count is greater than 1:
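from collections import Counter

corpus = [['Hello', ',', 'my', 'name', 'is', 'Walter'], ['I', 'like', 'my', 'cats']]

# Flatten the nested list so every word can be counted in one pass
corpus_unnested = []
for sentence in corpus:
    corpus_unnested += sentence

# Map each word to its number of occurrences in the whole corpus
my_dict = Counter(corpus_unnested)

# Assign by index so the change is written back into the nested list
for i, sentence in enumerate(corpus):
    for j, word in enumerate(sentence):
        if my_dict[word] > 1:
            corpus[i][j] = '<UNK>'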
>>> print(corpus)
[['Hello', ',', '<UNK>', 'name', 'is', 'Walter'], ['I', 'like', '<UNK>', 'cats']]
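Note that the loop in the question has no effect because word = '<UNK>' only rebinds the loop variable; it never writes to the list. If you would rather leave the original list untouched, here is a minimal sketch that builds a new nested list instead. The function name replace_repeated and its parameters unk and min_count are made up for this example; they are not part of any library.

```python
from collections import Counter
from itertools import chain


def replace_repeated(sentences, unk="<UNK>", min_count=2):
    """Return a new nested list; words seen at least min_count times become unk."""
    # chain.from_iterable flattens the nested list lazily for counting
    counts = Counter(chain.from_iterable(sentences))
    return [
        [unk if counts[word] >= min_count else word for word in sentence]
        for sentence in sentences
    ]


corpus = [['Hello', ',', 'my', 'name', 'is', 'Walter'], ['I', 'like', 'my', 'cats']]
result = replace_repeated(corpus)
print(result)
```

With the sample corpus this should give the same result as the loop above, while corpus itself still contains 'my'.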