I am doing a data cleaning task on a text file full of sentences. After stemming the sentences I would like to get the frequency of the words in my stemmed list. However, when I print the stemmed list, stem_list, I get a separate list for every sentence, like so:
[u'anyon', u'think', u'forgotten', u'day', u'parti', u'friend', u'friend', u'paymast', u'us', u'longer', u'memori']
[u'valu', u'friend', u'bought', u'properti', u'actual', u'relev', u'repres', u'actual', u'valu', u'properti']
[u'monster', u'wreck', u'reef', u'cargo', u'vessel', u'week', u'passeng', u'ship', u'least', u'24', u'hour', u'upload', u'com']
I would like to obtain the frequency of all of the words, but I am only obtaining the frequency per sentence with the following code:
fdist = nltk.FreqDist(stem_list)
for word, frequency in fdist.most_common(50):
    print(u'{};{}'.format(word, frequency))
This is producing the following output: friend;2 paymast;1 longer;1 memori;1 parti;1 us;1 day;1 anyon;1 forgotten;1 think;1 actual;2 properti;2 valu;2 friend;1 repres;1 relev;1 bought;1 week;1 cargo;1 monster;1 hour;1 wreck;1 upload;1 passeng;1 least;1 reef;1 24;1 vessel;1 ship;1 com;1 within;1 area;1 territori;1 custom;1 water;1 3;1
The word 'friend' is counted separately because it appears in two different sentences. How would I make it count 'friend' across all sentences and display friend;3 in this case?
You could just concatenate everything into one list:
stem_list = [inner for outer in stem_list for inner in outer]
and process the same way you do.
Otherwise, you could keep the same code, but instead of printing, build a dict and populate it with the values you get. Each time you encounter a new word you create the key; otherwise you add to its value.
all_words_count = dict()
for word, frequency in fdist.most_common(50):
    if word in all_words_count:  # already seen
        all_words_count[word] += frequency
    else:  # not seen yet
        all_words_count[word] = frequency
for word in all_words_count:
    print(u'{};{}'.format(word, all_words_count[word]))
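Note that this only accumulates counts across sentences if you build a frequency distribution per sentence inside a loop. A minimal sketch of that loop, using collections.Counter as a stand-in for nltk.FreqDist (FreqDist is a Counter subclass, so most_common behaves the same); the short stem_list here is hypothetical sample data:

```python
from collections import Counter

# Hypothetical per-sentence input, shaped like the question's stem_list
stem_list = [
    [u'friend', u'friend', u'day'],
    [u'friend', u'valu'],
]

all_words_count = dict()
for sentence in stem_list:
    fdist = Counter(sentence)  # stands in for nltk.FreqDist(sentence)
    for word, frequency in fdist.most_common(50):
        # accumulate the per-sentence count into the global dict
        all_words_count[word] = all_words_count.get(word, 0) + frequency

for word in all_words_count:
    print(u'{};{}'.format(word, all_words_count[word]))
```

With this data, 'friend' ends up with a combined count of 3 even though it appears in two different sentences.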
I think the easiest way is to combine the arrays before passing them to the function.
allwords = [inner for outer in stem_list for inner in outer]
fdist = nltk.FreqDist(allwords)
for word, frequency in fdist.most_common(50):
    print(u'{};{}'.format(word, frequency))
or shorter:
fdist = nltk.FreqDist([inner for outer in stem_list for inner in outer])
for word, frequency in fdist.most_common(50):
    print(u'{};{}'.format(word, frequency))
I think your input looks like:
stem_list = [[u'anyon', u'think', u'forgotten', u'day', u'parti', u'friend', u'friend', u'paymast', u'us', u'longer', u'memori'],
[u'valu', u'friend', u'bought', u'properti', u'actual', u'relev', u'repres', u'actual', u'valu', u'properti'],
[u'monster', u'wreck', u'reef', u'cargo', u'vessel', u'week', u'passeng', u'ship', u'least', u'24', u'hour', u'upload', u'com'],
[.....], etc for the other sentences ]
so you have two levels of arrays: an outer array of sentences and an inner array of the words in each sentence. With allwords = [inner for outer in stem_list for inner in outer] you run through the sentences and combine them into one array of words.
You could flatten your 2D array first with chain.from_iterable:

from itertools import chain

fdist = nltk.FreqDist(chain.from_iterable(stem_list))
for word, frequency in fdist.most_common(50):
    print(u'{};{}'.format(word, frequency))
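Since nltk.FreqDist is a subclass of collections.Counter, the same flatten-and-count idea can be sketched with the standard library alone (the short stem_list here is hypothetical sample data):

```python
from collections import Counter
from itertools import chain

# Hypothetical sample data shaped like the question's stem_list
stem_list = [
    [u'friend', u'friend', u'day'],
    [u'friend', u'valu'],
]

# chain.from_iterable lazily yields every word from every sentence,
# so Counter sees one flat stream of words
fdist = Counter(chain.from_iterable(stem_list))
for word, frequency in fdist.most_common(50):
    print(u'{};{}'.format(word, frequency))
```

This avoids building the intermediate flattened list in memory, since chain.from_iterable is a generator.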