I have a text in which every word is tagged with a part-of-speech tag. Here is an example of the text:
What/NOUN could/VERB happen/VERB next/ADJ ?/PUNCT
I need to find all the occurrences where a /PUNCT token is followed by either a NOUN, PRON, or PROPN token, and also count which one occurs the most often.
So one of the answers would look like this: ?/PUNCT What/NOUN or ./PUNCT What/NOUN
Further on, the word "Deal" appears 6 times, and I need to show this with code.
I am not allowed to use NLTK, only the collections module.
I have tried several different things, but I don't really know what to do here. I think I need to use defaultdict and then somehow write a while loop that gives me back a list of the matching pairs.
Here is a test program that does what you want.
It first splits the long string on spaces (' '), which creates a list of word/class elements. The for loop then checks whether a PUNCT element is followed by a NOUN, PRON, or PROPN element and saves each match to a list.
The code is as follows:
from collections import Counter

string = "What/NOUN could/VERB happen/VERB next/ADJ ?/PUNCT What/NOUN could/VERB happen/VERB next/ADJ ?/PUNCT"
words = string.split(' ')
found = []
# Compare each element with the one that follows it.
for n, (first, second) in enumerate(zip(words[:-1], words[1:])):
    # Split on the last '/' so a word that itself contains '/' keeps its tag.
    first_class = first.rsplit('/', 1)[-1]
    second_class = second.rsplit('/', 1)[-1]
    if first_class == 'PUNCT' and second_class in ["NOUN", "PRON", "PROPN"]:
        print(f"Found occurrence at data list index {n} and {n+1} with {first_class}, {second_class}")
        found.append(f'{words[n]} {words[n+1]}')
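
The question also asks which combination occurs most often. One way to get that is to tally the matches with Counter while looping. The sketch below is only an illustration: the sample text and the names TARGET_TAGS, pair_counts, and tag_counts are made up for this example and are not part of the original program.

```python
from collections import Counter

# Hypothetical sample text; replace it with the real tagged text.
text = "Deal/NOUN ?/PUNCT What/NOUN now/ADV ./PUNCT He/PRON left/VERB ./PUNCT What/NOUN"

TARGET_TAGS = {"NOUN", "PRON", "PROPN"}

tokens = text.split()
pair_counts = Counter()  # how often each "punct-token next-token" pair occurs
tag_counts = Counter()   # how often each target tag follows a PUNCT token

# Walk over consecutive token pairs.
for first, second in zip(tokens, tokens[1:]):
    first_tag = first.rsplit("/", 1)[-1]
    second_tag = second.rsplit("/", 1)[-1]
    if first_tag == "PUNCT" and second_tag in TARGET_TAGS:
        pair_counts[f"{first} {second}"] += 1
        tag_counts[second_tag] += 1

print(pair_counts.most_common())  # all matching pairs, most frequent first
print(tag_counts.most_common(1))  # the tag that follows PUNCT most often
```

Counting the tag and the full pair separately keeps both questions answerable: which tag follows punctuation most often, and which exact pair is the most frequent.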
To count the words:
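# Strip the tag from each element, keeping only the word itself.
words_only = [i.rsplit('/', 1)[0] for i in words]
# most_common() returns (word, count) pairs sorted from most to least frequent.
word_counts = Counter(words_only).most_common()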
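
To show the count of one specific word such as "Deal", it may be easier to keep the Counter itself and look the word up directly. This is a minimal sketch with a made-up sample text, so it prints 3 rather than the 6 expected from the real text.

```python
from collections import Counter

# Hypothetical sample text; the real text is said to contain "Deal" six times.
text = "Deal/NOUN or/CCONJ no/DET Deal/NOUN ?/PUNCT Deal/NOUN !/PUNCT"

# Keep only the word part of each word/tag element and count the words.
word_counts = Counter(token.rsplit("/", 1)[0] for token in text.split())

print(word_counts["Deal"])  # number of times "Deal" appears
print(word_counts["deal"])  # lookup is case-sensitive; missing words give 0
```

A Counter returns 0 for a word it has never seen, so no check for a missing key is needed before the lookup.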