Python: count POS tag occurrences per word

File /content/try list.txt contains:

DT The NNP Fulton NNP County NNP Grand NNP Jury VBD said NNP Friday 0 DT an NN investigation IN of NNP Atlanta POS 's JJ recent JJ primary NN election VBD produced DT no NN evidence '' '' IN that DT any NNS irregularities VBD took NN place . . DT The NN jury RB further VBD said IN in JJ term-end NNS presentments IN that DT the NNP City NNP Executive
fname = open('/content/try list.txt', "r")
counts = dict()
for line in fname:
    words = line.split()

for word in words:
    if word not in counts:
        counts[word] = 1
    else:
        counts[word] += 1
print(counts)
"""
Output: {
'DT': 10, 'The': 2, 'NNP': 11, 'Fulton': 1, 'County': 1,
'Grand': 1, 'Jury': 1, 'VBD': 5, 'said': 2, 'Friday': 1,
'0': 1, 'an': 1, 'NN': 9, 'investigation': 1, 'IN': 8,
'of': 4, 'Atlanta': 2, 'POS': 1, "'s": 1, 'JJ': 4, 'recent': 1,
}
"""

This counts the occurrences of every token, words and tags alike. How can I count the tags for each word instead?

Expected output should be:

The-->DT:48, Fulton--> NNP:28

To count how many times each word occurs with a given POS tag, you need to iterate over the tag and the word at the same time. You also need a richer data structure, for example a dictionary keyed by word whose values are dictionaries keyed by POS tag, so that you get word -> pos -> count.

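with open('/content/try list.txt', "r") as fname:
    # The whole document is in one file, so there is no need for 'for line in fname'
    tokens = fname.read().split()

# The loop below assumes strictly alternating "TAG word" tokens. The sample has
# one untagged token, "0", which would shift every later pair (and leave an odd
# token count), so drop it. This assumes "0" never occurs as a real tagged word.
words = [tok for tok in tokens if tok != "0"]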

counts = dict()
# range(0, len(words), 2) will be [0, 2, 4, 6, ...]
for i in range(0, len(words), 2):
    pos = words[i]
    word = words[i+1]

    # Ensure word is in counts
    if word not in counts:
        counts[word] = dict()

    # Ensure pos is in counts[word]
    if pos not in counts[word]:
        counts[word][pos] = 0

    # Actual counting !
    counts[word][pos] += 1
print(counts)
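
To print the result in the format the question asks for (word-->TAG:count), the nested dictionary can be walked once more. The following is only a minimal sketch: it uses a small hand-written dictionary instead of the one built from the file, and `format_counts` is a hypothetical helper name, not something from the original answer.

```python
def format_counts(counts):
    """Render {word: {tag: n}} as 'word-->TAG:n' entries joined by ', '."""
    parts = []
    for word, tags in counts.items():
        # A word can carry several tags, so list each one with its count
        tag_text = " ".join(f"{tag}:{n}" for tag, n in tags.items())
        parts.append(f"{word}-->{tag_text}")
    return ", ".join(parts)


# Illustrative data only, shaped like the output of the loop above
sample = {"The": {"DT": 2}, "said": {"VBD": 2}, "that": {"IN": 2, "DT": 1}}
print(format_counts(sample))
```

Words that appear with more than one tag, such as "that" in the sample sentence, get all their tags listed after the arrow.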

You can also use defaultdict, which removes the need to check whether a key already exists.

from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))
# range(0, len(words), 2) will be [0, 2, 4, 6, ...]
for i in range(0, len(words), 2):
    pos = words[i]
    word = words[i+1]

    # Actual counting !
    counts[word][pos] += 1
print(counts)  # This won't print as pretty but has the same result
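
If the nested defaultdict output is hard to read, one option is to convert it to plain dictionaries before printing. This is a small sketch under the assumption that the counts fit comfortably in memory; the (tag, word) pairs are illustrative stand-ins for the file contents.

```python
from collections import defaultdict
from pprint import pprint

counts = defaultdict(lambda: defaultdict(int))
# Illustrative (tag, word) pairs standing in for the file contents
for pos, word in [("DT", "The"), ("NN", "jury"), ("DT", "The")]:
    counts[word][pos] += 1

# Convert the outer and inner defaultdicts to plain dicts for display
plain = {word: dict(tags) for word, tags in counts.items()}
pprint(plain)
```
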
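
A shorter alternative is to count (word, tag) pairs with Counter:

from collections import Counter

with open('/content/try list.txt', "r") as f:
    text = f.read()
# Remove the '' '' and . . pairs and the untagged 0 so that tags and words alternate
cleantxt = text.replace("''", "").replace(".", "").replace("0", "").split()
# cleantxt[::2] holds the tags, cleantxt[1::2] the words
counts = Counter(zip(cleantxt[1::2], cleantxt[::2]))
print(counts)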

Output:

Counter({('The', 'DT'): 2,
     ('Fulton', 'NNP'): 1,
     ('County', 'NNP'): 1,
     ('Grand', 'NNP'): 1,
     ('Jury', 'NNP'): 1,
     ('said', 'VBD'): 2,
     ('Friday', 'NNP'): 1,
     ('an', 'DT'): 1,
     ('investigation', 'NN'): 1,
     ('of', 'IN'): 1,....
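
The Counter above is keyed by (word, tag) tuples. If a per-word view is needed, the pairs can be regrouped afterwards. A hedged sketch, using a few hand-written counts rather than the real file; the names `pair_counts`, `by_word` and `top_tag` are illustrative only.

```python
from collections import Counter, defaultdict

# Illustrative (word, tag) counts, shaped like the Counter printed above
pair_counts = Counter({("The", "DT"): 2, ("that", "IN"): 2, ("that", "DT"): 1})

# Regroup into word -> Counter of tags
by_word = defaultdict(Counter)
for (word, tag), n in pair_counts.items():
    by_word[word][tag] += n

# most_common(1) returns a one-element list of (tag, count)
top_tag = {word: tags.most_common(1)[0] for word, tags in by_word.items()}
print(top_tag)
```
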

