
Count word/pos from file

I've almost solved an exercise from my Python lectures. I've been asked to write a program that counts how often each word occurs in a file and how often it has been tagged with each POS (part-of-speech) tag. The counts should be written to a new file, whose name is also given on the command line.

For instance,

python3 wordcount-pos.py wsj00-pos.txt counts-wsj00-pos.txt

should produce output like this:

   Mortimer 1   NNP 1

   foul 1   JJ  1

   ...

   reported 16  VBN 7   VBD 9

   ...

   before   26  RB  6   IN  20

   ...

   allow    4   VB  2   VBP 2

My code produces an output such as:

   Mortimer 1   {NNP:   1}

   foul 1   {JJ: 1}

   ...

   reported 2   {VBN:   7   VBD:    9}

   ...

   before   2   {RB:    6   IN: 20}

   ...

   allow    2   {VB:    2   VBP:    2}

The second column does not show how often the word occurs: for "reported" it prints 2 instead of 16.

Here is my code:

import sys
from collections import defaultdict


def main():
    if len(sys.argv) != 3:
        print('Usage: python3 wordcount-pos.py <input file> <output file>', file=sys.stderr)
        sys.exit(1)

    input_filename = sys.argv[1]
    output_filename = sys.argv[2]
    # your code
    freq = defaultdict(list)
    with open(input_filename) as f:
        for line in f:
            # skip empty lines
            if line.strip() != '':
                #  split a word/pos pair into two separate strings
                word, pos = line.strip().rsplit("/", 1)
                # add word and list of pos as k, v into "freq" dictionary
                freq[word].append(pos)

    for k, v in freq.items():
        D = defaultdict(list)
        for i, item in enumerate(v):
            D[item].append(i)
        D = {k: len(v) for k, v in D.items()}
        # Output file
        with open(output_filename, "a") as f:
            print(k + "\t" + str(len(D.items())) + "\t" + str(D), file=f)


if __name__ == '__main__':
    main()

The file the data is extracted from: https://paste.elnota.space/nezemivaku.sql

Partial content of the file:

Pierre/NNP

Vinken/NNP

,/,

61/CD

years/NNS

old/JJ

,/,

will/MD

join/VB

the/DT

board/NN

as/IN

a/DT

nonexecutive/JJ

director/NN

Nov./NNP

29/CD

./.

Mr./NNP Vinken/NNP

is/VBZ

chairman/NN

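I think this will solve your problem. len(D.items()) is the number of distinct tags for the word, whereas sum(D.values()) adds up the per-tag counts and therefore gives the word's total number of occurrences: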

print(k + "\t" + str(sum(D.values())) + "\t" + str(D), file=f)
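
The one-line change fixes the total, but the tags are still printed as a dictionary rather than as plain tag/count columns, and the output file is reopened in append mode for every word, so rerunning the script adds to old results. Below is a minimal sketch of a fuller version, assuming Python 3.7+ (so that counters keep insertion order). The helper names count_word_pos, format_row and write_counts are hypothetical and not part of the original code. The sketch also assumes a line may hold several whitespace-separated word/POS tokens, as the "Mr./NNP Vinken/NNP" line in the sample suggests:

```python
from collections import Counter, defaultdict


def count_word_pos(lines):
    """Map each word to a Counter of the POS tags it was seen with."""
    counts = defaultdict(Counter)
    for line in lines:
        # split() with no arguments yields nothing for blank lines and
        # separates several word/POS tokens that share one line
        for token in line.split():
            if "/" not in token:
                continue  # assumption: silently skip malformed tokens
            # rsplit keeps any "/" that is part of the word itself
            word, pos = token.rsplit("/", 1)
            counts[word][pos] += 1
    return counts


def format_row(word, tags):
    """Build one tab-separated row: word, total, then tag/count pairs."""
    fields = [word, str(sum(tags.values()))]
    for pos, n in tags.items():
        fields += [pos, str(n)]
    return "\t".join(fields)


def write_counts(counts, output_filename):
    # Open the file once in "w" mode so a rerun replaces old results
    with open(output_filename, "w") as f:
        for word, tags in counts.items():
            print(format_row(word, tags), file=f)
```

In main(), this could be used as write_counts(count_word_pos(open_file), output_filename), where open_file is the file object from the existing with block. For the three input lines reported/VBN, reported/VBD, reported/VBD, format_row produces the word "reported" followed by 3, VBN, 1, VBD, 2, separated by tabs.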
