Count word/POS from file

I've almost solved an exercise from my Python lectures. I've been asked to write a program that counts how often each word occurs in a file, and how often it has been tagged with each POS. The counts should be written to a new file, whose name is also given on the command line.

For instance,

python3 wordcount-pos.py wsj00-pos.txt counts-wsj00-pos.txt

should produce output like this:

   Mortimer 1   NNP 1

   foul 1   JJ  1

   ...

   reported 16  VBN 7   VBD 9

   ...

   before   26  RB  6   IN  20

   ...

   allow    4   VB  2   VBP 2

My code produces output such as:

   Mortimer 1   {NNP:   1}

   foul 1   {JJ: 1}

   ...

   reported 2   {VBN:   7   VBD:    9}

   ...

   before   2   {RB:    6   IN: 20}

   ...

   allow    2   {VB:    2   VBP:    2}

It doesn't print the number of occurrences of each word: for "reported", for example, the second column shows 2 instead of 16.

Here is my code:

import sys
from collections import defaultdict


def main():
    if len(sys.argv) != 3:
        print('Usage: python poscount.py <input file>', file=sys.stderr)
        sys.exit(1)

    input_filename = sys.argv[1]
    output_filename = sys.argv[2]
    # your code
    freq = defaultdict(list)
    with open(input_filename) as f:
        for line in f:
            # skip empty lines
            if line.strip() != '':
                #  split a word/pos pair into two separate strings
                word, pos = line.strip().rsplit("/", 1)
                # add word and list of pos as k, v into "freq" dictionary
                freq[word].append(pos)

    for k, v in freq.items():
        D = defaultdict(list)
        for i, item in enumerate(v):
            D[item].append(i)
        D = {k: len(v) for k, v in D.items()}
        # Output file
        with open(output_filename, "a") as f:
            print(k + "\t" + str(len(D.items())) + "\t" + str(D), file=f)


if __name__ == '__main__':
    main()

The file the data is extracted from: https://paste.elnota.space/nezemivaku.sql

Partial content of the file:

Pierre/NNP

Vinken/NNP

,/,

61/CD

years/NNS

old/JJ

,/,

will/MD

join/VB

the/DT

board/NN

as/IN

a/DT

nonexecutive/JJ

director/NN

Nov./NNP

29/CD

./.

Mr./NNP Vinken/NNP

is/VBZ

chairman/NN

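I think this will solve your problem. len(D.items()) is the number of distinct POS tags for the word, while sum(D.values()) adds up the per-tag counts, which gives the word's total number of occurrences. Replace the print line with: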

print(k + "\t" + str(sum(D.values())) + "\t" + str(D), file=f)
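
To also get rid of the braces and match the requested layout (word, total, then tag/count pairs separated by tabs), the counting and formatting could be sketched as below. This is only one possible approach, not the exercise's reference solution: the helper names count_word_pos and format_counts and the sample list are made up for illustration, and the sketch swaps the defaultdict(list) from the question for a Counter per word.

```python
from collections import Counter, defaultdict


def count_word_pos(lines):
    """Map each word to a Counter of the POS tags it was seen with."""
    freq = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        # split on the last "/" so words that contain "/" stay intact
        word, pos = line.rsplit("/", 1)
        freq[word][pos] += 1
    return freq


def format_counts(freq):
    """Yield one tab-separated line per word: word, total, tag, count, ..."""
    for word, tags in freq.items():
        fields = [word, str(sum(tags.values()))]
        for tag, n in tags.items():
            fields += [tag, str(n)]
        yield "\t".join(fields)


# hypothetical sample input, in the same word/POS format as the data file
sample = ["reported/VBN", "reported/VBD", "", "reported/VBD", "foul/JJ"]
for row in format_counts(count_word_pos(sample)):
    print(row)
```

When writing to the output file, it is probably better to open it once with mode "w" before the loop and print each row with file=f. Opening it with "a" inside the loop, as in the question, appends to whatever an earlier run left in the file.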
