Count word/pos from file
I've almost solved an exercise from my Python lectures. I've been asked to write a program that counts how often each word occurs in a file, and how often it has been tagged with each POS. The counts should be written to a new file, whose name is also given on the command line.
For instance,
python3 wordcount-pos.py wsj00-pos.txt counts-wsj00-pos.txt
should produce output like this:
Mortimer 1 NNP 1
foul 1 JJ 1
...
reported 16 VBN 7 VBD 9
...
before 26 RB 6 IN 20
...
allow 4 VB 2 VBP 2
My code produces output like this instead:
Mortimer 1 {NNP: 1}
foul 1 {JJ: 1}
...
reported 2 {VBN: 7 VBD: 9}
...
before 2 {RB: 6 IN: 20}
...
allow 2 {VB: 2 VBP: 2}
It doesn't print the number of occurrences of each word: the second column shows how many distinct POS tags the word has, not the word's total count.
Here is my code:
import sys
from collections import defaultdict

def main():
    if len(sys.argv) != 3:
        print('Usage: python poscount.py <input file>', file=sys.stderr)
        sys.exit(1)
    input_filename = sys.argv[1]
    output_filename = sys.argv[2]
    # your code
    freq = defaultdict(list)
    with open(input_filename) as f:
        for line in f:
            # skip empty lines
            if line.strip() != '':
                # split a word/pos pair into two separate strings
                word, pos = line.strip().rsplit("/", 1)
                # add word and list of pos as k, v into "freq" dictionary
                freq[word].append(pos)
    for k, v in freq.items():
        D = defaultdict(list)
        for i, item in enumerate(v):
            D[item].append(i)
        D = {k: len(v) for k, v in D.items()}
        # Output file
        with open(output_filename, "a") as f:
            print(k + "\t" + str(len(D.items())) + "\t" + str(D), file=f)

if __name__ == '__main__':
    main()
The file the data is extracted from: https://paste.elnota.space/nezemivaku.sql
Partial content of the file:
Pierre/NNP
Vinken/NNP
,/,
61/CD
years/NNS
old/JJ
,/,
will/MD
join/VB
the/DT
board/NN
as/IN
a/DT
nonexecutive/JJ
director/NN
Nov./NNP
29/CD
./.
Mr./NNP Vinken/NNP
is/VBZ
chairman/NN
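I think this will solve your problem. Sum the per-tag counts instead of counting the distinct tags in the print call: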
print(k + "\t" + str(sum(D.values())) + "\t" + str(D), file=f)
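As a side note, the same counting can probably be done more directly with a Counter per word. This is only a sketch: the helper names count_pos and format_entry are made up for illustration, and I'm assuming the fields in the output file are tab-separated, which the expected output in the question doesn't actually confirm.

```python
from collections import Counter, defaultdict


def count_pos(lines):
    """Map each word to a Counter of the POS tags it was seen with."""
    counts = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        # split on the last "/" so that tokens containing "/" still work
        word, pos = line.rsplit("/", 1)
        counts[word][pos] += 1
    return counts


def format_entry(word, tags):
    """Build one line: word, total count, then each POS with its count."""
    total = sum(tags.values())  # total occurrences, not len(tags)
    fields = [word, str(total)]
    for pos, n in tags.items():
        fields += [pos, str(n)]
    return "\t".join(fields)  # tab separator is an assumption


# tiny made-up sample in the same word/POS format as the input file
sample = ["reported/VBN", "reported/VBD", "", "reported/VBD", "foul/JJ"]
counts = count_pos(sample)
for word, tags in counts.items():
    print(format_entry(word, tags))
```

In the real script you would pass the open input file to count_pos and write each formatted line to the output file, opening it once in "w" mode rather than re-opening it in "a" mode for every word, so repeated runs don't keep appending to old results.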