Creating multilevel dictionary POS tagger in Python

I have a text file with POS tags. For example:

"DT The NN dog VB jumps..."

I need to create a dictionary whose keys are the words and whose values are dictionaries with the tags as keys and the frequency of each tag as values. So what I need would look like this:

{'The': {'DT': 47}}, {'dog': {'VB': 32}} ...

I'm at a total loss right now. I've started by taking my text file and splitting it into a list of strings, so that it is a list like

'DT The' 'NN dog' 'VB jumps'

I'm not sure if this is even the right first step. Please help!

This approach should give you the structure you're looking for, with the POS counts being the full count of that tag within the corpus presented.

NOTE: RETAIN_PUNCTUATION_FLAG and RETAIN_CASE_FLAG control whether punctuation is stripped and whether case is folded before evaluation; you can toggle either one or both. Here both are set to False, so all words are handled as lowercase and all ASCII punctuation is stripped before evaluation.

I've also added word_list and pos_list as alternative listings.

from string import punctuation

RETAIN_PUNCTUATION_FLAG = False
RETAIN_CASE_FLAG = False

string = "DT The NN dog VB jumps DT the NN sofa. DT The NN cat VB pages DT the NN page."

punctuation_strip_table = str.maketrans('', '', punctuation)
if RETAIN_CASE_FLAG and RETAIN_PUNCTUATION_FLAG:
    pass
elif RETAIN_CASE_FLAG and not RETAIN_PUNCTUATION_FLAG:
    string = string.translate(punctuation_strip_table)
elif not RETAIN_CASE_FLAG and RETAIN_PUNCTUATION_FLAG:
    string = string.casefold()
elif not RETAIN_CASE_FLAG and not RETAIN_PUNCTUATION_FLAG:
    string = string.casefold().translate(punctuation_strip_table)

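# Tokens alternate tag, word, tag, word, ..., so split them into two parallel lists.
list_all = string.split(' ')
tags = list_all[0::2]
words = list_all[1::2]
pos_word_pairs = set(zip(tags, words))

# Per tag: total occurrences and the distinct words it was applied to.
# Counting within tags (not list_all) avoids also counting a word that
# happens to equal a case-folded tag, e.g. the word "in" and the tag IN.
pos_list = {pos.upper(): {
    'count': tags.count(pos),
    'words': [
        word
        for match_pos, word in pos_word_pairs
        if match_pos == pos]
    }
    for pos in set(tags)}
# Per word: total occurrences and the distinct tags it appeared with.
word_list = {word: {
    'count': words.count(word),
    'pos': [
        pos.upper()
        for pos, match_word in pos_word_pairs
        if match_word == word]
    }
    for word in set(words)}
# Per word: a tag it appeared with and that tag's corpus-wide count.
paired = {
        word: {
            pos.upper():
            tags.count(pos)}
        for pos, word
        in pos_word_pairs}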

print('pos_list:', pos_list)
print()
print('word_list:', word_list)
print()
print('paired:',paired)

Output:

pos_list: {'VB': {'count': 2, 'words': ['pages', 'jumps']}, 'NN': {'count': 4, 'words': ['page', 'dog', 'sofa', 'cat']}, 'DT': {'count': 4, 'words': ['the']}}

word_list: {'dog': {'count': 1, 'pos': ['NN']}, 'cat': {'count': 1, 'pos': ['NN']}, 'jumps': {'count': 1, 'pos': ['VB']}, 'the': {'count': 4, 'pos': ['DT']}, 'page': {'count': 1, 'pos': ['NN']}, 'sofa': {'count': 1, 'pos': ['NN']}, 'pages': {'count': 1, 'pos': ['VB']}}

paired: {'pages': {'VB': 2}, 'jumps': {'VB': 2}, 'the': {'DT': 4}, 'page': {'NN': 4}, 'dog': {'NN': 4}, 'sofa': {'NN': 4}, 'cat': {'NN': 4}}
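
Note that the counts in paired are corpus-wide tag totals (for example 'dog': {'NN': 4}), not the number of times a given word carries a given tag. If per-word tag frequencies are what is wanted, as the question seems to ask, one possible sketch uses collections.Counter. The function name tag_frequencies and its lowercase and strip_punctuation parameters are hypothetical names chosen for illustration. The sketch assumes the input strictly alternates tag, word, tag, word, and it only normalizes the words, leaving the tags untouched. Unlike the code above, it strips punctuation only from the ends of each word.

```python
from collections import Counter, defaultdict
from string import punctuation


def tag_frequencies(text, lowercase=True, strip_punctuation=True):
    """Map each word to a dict of {tag: times the word appeared with that tag}.

    Assumes `text` alternates tag, word, tag, word, ... separated by whitespace.
    """
    tokens = text.split()
    frequencies = defaultdict(Counter)
    # Pair even-indexed tokens (tags) with odd-indexed tokens (words).
    for tag, word in zip(tokens[0::2], tokens[1::2]):
        if strip_punctuation:
            # Removes leading/trailing ASCII punctuation only, e.g. "sofa." -> "sofa".
            word = word.strip(punctuation)
        if lowercase:
            word = word.casefold()
        frequencies[word][tag] += 1
    # Convert to plain dicts so the result prints like the structure in the question.
    return {word: dict(tag_counts) for word, tag_counts in frequencies.items()}


sample = "DT The NN dog VB jumps DT the NN sofa. DT The NN cat VB pages DT the NN page."
result = tag_frequencies(sample)
print(result)
# {'the': {'DT': 4}, 'dog': {'NN': 1}, 'jumps': {'VB': 1}, 'sofa': {'NN': 1},
#  'cat': {'NN': 1}, 'pages': {'VB': 1}, 'page': {'NN': 1}}
```

A word that appears with more than one tag keeps a separate count for each, so tag_frequencies("NN page VB page") gives {'page': {'NN': 1, 'VB': 1}}.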
