How to count plurals and singulars in a given corpus in Python

Question

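I hope you can help me with this task.

I need to count the plurals and singulars in a corpus. I have a corpus whose lines have the following structure: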

['4', 'lanzas', 'lanza', 'NCFP000']

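The first position [0] holds the number (4), the second [1] the form (lanzas), the third [2] the lemma (lanza), and the fourth [3] the category (NCFP000), e.g. verb, noun, etc. So in this corpus every word is annotated with its lemma and category, and the category tells us whether the word is singular, plural, masculine or feminine.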
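For illustration, here is one way a raw line could be turned into those four fields. This is only a minimal sketch: it assumes each line of the file holds four whitespace-separated fields, and the names `Token` and `parse_line` are made up for this example.

```python
from collections import namedtuple

# Field names are illustrative; they mirror the four positions described above.
Token = namedtuple("Token", ["index", "form", "lemma", "category"])


def parse_line(line):
    """Split one corpus line into a Token, or return None if it lacks 4 fields."""
    fields = line.split()
    if len(fields) != 4:
        return None  # blank or malformed line
    return Token(*fields)


token = parse_line("4 lanzas lanza NCFP000\n")
print(token.form, token.lemma, token.category)  # lanzas lanza NCFP000
print(parse_line("\n"))  # None
```

Returning None for short lines avoids an IndexError when the file contains empty lines.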

Here are some examples of lines from the corpus:

['1', 'Cargó', 'cargar', 'VMIS3S0']

['2', 'el', 'el', 'DA0MS0']

['3', 'camión', 'camión', 'NCMS000']

['4', 'con', 'con', 'SP']

['5', 'los', 'el', 'DA0MP0']

['6', 'trastos', 'trasto', 'NCMP000']

['7', 'más', 'más', 'RG']

['8', 'pesados', 'pesado', 'AQ0MP00']

['9', '.', '.', 'Fp']

So, as you can see, the last position [3] holds the word's category; AQ0MP00, for example, means the word is an adjective in the plural.
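To make the tag layout concrete, here is a rough sketch that reads gender and number out of a tag. The character positions are an assumption inferred only from the sample lines above (nouns, adjectives and determiners); the table and the function name `gender_and_number` are hypothetical and do not cover the full tagset.

```python
# Assumed positions of (gender, number) inside a tag, keyed by the tag's
# first letter. Inferred from the sample lines only; not a full tagset.
GENDER_NUMBER_POSITIONS = {
    "N": (2, 3),  # e.g. NCMP000 -> M, P
    "A": (3, 4),  # e.g. AQ0MP00 -> M, P
    "D": (3, 4),  # e.g. DA0MS0  -> M, S
}

GENDERS = {"M": "masculine", "F": "feminine"}
NUMBERS = {"S": "singular", "P": "plural"}


def gender_and_number(tag):
    """Return (gender, number) for a tag, or None if the tag is not covered."""
    positions = GENDER_NUMBER_POSITIONS.get(tag[:1])
    if positions is None or len(tag) <= max(positions):
        return None
    gender = GENDERS.get(tag[positions[0]])
    number = NUMBERS.get(tag[positions[1]])
    if gender is None or number is None:
        return None
    return gender, number


print(gender_and_number("AQ0MP00"))  # ('masculine', 'plural')
print(gender_and_number("NCFP000"))  # ('feminine', 'plural')
print(gender_and_number("SP"))       # None
```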

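My question is how to count plurals and singulars in this case. Specifically, I need to count the following categories across the whole corpus: NCFS000, NCFP000, NCMS000 and NCMP000, i.e. feminine singular, feminine plural, masculine singular and masculine plural nouns.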

Here is what I have tried so far:

corpus = open('F:/python/corpus-morf.txt', 'r')
text = open('F:/python/deberes.txt', 'r')
lines = corpus.readlines()
for i in lines:
    lista = i.split()
    #print(lista)
    p = len(lista)
    if p > 0:
        forma = lista[1].rstrip()
        lema = lista[2].rstrip()
        categoria = lista[3].rstrip()
        aa = [forma, lema, categoria]

I am stuck here.

Do you have any ideas? Thank you very much for your help.

Answer 1

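Here is one way to do it. Note that this counts every category, so you only need to filter the resulting dictionary for the ones you care about: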

from collections import Counter

corpus = [
  ['1', 'Cargó', 'cargar', 'VMIS3S0'],
  ['2', 'el', 'el', 'DA0MS0'],
  ['3', 'camión', 'camión', 'NCMS000'],
  ['4', 'con', 'con', 'SP'],
  ['5', 'los', 'el', 'DA0MP0'],
  ['6', 'trastos', 'trasto', 'NCMP000'],
  ['7', 'más', 'más', 'RG'],
  ['8', 'pesados', 'pesado', 'AQ0MP00'],
]

print(Counter(x[3] for x in corpus))

Counter({'VMIS3S0': 1, 'DA0MS0': 1, 'NCMS000': 1, 'SP': 1, 'DA0MP0': 1, 'NCMP000': 1, 'RG': 1, 'AQ0MP00': 1})
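The filtering step mentioned above might look something like this. It is a sketch, not a definitive implementation: the helper names `count_noun_tags` and `singular_plural_totals` are invented here, and the singular/plural split assumes the number sits in the fourth character of these noun tags, as in the examples from the question.

```python
from collections import Counter

# The four noun tags named in the question.
NOUN_TAGS = ("NCFS000", "NCFP000", "NCMS000", "NCMP000")


def count_noun_tags(rows):
    """Count only the four noun tags; rows are [index, form, lemma, category] lists."""
    counts = Counter(row[3] for row in rows if len(row) == 4)
    # Indexing a Counter with a missing key yields 0, so absent tags show up as 0.
    return {tag: counts[tag] for tag in NOUN_TAGS}


def singular_plural_totals(tag_counts):
    """Collapse per-tag counts into totals, reading number from the 4th character."""
    totals = {"singular": 0, "plural": 0}
    for tag, n in tag_counts.items():
        totals["plural" if tag[3] == "P" else "singular"] += n
    return totals


rows = [
    ["3", "camión", "camión", "NCMS000"],
    ["5", "los", "el", "DA0MP0"],
    ["6", "trastos", "trasto", "NCMP000"],
    ["4", "lanzas", "lanza", "NCFP000"],
]
noun_counts = count_noun_tags(rows)
print(noun_counts)
# {'NCFS000': 0, 'NCFP000': 1, 'NCMS000': 1, 'NCMP000': 1}
print(singular_plural_totals(noun_counts))
# {'singular': 1, 'plural': 2}
```

Since `rows` can be any iterable, the lists produced by calling `split()` on each line of the corpus file should work here as well, and lines without exactly four fields are skipped.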

Solution 1
0 2018-11-01 11:34:37
