POS tagger in Python without NLTK

I am trying to make a POS tagger for the determiners and prepositions of Sorani Kurdish. I am using the following code to put a tag after each preposition or determiner in my Kurdish text.

import os
SOR = open("SOR-1.txt", "r+", encoding = 'utf-8')
old_text = SOR.read()
punkt = [".", "!", ",", ":", ";"]
text = ""
for i in old_text:
    if i in punkt:
        text+=" "+i
    else:
        text += i

d = {"DET":["ئێمە" , "ئێوە" , "ئەم" , "ئەو" , "ئەوان" , "ئەوەی", "چەند" ], "PREP":["بۆ","بێ","بێجگە","بە","بەبێ","بەدەم","بەردەم","بەرلە","بەرەوی","بەرەوە","بەلای","بەپێی","تۆ","تێ","جگە","دوای","دەگەڵ","سەر","لێ","لە","لەبابەت","لەباتی","لەبارەی","لەبرێتی","لەبن","لەبەینی","لەبەر","لەدەم","لەرێ","لەرێگا","لەرەوی","لەسەر","لەلایەن","لەناو","لەنێو","لەو","لەپێناوی","لەژێر","لەگەڵ","ناو","نێوان","وەک","وەک","پاش","پێش","" ], "punkt":[".", ",", "!"]}

text = text.split()
for w in text:
    for pos in d:
        if w in d[pos]:
            SOR.write(w+"/"+pos+" ")
SOR.close()

What I want to do is add the POS tags inside the text, after each of the words in the defined dictionary, but the result is a separate list of words and POS tags at the end of the file.

Keep in mind that old_text is a single string, so when you loop through it as in

for i in old_text:
    if i in punkt:

you are looping through characters. I think you intend to loop through the lines of old_text instead. If that is the case, you could open the file using a with statement specifying read and write modes. Something like:

punkt = [".", "!", ",", ":", ";"]   # same list as in the question

with open("SOR-1.txt", "r+", encoding="utf-8") as f:
    old_text = f.readlines()
    for line in old_text:
        # Every line read from the file is terminated with a newline character '\n'.
        stripped = line.strip("\n")
        for punctuation_mark in punkt:
            if punctuation_mark in stripped:
                pass  # give more instructions here
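
As for why the tags end up at the end of the file: after SOR.read() the file position is at the end, so every SOR.write() call appends there, and only the words that match the dictionary are written. One possible way to get the tags inline is to build the whole tagged text in memory and then write it back in one go. The following is only a sketch under a few assumptions: LEXICON is a shortened stand-in for the dictionary d from the question, and tag_text and tag_file are hypothetical helper names, not part of any library.

```python
import re

# Shortened stand-in for the dictionary `d` in the question (assumption:
# the full word lists would be used in practice).
LEXICON = {
    "DET": ["ئەم", "ئەو"],
    "PREP": ["بۆ", "لە", "لەسەر"],
}

# Invert to word -> tag for direct lookup; empty strings are skipped.
WORD_TO_TAG = {
    word: tag for tag, words in LEXICON.items() for word in words if word
}


def tag_text(text, word_to_tag=WORD_TO_TAG):
    """Return `text` with '/TAG' appended to every token found in `word_to_tag`."""
    # Put a space before each punctuation mark so it becomes its own token.
    spaced = re.sub(r"([.!,:;])", r" \1", text)

    def add_tag(match):
        token = match.group(0)
        tag = word_to_tag.get(token)
        return f"{token}/{tag}" if tag else token

    # Rewrite each run of non-whitespace characters; whitespace and line
    # breaks are left exactly as they were.
    return re.sub(r"\S+", add_tag, spaced)


def tag_file(path):
    """Read the file at `path`, tag it, and overwrite it with the tagged text."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(tag_text(text))
```

Calling tag_file("SOR-1.txt") would overwrite the original file, so it is probably safer to keep a copy or write to a different path while experimenting. Untagged words stay in the text unchanged, which is what places the tags inside the text rather than in a separate list.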
