
Python-crfsuite labeling in fixed pattern

I'm trying to create a CRF model that segments Japanese sentences into words. At the moment I'm not worried about perfect results, as it's just a test. Training goes fine, but when it's finished the model always gives the same guess for every sentence I try to tag.

"""Labels: X: Character is mid word, S: Character starts a word, E: Character ends a word, O: One character word"""
    Sentence:広辞苑や大辞泉には次のようにある。
    Prediction:['S', 'X', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E']
    Truth:['S', 'X', 'E', 'O', 'S', 'X', 'E', 'O', 'O', 'O', 'O', 'S', 'E', 'O', 'S', 'E', 'O']
    Sentence:他にも、言語にはさまざまな分類がある。
    Prediction:['S', 'X', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E']
    Truth:['O', 'O', 'O', 'O', 'S', 'E', 'O', 'O', 'S', 'X', 'X', 'X', 'E', 'S', 'E', 'O', 'S', 'E', 'O']
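
For reference, the truth sequences above follow mechanically from the gold word segmentation. A small helper (hypothetical; not part of the original question) that derives the SXEO labels from a segmented sentence:

```python
def word_labels(words):
    """Derive per-character SXEO labels from a word-segmented sentence.

    O = one-character word, S = starts a word, X = mid-word, E = ends a word.
    """
    labels = []
    for w in words:
        if len(w) == 1:
            labels.append('O')
        else:
            labels.extend(['S'] + ['X'] * (len(w) - 2) + ['E'])
    return labels
```

For example, `word_labels(['広辞苑', 'や', '大辞泉'])` reproduces the first seven labels of the first truth sequence above.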

When looking at the transition info for the model:

{('E', 'E'): -3.820618,
 ('E', 'O'): 3.414133,
 ('E', 'S'): 2.817927,
 ('E', 'X'): -3.056175,
 ('O', 'E'): -4.249522,
 ('O', 'O'): 2.583123,
 ('O', 'S'): 2.601341,
 ('O', 'X'): -4.322003,
 ('S', 'E'): 7.05034,
 ('S', 'O'): -4.817578,
 ('S', 'S'): -4.400028,
 ('S', 'X'): 6.104851,
 ('X', 'E'): 4.985887,
 ('X', 'O'): -5.141898,
 ('X', 'S'): -4.499069,
 ('X', 'X'): 4.749289}
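
Under the SXEO scheme a word is either O or S X* E, so the legal label bigrams can be enumerated and checked mechanically against the weights above (a sketch; the dict is copied from the output shown):

```python
transitions = {
    ('E', 'E'): -3.820618, ('E', 'O'): 3.414133,
    ('E', 'S'): 2.817927,  ('E', 'X'): -3.056175,
    ('O', 'E'): -4.249522, ('O', 'O'): 2.583123,
    ('O', 'S'): 2.601341,  ('O', 'X'): -4.322003,
    ('S', 'E'): 7.05034,   ('S', 'O'): -4.817578,
    ('S', 'S'): -4.400028, ('S', 'X'): 6.104851,
    ('X', 'E'): 4.985887,  ('X', 'O'): -5.141898,
    ('X', 'S'): -4.499069, ('X', 'X'): 4.749289,
}

# Since every word is either O or S X* E, only these bigrams can occur
# in a well-formed label sequence:
VALID = {('E', 'O'), ('E', 'S'), ('O', 'O'), ('O', 'S'),
         ('S', 'E'), ('S', 'X'), ('X', 'E'), ('X', 'X')}

# Every positive weight is a legal bigram, every negative one is illegal.
assert all((w > 0) == (pair in VALID) for pair, w in transitions.items())
```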

This looks good, since all the transitions with negative values are impossible ones: E -> X, for example, would go from the end of a word to the middle of the following one. S -> E has the highest value, and as seen above the model simply falls into a pattern of labeling S then E repeatedly until the sentence ends. I followed this demo when trying this, though that demo is for segmenting Latin text. My features are similarly just n-grams:

['bias',
 'char=ま',
 '-2-gram=さま',
 '-3-gram=はさま',
 '-4-gram=にはさま',
 '-5-gram=語にはさま',
 '-6-gram=言語にはさま',
 '2-gram=まざ',
 '3-gram=まざま',
 '4-gram=まざまな',
 '5-gram=まざまな分',
 '6-gram=まざまな分類']
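
A feature-extraction function consistent with that list might look like the following (a sketch; the exact function from the question isn't shown):

```python
def char_features(s, i):
    """Backward and forward character n-grams around position i,
    in the same order as the feature list above."""
    feats = ['bias', 'char=' + s[i]]
    # Backward n-grams: the n characters ending at position i.
    for n in range(2, 7):
        if i - n + 1 >= 0:
            feats.append('-{}-gram={}'.format(n, s[i - n + 1:i + 1]))
    # Forward n-grams: the n characters starting at position i.
    for n in range(2, 7):
        if i + n <= len(s):
            feats.append('{}-gram={}'.format(n, s[i:i + n]))
    return feats
```

Calling `char_features('他にも、言語にはさまざまな分類がある。', 9)` (the first ま) reproduces the list above.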

I've tried changing the labels to just S and X (start and other), but this just causes the model to repeat S, X, S, X until it runs out of characters. I've gone up to 6-grams in both directions, which took a lot longer but didn't change anything. I've tried training for more iterations and changing the L1 and L2 constants a bit. I've trained on up to 100,000 sentences, which is about as far as I can go, as doing so takes almost all 16 GB of my RAM. Are my features structured wrong? How do I get the model to stop guessing in a pattern, and is that even what's happening? Help would be appreciated, and let me know if I need to add more info to the question.

Turns out I was missing a step. I was passing raw sentences to the tagger rather than passing features. Because the CRF can apparently accept a character string as if it were a list of nearly featureless entries, it was just defaulting to guessing the highest-rated transition rather than raising an error. I'm not sure if this will help anyone else, given it was a silly mistake, but I'll put an answer here until I decide whether or not I want to remove the question.
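
The mistake is easy to reproduce in miniature: a Python string is itself a sequence, so iterating it yields single characters, and each "item" the tagger sees is a lone character instead of a feature list. A sketch of the two input shapes (`char_features` here is a hypothetical stand-in for the real extraction function):

```python
sentence = '広辞苑や大辞泉には次のようにある。'

# Wrong: tagger.tag(sentence) iterates the raw string, so every item
# is a single character acting as a nearly featureless entry.
wrong_input = list(sentence)

def char_features(s, i):
    # Minimal stand-in for the real feature extractor.
    return ['bias', 'char=' + s[i]]

# Right: tagger.tag(right_input) sees one feature list per character,
# matching the shape the model was trained on.
right_input = [char_features(sentence, i) for i in range(len(sentence))]
```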
