
Python-crfsuite labeling in fixed pattern

I'm trying to create a CRF model that segments Japanese sentences into words. At the moment I'm not worried about perfect results as it's just a test. The training runs fine, but the finished model gives essentially the same guess for every sentence I try to tag.

"""Labels: X: Character is mid word, S: Character starts a word, E:Character ends a word, O: One character word"""
    Sentence:広辞苑や大辞泉には次のようにある。
    Prediction:['S', 'X', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E']
    Truth:['S', 'X', 'E', 'O', 'S', 'X', 'E', 'O', 'O', 'O', 'O', 'S', 'E', 'O', 'S', 'E', 'O']
    Sentence:他にも、言語にはさまざまな分類がある。
    Prediction:['S', 'X', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E']
    Truth:['O', 'O', 'O', 'O', 'S', 'E', 'O', 'O', 'S', 'X', 'X', 'X', 'E', 'S', 'E', 'O', 'S', 'E', 'O']

When looking at the transition info for the model:

{('E', 'E'): -3.820618,
 ('E', 'O'): 3.414133,
 ('E', 'S'): 2.817927,
 ('E', 'X'): -3.056175,
 ('O', 'E'): -4.249522,
 ('O', 'O'): 2.583123,
 ('O', 'S'): 2.601341,
 ('O', 'X'): -4.322003,
 ('S', 'E'): 7.05034,
 ('S', 'O'): -4.817578,
 ('S', 'S'): -4.400028,
 ('S', 'X'): 6.104851,
 ('X', 'E'): 4.985887,
 ('X', 'O'): -5.141898,
 ('X', 'S'): -4.499069,
 ('X', 'X'): 4.749289}
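
A minimal sketch of how transition weights like these can be read from a trained pycrfsuite model via the tagger's info() dump (the model filename here is just a placeholder):

    import pycrfsuite

    tagger = pycrfsuite.Tagger()
    tagger.open('segmenter.crfsuite')  # placeholder model filename

    # info().transitions maps (from_label, to_label) -> learned weight
    for (src, dst), weight in sorted(tagger.info().transitions.items()):
        print('{} -> {}: {:.6f}'.format(src, dst, weight))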

These weights look reasonable, since all the transitions with negative values are genuinely impossible ones, e.g. E -> X, going from the end of one word straight to the middle of the next. S -> E has the highest value, and as seen above the model simply falls into a pattern of labeling S then E repeatedly until the sentence ends. I followed this demo when trying this, though that demo is for segmenting Latin text. My features are similarly just character n-grams (a sketch of the extraction follows the list below):

['bias',
 'char=ま',
 '-2-gram=さま',
 '-3-gram=はさま',
 '-4-gram=にはさま',
 '-5-gram=語にはさま',
 '-6-gram=言語にはさま',
 '2-gram=まざ',
 '3-gram=まざま',
 '4-gram=まざまな',
 '5-gram=まざまな分',
 '6-gram=まざまな分類']
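
Concretely, the extraction is something like this sketch (simplified; the function names are just illustrative, not my exact code):

    def char_features(sentence, i, max_n=6):
        """Build n-gram features for the character at position i."""
        features = ['bias', 'char=' + sentence[i]]
        for n in range(2, max_n + 1):
            # n-gram ending at position i (looking backwards)
            if i - n + 1 >= 0:
                features.append('-{}-gram={}'.format(n, sentence[i - n + 1:i + 1]))
            # n-gram starting at position i (looking forwards)
            if i + n <= len(sentence):
                features.append('{}-gram={}'.format(n, sentence[i:i + n]))
        return features

    def sentence_to_features(sentence):
        return [char_features(sentence, i) for i in range(len(sentence))]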

I've tried changing the labels to just S and X (start and other), but that just causes the model to repeat S, X, S, X until it runs out of characters. I've gone up to 6-grams in both directions, which took a lot longer but didn't change anything. I've also tried training for more iterations and tweaking the L1 and L2 constants a bit (a rough picture of the training setup is below). I've trained on up to 100,000 sentences, which is about as far as I can go since it takes almost all 16GB of my RAM. Are my features structured wrong? How do I get the model to stop guessing in a fixed pattern, and is that even what's happening? Help would be appreciated, and let me know if I need to add more info to the question.
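
For context, the training side is the standard python-crfsuite setup, roughly like the sketch below; the c1/c2 (L1/L2) and iteration values are illustrative, not the exact ones I used, and the toy training set here is just the two gold-labelled sentences from above:

    import pycrfsuite

    # Toy training set: the two gold-labelled sentences quoted above.
    training_data = [
        ('広辞苑や大辞泉には次のようにある。',
         ['S', 'X', 'E', 'O', 'S', 'X', 'E', 'O', 'O', 'O', 'O', 'S', 'E', 'O', 'S', 'E', 'O']),
        ('他にも、言語にはさまざまな分類がある。',
         ['O', 'O', 'O', 'O', 'S', 'E', 'O', 'O', 'S', 'X', 'X', 'X', 'E', 'S', 'E', 'O', 'S', 'E', 'O']),
    ]

    trainer = pycrfsuite.Trainer(verbose=False)
    for sentence, labels in training_data:
        trainer.append(sentence_to_features(sentence), labels)  # sketch from above

    trainer.set_params({
        'c1': 1.0,              # L1 regularization (illustrative value)
        'c2': 1e-3,             # L2 regularization (illustrative value)
        'max_iterations': 100,
        'feature.possible_transitions': True,
    })
    trainer.train('segmenter.crfsuite')  # placeholder filename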

Turns out I was missing a step: I was passing raw sentences to the tagger rather than feature sequences. The tagger will apparently accept a plain character string as if it were a list of almost featureless items, so instead of raising an error it just falls back on the highest-weighted transitions, which is exactly the repeating pattern above. I'm not sure this will help anyone else given it was a simple mistake, but I'll leave the answer here until I decide whether or not I want to remove the question.
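
In other words, the fix is just to run the sentence through the same feature extraction before tagging. A rough before/after sketch, reusing the sentence_to_features sketch from above and a placeholder model filename:

    import pycrfsuite

    tagger = pycrfsuite.Tagger()
    tagger.open('segmenter.crfsuite')  # placeholder model filename

    sentence = '広辞苑や大辞泉には次のようにある。'

    # Wrong: a raw string is treated as a sequence of one-character,
    # nearly featureless items, so the model leans entirely on the
    # transition weights and produces the repeating S/E pattern.
    # prediction = tagger.tag(sentence)

    # Right: convert the sentence to per-character features first.
    prediction = tagger.tag(sentence_to_features(sentence))
    print(prediction)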
