CRFSuite how much training data?

Question

Hi, I am training a CRF using crfsuite on some sample Latin text. I tagged the training data with O, PERSON and PLACE. When I test my trained model, everything is predicted as O. I suspect this is because I do not have enough training data: my training set is only 3760 bytes. (I know that is very little. Would it stop a CRF from working?)

    import sklearn_crfsuite

    def word2features2(sent, i):
        word = sent[i][1]  # the word token; each item in sent is a (tag, token) pair
        # a list of features per word
        features = [
            # features of the current token
            'bias',
            'word.lower=' + word.lower(),
            'word[-3:]=' + word[-3:],  # suffix substrings
            'word[-2:]=' + word[-2:],
            'word.isupper=%s' % word.isupper(),
            'word.istitle=%s' % word.istitle(),
            'word.isdigit=%s' % word.isdigit()
        ]
        if i > 0:  # not the first word in the sentence
            word1 = sent[i-1][1]  # the previous word
            features.extend([
                '-1:word.lower=' + word1.lower(),
                '-1:word.istitle=%s' % word1.istitle(),
                '-1:word.isupper=%s' % word1.isupper()
            ])
        else:
            features.append('BOS')  # first word in the sentence: Beginning of Sentence

        if i < len(sent)-1:  # not the last word in the sentence
            word1 = sent[i+1][1]  # the next word
            features.extend([
                '+1:word.lower=' + word1.lower(),
                '+1:word.istitle=%s' % word1.istitle(),
                '+1:word.isupper=%s' % word1.isupper()
            ])
        else:
            features.append('EOS')  # last word in the sentence: End of Sentence

        return features

    # each sentence is passed through the feature function
    def get_features(sent):
        return [word2features2(sent, i) for i in range(len(sent))]

    # get the POS/NER tags for each token in a sentence
    def get_tags(sent):
        return [tag for tag, token in sent]

    X_train = [get_features(s) for s in TRAIN_DATA]
    y_train = [get_tags(s) for s in TRAIN_DATA]

    crf = sklearn_crfsuite.CRF(
        algorithm='lbfgs',
        c1=0.1,
        c2=0.1,
        all_possible_transitions=False
    )

    crf.fit(X_train, y_train)

    text4 = 'Azar Nifusi Judeus de civitate Malte presens etc. non vi sed sponte etc. incabellavit et ad cabellam habere concessit ac dedit, tradidit et assignavit Nicolao Delia et Lemo suo filio presentibus etc. terras ipsius Azar vocatas Ta Xellule et Ginen Chagem in contrata Deyr Issafisaf cum iuribus suis omnibus <etc.> pro annis decem continuo sequturis numerandis a medietate mensis Augusti primo preteriti in antea pro salmis octo frumenti <sue> pro qualibet ayra provenientibus ex dictis terris \ad racionem, videlicet, de salmis sexdecim/ quas salmas octo frumenti in qualibet ayra \dicti cabelloti/ promiserunt dare et assignare prefato <Nicol.> Azar et eciam dicti cabelloti anno quolibet promiserunt et tenentur eidem Azar dare et deferre cum eiusdem Azar somerio salmas decem spinarum ac eciam prefat cabelloti promiserunt eidem Azar in qualibet ayra provenient[ium] ex dictis terris dare duas salmas palie in ayra et dictus <cabellotus promisit> Azar promisit eisdem cabellotis suis non spoliantur de dicta cabella neque via alienacionis neque alia quavis via [f. 5v / p. 8] et eciam promisit suis expensis dictas terras circumdare muro et dicti cabellotis tenentur in medio ipsius dicte ingabellationis dare \dicto Azar pro causa predicta/ dimidiam salmam frumenti et eciam promisit durantibus dictis annis decem dictas terras non reincabellare alicui persone et eciam tenentur revidere et curatareb ad circumfaciendas dictas terras \muro/ ad expensas tamen dicti Judei. Que omnia etc. Promiserunt etc. Obligando etc. Renunciando etc. Unde etc.'

    y_pred = crf.predict(text4)

Answer 1

Well, as with any machine learning model, a very small training set will lead to underfitting, and that could be what's happening here. However, every token being predicted as the same value suggests to me that there is some error in the code itself.
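One candidate for such an error, going only by the code in the question: the model is trained on per-token feature lists, but `crf.predict(text4)` receives the raw string. Below is a minimal sketch of putting the test text into the same shape as the training input. It assumes whitespace tokenization and a placeholder tag (both assumptions, since the question does not show how TRAIN_DATA was built), and `token_features` is a hypothetical, cut-down stand-in for `word2features2`.

```python
def token_features(sent, i):
    """Hypothetical, reduced version of word2features2 for (tag, token) pairs."""
    word = sent[i][1]
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word.istitle=%s' % word.istitle(),
    ]
    # Context features, or a boundary marker at either end of the sentence.
    features.append('BOS' if i == 0 else '-1:word.lower=' + sent[i - 1][1].lower())
    features.append('EOS' if i == len(sent) - 1 else '+1:word.lower=' + sent[i + 1][1].lower())
    return features


def text_to_features(text):
    """Turn raw text into one sequence of per-token feature lists."""
    # Assumption: whitespace tokenization; 'O' is only a placeholder tag.
    sent = [('O', token) for token in text.split()]
    return [token_features(sent, i) for i in range(len(sent))]


sample = 'Azar Nifusi Judeus de civitate Malte'
X_test = [text_to_features(sample)]  # a list of sequences, like X_train

print(len(X_test), len(X_test[0]))
print(X_test[0][0])
# With a fitted model, the call would then be: y_pred = crf.predict(X_test)
```

The point is only the shape: prediction input should be a list of sequences, each a list of per-token features built by the same function used for training.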

def get_features(sent):
    return [word2features2(sent, i) for i in range(len(sent))]

X_train = [get_features(s) for s in TRAIN_DATA]

So here it looks like you're passing every index up to the length of `sent` as "i" in your word2features2 function. If `sent` is a plain string, those are character positions rather than word positions. I think you probably want to pass the sentence in as a list of words, so try

def get_features(sent):
    word_list = sent.split(" ")
    return [word2features2(word_list, i) for i in range(len(word_list))]
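
Whichever form the data takes, the index range has to come from the token list: on a string, `len` counts characters, not words. A small illustration, using a made-up sentence:

```python
sentence = 'this is a sentence'
word_list = sentence.split(" ")

# len() on the string counts characters, including spaces.
print(len(sentence))
# len() on the split list counts words.
print(len(word_list))

# Indexing the word list with character positions overruns it.
try:
    [word_list[i] for i in range(len(sentence))]
except IndexError as exc:
    print('IndexError:', exc)

# Indexing with word positions covers each token exactly once.
tokens = [word_list[i] for i in range(len(word_list))]
print(tokens)
```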

I'm assuming your training data is a list of sentences in this case, and not a list of lists of words, like

train_data = ['this is a sentence', 'this is also a sentence'] <= yours
train_data = [['this','is','a','sentence'],['this','is','also','a','sentence']] <= not yours

To be fair, I don't really know what your training data looks like, so the

word = sent[i][1]

line looks a bit fishy to me as well.
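
What `sent[i][1]` returns depends entirely on the shape of each training sentence. The sketch below tries it on three made-up shapes. The (tag, token) pair shape is the one that `get_tags` in the question implies, but the actual TRAIN_DATA is not shown, so the sample values and the helper name `second_of_item` are illustrations only.

```python
def second_of_item(sent, i):
    """Mimic the question's `sent[i][1]` lookup and report what comes back."""
    try:
        return sent[i][1]
    except IndexError:
        return None  # the item at position i has no second element


as_string = 'this is a sentence'               # a plain string
as_words = ['this', 'is', 'a', 'sentence']     # a list of words
as_pairs = [('PERSON', 'Azar'), ('O', 'de')]   # a list of (tag, token) pairs

print(second_of_item(as_string, 0))  # a single character has no index 1
print(second_of_item(as_words, 0))   # second character of the first word
print(second_of_item(as_pairs, 0))   # the token of the first pair
```

Only the pair shape yields a whole word, so the lookup is fine if the data really is a list of (tag, token) pairs and wrong for the other two shapes.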

solution1
0 2018-08-13 16:52:21
