
CRFSuite how much training data?

Hi, I am using crfsuite to train a CRF on some sample data in Latin text. I tagged the training data with O, PERSON and PLACE. When testing my trained model, every prediction I get back is O. I suspect this is because I don't have enough training data. My training set is only 3760 bytes. (I know that's tiny! Would that keep the CRF from working properly?)

def word2features2(sent, i):
    word = sent[i][1] # getting the word token
    # a list of features per word
    features = [
        #features of current token
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:], #substrings
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit()
    ]
    if i > 0: #if the sentence is composed of more than one word
        word1 = sent[i-1][1] #get features of previous word
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper()
        ])
    else:
        features.append('BOS') #in case it is the first word in the sentence - Beginning of Sentence

    if i < len(sent)-1: #if the end of the sentence is not reached
        word1 = sent[i+1][1] #get the features of the next word
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper()
        ])
    else:
        features.append('EOS') #in case it is the last word in the sentence - End of Sentence

    return features

# each sentence is passed through the feature function
def get_features(sent):
    return [word2features2(sent, i) for i in range(len(sent))]

# get the POS/NER tag for each token in a sentence
def get_tags(sent):
    return [tag for tag, token in sent]

X_train = [get_features(s) for s in TRAIN_DATA]
y_train = [get_tags(s) for s in TRAIN_DATA]

import sklearn_crfsuite  # requires the sklearn-crfsuite package

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    all_possible_transitions=False
)

crf.fit(X_train, y_train)

text4 = 'Azar Nifusi Judeus de civitate Malte presens etc. non vi sed sponte etc. incabellavit et ad cabellam habere concessit ac dedit, tradidit et assignavit Nicolao Delia et Lemo suo filio presentibus etc. terras ipsius Azar vocatas Ta Xellule et Ginen Chagem in contrata Deyr Issafisaf cum iuribus suis omnibus <etc.> pro annis decem continuo sequturis numerandis a medietate mensis Augusti primo preteriti in antea pro salmis octo frumenti <sue> pro qualibet ayra provenientibus ex dictis terris \ad racionem, videlicet, de salmis sexdecim/ quas salmas octo frumenti in qualibet ayra \dicti cabelloti/ promiserunt dare et assignare prefato <Nicol.> Azar et eciam dicti cabelloti anno quolibet promiserunt et tenentur eidem Azar dare et deferre cum eiusdem Azar somerio salmas decem spinarum ac eciam prefat cabelloti promiserunt eidem Azar in qualibet ayra provenient[ium] ex dictis terris dare duas salmas palie in ayra et dictus <cabellotus promisit> Azar promisit eisdem cabellotis suis non spoliantur de dicta cabella neque via alienacionis neque alia quavis via [f. 5v / p. 8] et eciam promisit suis expensis dictas terras circumdare muro et dicti cabellotis tenentur in medio ipsius dicte ingabellationis dare \dicto Azar pro causa predicta/ dimidiam salmam frumenti et eciam promisit durantibus dictis annis decem dictas terras non reincabellare alicui persone et eciam tenentur revidere et curatareb ad circumfaciendas dictas terras \muro/ ad expensas tamen dicti Judei. Que omnia etc. Promiserunt etc. Obligando etc. Renunciando etc. Unde etc.'

y_pred = crf.predict(text4)
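As an aside: `crf.predict` expects the same structure as `X_train` (a list of sentences, each a sequence of feature lists), not a raw string, so `text4` would need to be tokenized and run through the same feature pipeline first. A minimal, self-contained sketch, assuming whitespace tokenization and dummy `'O'` tags to match the `(tag, token)` shape that `word2features2` indexes with `sent[i][1]` (abridged feature function for brevity):

```python
def word2features2(sent, i):
    """Abridged version of the feature function above."""
    word = sent[i][1]  # token is the second element of a (tag, token) pair
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word.istitle=%s' % word.istitle(),
    ]
    if i == 0:
        features.append('BOS')  # beginning of sentence
    if i == len(sent) - 1:
        features.append('EOS')  # end of sentence
    return features

def get_features(sent):
    return [word2features2(sent, i) for i in range(len(sent))]

text4 = 'Azar Nifusi Judeus de civitate Malte'  # shortened for illustration
tokens = [('O', tok) for tok in text4.split()]  # dummy tags, same shape as training data
X_test = [get_features(tokens)]                 # a list containing ONE sentence

# y_pred = crf.predict(X_test)  # now matches the shape crf.fit received
```

With this shape, `X_test` is a one-element list whose single entry has one feature list per token, which is what the trained model was fit on.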

Well, as with any machine learning model, a very small training set will lead to underfitting, and that may be what is happening here. That said, the fact that every prediction comes back as the same value suggests to me that there is a bug in the code itself.

def get_features(sent):
    return [word2features2(sent, i) for i in range(len(sent))]

X_train = [get_features(s) for s in TRAIN_DATA]

So it looks like you are passing the length of each word as 'i' into the word2features2 function. I think you probably want to pass the sentence in as a list of words instead, so try:

def get_features(sent):
    word_list = sent.split(" ")
    return [word2features2(word_list, i) for i in range(len(word_list))]

In this case I am assuming your training data is a list of sentences rather than a list of word lists, i.e. like this:

train_data = ['this is a sentence', 'this is also a sentence'] <= yours
train_data = [['this','is','a','sentence'],['this','is','also','a','sentence']] <= not yours

To be fair, I don't really know what your training data actually looks like, so the

word = sent[i][1]

line also looks a bit suspicious to me.
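For completeness, here is a minimal sketch of a training-data shape that would be consistent with both `word = sent[i][1]` and with `get_tags` unpacking `(tag, token)` pairs. This is an assumption about the asker's data, not something the question confirms:

```python
# Hypothetical training data: each sentence is a list of (tag, token)
# tuples, so sent[i][1] yields the token and get_tags yields the tags.
TRAIN_DATA = [
    [('PERSON', 'Azar'), ('PERSON', 'Nifusi'), ('O', 'de'), ('PLACE', 'Malte')],
]

def get_tags(sent):
    # unpack each (tag, token) pair and keep only the tag
    return [tag for tag, token in sent]

y_train = [get_tags(s) for s in TRAIN_DATA]
# y_train == [['PERSON', 'PERSON', 'O', 'PLACE']]
```

If the real data is instead a list of plain strings, both `sent[i][1]` (which would grab the second character of a word) and `get_tags` (which would try to unpack each word into two variables) would misbehave, which fits the symptom of uniform predictions.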
