How can I train a CRF on two datasets with pycrfsuite?

I have two datasets: dataset A and dataset B. I want to use pycrfsuite to train a conditional random field (CRF) on dataset A, then continue training the same CRF on dataset B. Is it possible to achieve that with pycrfsuite?

I do not want to train the CRF on both datasets at the same time.

I know how to train a CRF on one dataset with pycrfsuite (https://github.com/scrapinghub/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb):

'''Tested with python 2.7 64-bit
Code from https://github.com/scrapinghub/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb
sudo pip install nltk python-crfsuite scikit-learn
sudo python -m nltk.downloader conll2002
'''
from __future__ import print_function
from __future__ import division

from itertools import chain
import nltk
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelBinarizer
import sklearn
import pycrfsuite
import time

print(sklearn.__version__)
nltk.corpus.conll2002.fileids()

train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))

def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag,
        'postag[:2]=' + postag[:2],
    ]
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:postag=' + postag1,
            '-1:postag[:2]=' + postag1[:2],
        ])
    else:
        features.append('BOS')

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:postag=' + postag1,
            '+1:postag[:2]=' + postag1[:2],
        ])
    else:
        features.append('EOS')

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]


def bio_classification_report(y_true, y_pred):
    """
    Classification report for a list of BIO-encoded sequences.
    It computes token-level metrics and discards "O" labels.

    Note that it requires scikit-learn 0.15+ (or a version from github master)
    to calculate averages properly!
    """
    lb = LabelBinarizer()
    y_true_combined = lb.fit_transform(list(chain.from_iterable(y_true)))
    y_pred_combined = lb.transform(list(chain.from_iterable(y_pred)))

    tagset = set(lb.classes_) - {'O'}
    tagset = sorted(tagset, key=lambda tag: tag.split('-', 1)[::-1])
    class_indices = {cls: idx for idx, cls in enumerate(lb.classes_)}

    return classification_report(
        y_true_combined,
        y_pred_combined,
        labels = [class_indices[cls] for cls in tagset],
        target_names = tagset,
        )

def main():
    '''
    This is the main function
    '''
    feature_extraction_start_time = time.time()
    X_train = [sent2features(s) for s in train_sents]
    y_train = [sent2labels(s) for s in train_sents]

    X_test = [sent2features(s) for s in test_sents]
    y_test = [sent2labels(s) for s in test_sents]

    feature_extraction_elapsed_time = time.time() - feature_extraction_start_time
    print('feature_extraction_elapsed_time: {0:.2f} seconds'.format(feature_extraction_elapsed_time))

    trainer = pycrfsuite.Trainer(verbose=False)

    for xseq, yseq in zip(X_train, y_train):
        trainer.append(xseq, yseq)
        #break

    trainer.set_params({
        'c1': 1.0,   # coefficient for L1 penalty
        'c2': 1e-3,  # coefficient for L2 penalty
        'max_iterations': 50,  # stop earlier

        # include transitions that are possible, but not observed
        'feature.possible_transitions': True
    })

    training_start_time = time.time()
    trainer.train('conll2002-esp.crfsuite')
    training_elapsed_time = time.time() - training_start_time
    print('training_elapsed_time: {0:.2f} seconds'.format(training_elapsed_time))

    print(len(trainer.logparser.iterations))
    print(trainer.logparser.iterations[-1])

    test_start_time = time.time()

    tagger = pycrfsuite.Tagger()
    tagger.open('conll2002-esp.crfsuite')

    y_pred = [tagger.tag(xseq) for xseq in X_test]
    print(bio_classification_report(y_test, y_pred))


    example_sent = test_sents[0]
    print(' '.join(sent2tokens(example_sent)), end='\n\n')

    print("Predicted:", ' '.join(tagger.tag(sent2features(example_sent))))
    print("Correct:  ", ' '.join(sent2labels(example_sent)))

    test_elapsed_training_time = time.time() - test_start_time
    print('test_elapsed_training_time: {0:.2f} seconds'.format(test_elapsed_training_time))


if __name__ == "__main__":
    main()
    #cProfile.run('main()') # if you want to do some profiling

I just don't know how to train it on a second dataset, as trainer.train() resets the parameters of the CRF.

This is not possible with pycrfsuite. One of the two creators of python-crfsuite wrote on https://github.com/scrapinghub/python-crfsuite/issues/12:

Do you want to continue training from the point model was save at? I don't think it is possible with CRFsuite, at least with its public API (which python-crfsuite uses). It may be possible by using some internal functions of CRFsuite, but I haven't tried it. https://github.com/Jekub/Wapiti can do that; it has other limitations though.
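Since a saved model cannot be resumed, one fallback is to give each dataset its own Trainer and its own model file. The following is only a minimal sketch under that assumption and does not continue training in any sense: `train_separate_models` and its `trainer_factory` argument are hypothetical names introduced here, and only `Trainer.append`, `Trainer.set_params` and `Trainer.train` are taken from the pycrfsuite code in the question.

```python
def train_separate_models(datasets, params, trainer_factory=None):
    """Train one independent CRF model per dataset (hypothetical helper).

    datasets: dict mapping a model file path to an (X, y) pair, where X is a
        list of feature sequences and y the matching list of label sequences.
    params: dict of CRFsuite training parameters, e.g. {'c1': 1.0, 'c2': 1e-3}.
    trainer_factory: callable returning a fresh trainer object; defaults to
        pycrfsuite.Trainer(verbose=False).
    """
    if trainer_factory is None:
        # Imported lazily so the helper can also be exercised with a stand-in.
        import pycrfsuite
        trainer_factory = lambda: pycrfsuite.Trainer(verbose=False)

    model_paths = []
    for model_path, (X, y) in datasets.items():
        # A fresh trainer per dataset: nothing is shared between the models.
        trainer = trainer_factory()
        for xseq, yseq in zip(X, y):
            trainer.append(xseq, yseq)
        trainer.set_params(params)
        trainer.train(model_path)
        model_paths.append(model_path)
    return model_paths
```

A call might look like `train_separate_models({'model_a.crfsuite': (X_a, y_a), 'model_b.crfsuite': (X_b, y_b)}, {'c1': 1.0, 'c2': 1e-3})`, where `X_a`, `y_a`, `X_b` and `y_b` are placeholders for features and labels extracted from datasets A and B. Each resulting file can then be opened with its own `pycrfsuite.Tagger`.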

What you actually need is to train the node (unary) classifiers of your CRF. Other CRF packages offer several ways to do this, for example the DGM Library:

  1. Train two node trainers on different datasets:

    CTrainNode *pTrainerA = new CTrainNodeXXX();

    CTrainNode *pTrainerB = new CTrainNodeYYY();

(where XXX and YYY are the names of the probabilistic models you wish to use)

  2. In the testing phase, use the outcomes of trainers A and B jointly or separately:

    cv::Mat potentialA = pTrainerA->getNodePotentials(sample);

    cv::Mat potentialB = pTrainerB->getNodePotentials(sample);

  3. Depending on your task, you can either concatenate the results potentialA and potentialB, use only one of them, or pick the class with the highest potential (probability).
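The options in the last step can be sketched in plain Python. This is an illustration only and does not use the DGM Library: each potential is assumed to be a dict mapping a class label to a non-negative score, and `combine_potentials`, `best_label` and the `mode` values are hypothetical names chosen for this example.

```python
def combine_potentials(potential_a, potential_b, mode="max"):
    """Combine per-class potentials from two independently trained models.

    mode="concat": keep both sets of scores side by side, tagged by source.
    mode="a" / mode="b": use only one model's scores.
    mode="max": for every class, keep the higher of the two scores.
    """
    if mode == "concat":
        combined = {("A", label): p for label, p in potential_a.items()}
        combined.update({("B", label): p for label, p in potential_b.items()})
        return combined
    if mode == "a":
        return dict(potential_a)
    if mode == "b":
        return dict(potential_b)
    if mode == "max":
        labels = set(potential_a) | set(potential_b)
        # A class missing from one model is treated as having score 0.0.
        return {
            label: max(potential_a.get(label, 0.0), potential_b.get(label, 0.0))
            for label in labels
        }
    raise ValueError("unknown mode: %r" % (mode,))


def best_label(potential):
    """Return the key with the highest potential."""
    return max(potential, key=potential.get)
```

For example, `best_label(combine_potentials(potential_a, potential_b, mode="max"))` returns the class that either model scores highest. Whether this is preferable to concatenation or to trusting a single model depends on the task, as noted above.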
