What features could help classify the end of a sentence? Sequence classification
I have pairs of sentences that lack a period and a capitalized letter between them, and I need to segment them from each other. I'm looking for help picking good features to improve the model.
I'm using pycrfsuite to perform sequence classification and find the end of the first sentence, like so:
From the Brown corpus, I join every two sentences together and get their POS tags. Then I label every token in the sentence with 'S' if a space follows it and 'P' if a period follows it. Then I delete the period between the sentences and lowercase the following token. I get something like this:
Input:
data = ['I love Harry Potter.', 'It is my favorite book.']
Output:
sent = [('I', 'PRP'), ('love', 'VBP'), ('Harry', 'NNP'), ('Potter', 'NNP'), ('it', 'PRP'), ('is', 'VBZ'), ('my', 'PRP$'), ('favorite', 'JJ'), ('book', 'NN')]
labels = ['S', 'S', 'S', 'P', 'S', 'S', 'S', 'S', 'S']
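The pairing-and-labeling step described above can be sketched as a small helper. This is a simplified, self-contained version that skips the NLTK POS tagging; `make_example` is a hypothetical name, not a function from the original code:

```python
def make_example(sent_a, sent_b):
    """Join two sentences into one training example: drop the separating
    period, lowercase the first token of the second sentence, and label
    each token 'P' if the deleted period followed it, else 'S'."""
    tokens_a = sent_a.rstrip('.').split()
    tokens_b = sent_b.rstrip('.').split()
    # Lowercase the first token of the second sentence so the model
    # cannot rely on capitalization alone to spot the boundary.
    tokens_b = [tokens_b[0].lower()] + tokens_b[1:]
    tokens = tokens_a + tokens_b
    # Only the last token of the first sentence gets the 'P' label.
    labels = ['S'] * (len(tokens_a) - 1) + ['P'] + ['S'] * len(tokens_b)
    return tokens, labels

tokens, labels = make_example('I love Harry Potter.', 'It is my favorite book.')
```

In the real pipeline the tokens would then be POS-tagged (e.g. with `nltk.pos_tag`) to produce the `(word, postag)` pairs shown above.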
At the moment, I extract these general features:
def word2features2(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    # Common features for all words
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag
    ]

    # Features for words that are not at the beginning of a document
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:word.isdigit=%s' % word1.isdigit(),
            '-1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'beginning of a sentence'
        features.append('BOS')

    # Features for words that are not at the end of a document
    if i < len(sent) - 1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:word.isdigit=%s' % word1.isdigit(),
            '+1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'end of a sentence'
        features.append('EOS')

    return features
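One direction worth exploring (an untested sketch, not part of the original code) is adding features a human would use to spot a boundary: token length and abbreviation-like shape, title case, and whether the next token is a common sentence-initial word. `extra_features` and `SENTENCE_STARTERS` are hypothetical names, and the starter list is illustrative, not exhaustive:

```python
# Common sentence-initial words; an illustrative, non-exhaustive list.
SENTENCE_STARTERS = {'the', 'it', 'he', 'she', 'they', 'this', 'i', 'we', 'but', 'and'}

def extra_features(sent, i):
    """Hypothetical extra features for token i of [(word, postag), ...]."""
    word = sent[i][0]
    feats = [
        'word.len=%d' % len(word),           # very short tokens are often abbreviations
        'word.istitle=%s' % word.istitle(),  # proper-noun-like shape
        'word.endswithdot=%s' % word.endswith('.'),
    ]
    if i < len(sent) - 1:
        nxt = sent[i + 1][0]
        feats.append('+1:starter=%s' % (nxt.lower() in SENTENCE_STARTERS))
    return feats
```

These strings can simply be appended to the list that `word2features2()` already returns.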
And train the CRF with these parameters:
trainer = pycrfsuite.Trainer(verbose=True)

# Submit training data to the trainer
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

# Set the parameters of the model
trainer.set_params({
    'c1': 0.1,               # coefficient for L1 penalty
    'c2': 0.01,              # coefficient for L2 penalty
    'max_iterations': 200,   # maximum number of iterations
    # whether to include transitions that are possible, but not observed
    'feature.possible_transitions': True
})

trainer.train('crf.model')
The accuracy report shows:
              precision    recall  f1-score   support

           S       0.99      1.00      0.99    214627
           P       0.81      0.57      0.67      5734

   micro avg       0.99      0.99      0.99    220361
   macro avg       0.90      0.79      0.83    220361
weighted avg       0.98      0.99      0.98    220361
What are some ways I could edit word2features2() (or any other part) in order to improve the model?
Here is the link to the full code as it is today.
Also, I am just a beginner in NLP, so I would greatly appreciate any overall feedback, links to relevant or helpful sources, and fairly simple explanations. Thank you very much!
Since your classes are very imbalanced due to the nature of the problem, I would suggest using a weighted loss, where the loss for the P tag is given a higher value than that of the S class. I think the problem may be that, with equal weights for both classes, the classifier does not give enough attention to the P tags, since their effect on the loss is very small.
Another thing you could try is hyperparameter tuning; make sure to optimize for the macro F1-score then, since it gives equal weight to both classes regardless of the number of support instances.
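To see why macro averaging rewards improvements on the rare P class, here is a minimal sketch of how per-class F1 and the macro average are computed from true-positive, false-positive and false-negative counts (the counts below are illustrative, not the exact confusion matrix behind the report above):

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from per-class counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative counts: a dominant S class and a rare P class.
_, _, f1_s = prf(tp=214000, fp=770, fn=600)
_, _, f1_p = prf(tp=3270, fp=770, fn=2464)

# Macro F1 averages the per-class scores; each class counts equally,
# no matter how many support instances it has.
macro_f1 = (f1_s + f1_p) / 2
```

By contrast, a support-weighted average would be dominated almost entirely by the S class here, which is why optimizing for it can leave P performance poor.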