What features could help classify the end of a sentence? Sequence classification
I have pairs of sentences that lack a period and a capitalized letter between them, and I need to segment them from each other. I'm looking for help picking good features to improve the model.
I'm using pycrfsuite to perform sequence classification and find the end of the first sentence, like so:
From the Brown corpus, I join every two sentences together and get their POS tags. Then I label every token in the sentence with 'S' if a space follows it and 'P' if a period follows it. Then I delete the period between the sentences and lowercase the following token. I get something like this:
Input:
data = ['I love Harry Potter.', 'It is my favorite book.']
Output:
sent = [('I', 'PRP'), ('love', 'VBP'), ('Harry', 'NNP'), ('Potter', 'NNP'), ('it', 'PRP'), ('is', 'VBZ'), ('my', 'PRP$'), ('favorite', 'JJ'), ('book', 'NN')]
labels = ['S', 'S', 'S', 'P', 'S', 'S', 'S', 'S', 'S']
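The pairing-and-labeling step described above can be sketched as a small helper. This is a simplified, self-contained version that skips the NLTK POS tagging; `make_example` is a hypothetical name, not a function from the original code:

```python
def make_example(sent_a, sent_b):
    """Join two sentences into one training example: drop the separating
    period, lowercase the first token of the second sentence, and label
    each token 'P' if the deleted period followed it, else 'S'."""
    tokens_a = sent_a.rstrip('.').split()
    tokens_b = sent_b.rstrip('.').split()
    # Lowercase the first token of the second sentence so the model
    # cannot rely on capitalization alone to spot the boundary.
    tokens_b = [tokens_b[0].lower()] + tokens_b[1:]
    tokens = tokens_a + tokens_b
    # Only the last token of the first sentence gets the 'P' label.
    labels = ['S'] * (len(tokens_a) - 1) + ['P'] + ['S'] * len(tokens_b)
    return tokens, labels

tokens, labels = make_example('I love Harry Potter.', 'It is my favorite book.')
```

In the real pipeline the tokens would then be POS-tagged (e.g. with `nltk.pos_tag`) to produce the `(word, postag)` pairs shown above.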
At the moment, I extract these general features:
def word2features2(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    # Common features for all words
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag
    ]

    # Features for words that are not at the beginning of a document
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:word.isdigit=%s' % word1.isdigit(),
            '-1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'beginning of a sentence'
        features.append('BOS')

    # Features for words that are not at the end of a document
    if i < len(sent) - 1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:word.isdigit=%s' % word1.isdigit(),
            '+1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'end of a sentence'
        features.append('EOS')

    return features
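One direction worth exploring (an untested sketch, not part of the original code) is adding features a human would use to spot a boundary: token length and abbreviation-like shape, title case, and whether the next token is a common sentence-initial word. `extra_features` and `SENTENCE_STARTERS` are hypothetical names, and the starter list is illustrative, not exhaustive:

```python
# Common sentence-initial words; an illustrative, non-exhaustive list.
SENTENCE_STARTERS = {'the', 'it', 'he', 'she', 'they', 'this', 'i', 'we', 'but', 'and'}

def extra_features(sent, i):
    """Hypothetical extra features for token i of [(word, postag), ...]."""
    word = sent[i][0]
    feats = [
        'word.len=%d' % len(word),           # very short tokens are often abbreviations
        'word.istitle=%s' % word.istitle(),  # proper-noun-like shape
        'word.endswithdot=%s' % word.endswith('.'),
    ]
    if i < len(sent) - 1:
        nxt = sent[i + 1][0]
        feats.append('+1:starter=%s' % (nxt.lower() in SENTENCE_STARTERS))
    return feats
```

These strings can simply be appended to the list that `word2features2()` already returns.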
And train the CRF with these parameters:
trainer = pycrfsuite.Trainer(verbose=True)

# Submit training data to the trainer
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

# Set the parameters of the model
trainer.set_params({
    'c1': 0.1,               # coefficient for L1 penalty
    'c2': 0.01,              # coefficient for L2 penalty
    'max_iterations': 200,   # maximum number of iterations
    # whether to include transitions that are possible, but not observed
    'feature.possible_transitions': True
})

trainer.train('crf.model')
The accuracy report shows:
              precision    recall  f1-score   support

           S       0.99      1.00      0.99    214627
           P       0.81      0.57      0.67      5734

   micro avg       0.99      0.99      0.99    220361
   macro avg       0.90      0.79      0.83    220361
weighted avg       0.98      0.99      0.98    220361
What are some ways I could edit word2features2() (or any other part) in order to improve the model?
Here is the link to the full code as it is today.
Also, I am just a beginner in NLP, so I would greatly appreciate any overall feedback, links to relevant or helpful sources, and fairly simple explanations. Thank you very much!
Since your classes are very imbalanced due to the nature of the problem, I would suggest using a weighted loss, where the loss for the P tag is given a higher value than that of the S class. I think the problem may be that, with equal weights for both classes, the classifier does not give enough attention to the P tags, since their effect on the loss is very small.
Another thing you could try is hyperparameter tuning; make sure to optimize for the macro F1-score then, since it gives equal weight to both classes regardless of the number of support instances.
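To see why macro averaging rewards improvements on the rare P class, here is a minimal sketch of how per-class F1 and the macro average are computed from true-positive, false-positive and false-negative counts (the counts below are illustrative, not the exact confusion matrix behind the report above):

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from per-class counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative counts: a dominant S class and a rare P class.
_, _, f1_s = prf(tp=214000, fp=770, fn=600)
_, _, f1_p = prf(tp=3270, fp=770, fn=2464)

# Macro F1 averages the per-class scores; each class counts equally,
# no matter how many support instances it has.
macro_f1 = (f1_s + f1_p) / 2
```

By contrast, a support-weighted average would be dominated almost entirely by the S class here, which is why optimizing for it can leave P performance poor.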