Numeric conversion of textual features in crfsuite

I was looking at the example code in the crfsuite-python docs, and it defines features with the following code.

def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag,
        'postag[:2]=' + postag[:2],
    ]
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:postag=' + postag1,
            '-1:postag[:2]=' + postag1[:2],
        ])
    else:
        features.append('BOS')

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:postag=' + postag1,
            '+1:postag[:2]=' + postag1[:2],
        ])
    else:
        features.append('EOS')

    return features

I understand that features such as isupper() can be either 0 or 1, but features such as word[-2:] are character strings. How are they converted to numeric values?

A CRF trains on sequences of input data to learn transitions from one state (label) to another. To enable this, we need to define features that take the different transitions into account. The word2features() function above turns each word into a list of feature strings describing the following attributes:

lowercased form of the word
suffix containing the last 3 characters
suffix containing the last 2 characters
flags for upper case, title case and digits, plus the POS tag
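
As a small illustration of the attributes listed above, the following sketch builds only the current-word features for a single token. The helper name `token_features` and the sample word are made up for this example; the feature strings follow the same pattern as word2features().

```python
def token_features(word, postag):
    """Current-word features only, mirroring the first block of word2features()."""
    return [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag,
        'postag[:2]=' + postag[:2],
    ]


# Hypothetical (word, POS tag) pair, not taken from any real corpus.
feats = token_features('London', 'NNP')
for f in feats:
    print(f)
```

Every entry is a plain string such as `word[-2:]=on` or `word.istitle=True`; nothing has been turned into a number at this stage.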

We also attach attributes of the previous and next words and their tags, and mark the beginning of sentence (BOS) or end of sentence (EOS) when there is no previous or next word.
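
On the original question of how a string feature such as `word[-2:]=on` becomes a number: as far as I understand CRFsuite, the whole string is treated as the name of a binary attribute, and listing it for a token gives that attribute the value 1.0 (attributes that are not listed are implicitly 0). The sketch below imitates that idea in plain Python. `build_vocab` and `encode` are hypothetical helpers written for illustration, not part of the crfsuite API, and the two tokens are made up.

```python
def build_vocab(feature_lists):
    """Assign an integer id to every distinct feature string, in first-seen order."""
    vocab = {}
    for feats in feature_lists:
        for f in feats:
            vocab.setdefault(f, len(vocab))
    return vocab


def encode(feats, vocab):
    """Sparse indicator encoding: {feature id: 1.0} for each known feature string."""
    return {vocab[f]: 1.0 for f in feats if f in vocab}


# Two made-up tokens, each described by a list of string features.
tokens = [
    ['bias', 'word[-2:]=on', 'word.isupper=False', 'BOS'],
    ['bias', 'word[-2:]=is', 'word.isupper=False', 'EOS'],
]
vocab = build_vocab(tokens)
encoded = [encode(t, vocab) for t in tokens]
print(vocab)
print(encoded)
```

Under this view, `word[-2:]=on` and `word[-2:]=is` are simply two different binary attributes. The same holds for `word.isupper=False` when it is written as a string: it is its own on/off attribute rather than a 0/1 value of a single isupper feature.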
