sklearn_crfsuite.CRF UnicodeEncodeError
I am trying to train a Chinese NER model with sklearn_crfsuite.CRF and ner_dataset. After cleaning the dataset and fitting the model, I get the following error message:
60loading training data to CRFsuite: 0%| | 0/700 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "main_script.py", line 22, in <module>
    crf_pipeline.model.fit(x_train, y_train, x_test, y_test)
  File "C:\Users\weber\PycharmProjects\demo-insurance-backend\venv\lib\site-packages\sklearn_crfsuite\estimator.py", line 314, in fit
    trainer.append(xseq, yseq)
  File "pycrfsuite\_pycrfsuite.pyx", line 312, in pycrfsuite._pycrfsuite.BaseTrainer.append
  File "stringsource", line 48, in vector.from_py.__pyx_convert_vector_from_py_std_3a__3a_string
  File "stringsource", line 15, in string.from_py.__pyx_convert_string_from_py_std__in_string
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-6: ordinal not in range(128)
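The data is a .txt file with one record per line (records are separated by \n). In each record, originalText stores the text and entities stores the entity annotations.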
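To make the format concrete, here is a minimal sketch of how one such line can be parsed. The sample record is hypothetical. Only the key names (originalText, entities, label_type, start_pos, end_pos) come from the code in this post, and end_pos is assumed to be exclusive, as the range() call in build_tag_seq implies.

```python
import ast

# Hypothetical sample line; only the key names are taken from the post.
sample_line = (
    "{'originalText': '胃癌手術後', "
    "'entities': [{'label_type': '手術', 'start_pos': 2, 'end_pos': 4}]}"
)

# Each line is a Python dict literal, so ast.literal_eval can parse it safely.
record = ast.literal_eval(sample_line)
text = record['originalText']
entities = record['entities']

# Slice out the surface text of every entity (end_pos assumed exclusive).
spans = [
    (entity['label_type'], text[entity['start_pos']:entity['end_pos']])
    for entity in entities
]
```

Each character of text is later treated as one token, so the tag sequence has the same length as the text.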
Below is my code for preprocessing the dataset:
import ast
from opencc import OpenCC
import sklearn_crfsuite
from sklearn.model_selection import train_test_split
from tqdm import tqdm
tag_dictionary = {
    '影像檢查': 'I-影像檢查',
    '手術': 'S-手術',
    '實驗室檢驗': 'E-實驗室檢驗',
    '解剖部位': 'B-解剖部位',
    '疾病和診斷': 'D-疾病和診斷'
}
def check_entity(entities):
    return [
        entity
        for entity in entities
        if entity['label_type'] in tag_dictionary
    ]
def build_tag_seq(text, entities):
    tag_list = ['O' for token in text]
    for entity in entities:
        if tag_dictionary is None:
            tag = entity['label_type']
        else:
            tag = tag_dictionary[entity['label_type']]
        tag_list[entity['start_pos']] = f'{tag}-B'
        for i in range(entity['start_pos']+1, entity['end_pos']):
            tag_list[i] = f'{tag}-I'
    return tag_list
def data_coverter(data):
    cc = OpenCC('s2t')  # convert Simplified to Traditional Chinese
    data_dict = ast.literal_eval(cc.convert(data))  # text line -> dict
    return data_dict
def process_data(data):
    data_dict = data_coverter(data)
    text = data_dict['originalText']
    entities = data_dict['entities']
    entities = check_entity(entities)
    tag_seq = build_tag_seq(text, entities)
    return text, tag_seq
def load_txt_data(stop=-1):
    data_x = list()  # text (token sequences)
    data_y = list()  # tag sequence matching each token
    for path in ['subtask1_training_part1.txt']:
        with open(path, 'r', encoding='utf-8') as f:
            for i, line in tqdm(enumerate(f.readlines())):
                text = line.strip()
                if len(text) > 3:
                    temp_x, temp_y = process_data(text)
                    data_x.append(temp_x)
                    data_y.append(temp_y)
                if i == stop:
                    break
    return data_x, data_y
x, y = load_txt_data()
model = sklearn_crfsuite.CRF(
    algorithm='l2sgd',
    c2=1.0,
    max_iterations=1000,
    all_possible_transitions=True,
    all_possible_states=True,
    verbose=True
)
model.fit(x, y)
Here is the list of packages I use:
pip install opencc sklearn sklearn_crfsuite
Has anyone run into a similar error message and solved it? Any help would be greatly appreciated.
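I found that I cannot use Chinese characters in the NER tags. This is consistent with the traceback, where the failing frame converts a Python string to a C++ string using the 'ascii' codec. After changing the values in tag_dictionary to int, it worked.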
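A minimal sketch of that change, under some assumptions: the answer above used int values, while the short ASCII codes below (IMG, SUR, and so on) are hypothetical stand-ins that should avoid the error for the same reason, since they contain only ASCII characters. find_non_ascii_labels is also a hypothetical helper, not part of sklearn_crfsuite. It only checks the tag sequences before they are passed to fit.

```python
# Hypothetical ASCII-only replacement for the values of tag_dictionary.
# The keys stay in Chinese because they are only used for lookup in Python.
tag_dictionary = {
    '影像檢查': 'IMG',
    '手術': 'SUR',
    '實驗室檢驗': 'LAB',
    '解剖部位': 'ANA',
    '疾病和診斷': 'DIS',
}


def find_non_ascii_labels(tag_sequences):
    """Return the set of labels that contain non-ASCII characters."""
    return {
        tag
        for seq in tag_sequences
        for tag in seq
        if not tag.isascii()  # str.isascii() requires Python 3.7+
    }


# Tags as the original tag_dictionary would produce them.
old_y = [['O', 'S-手術-B', 'S-手術-I']]
# Tags as the ASCII-only tag_dictionary would produce them.
new_y = [['O', 'SUR-B', 'SUR-I']]

bad = find_non_ascii_labels(old_y)  # non-empty: these labels trigger the error
ok = find_non_ascii_labels(new_y)   # empty: safe to pass to model.fit
```

If the Chinese label names are needed in the output, the predicted ASCII tags can be mapped back with an inverted dictionary after prediction.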