What should be the format of training data for the Stanford NER CRF Classifier?
I am trying to train my own address classifier model using Stanford CRF-NER, but the performance is very low. I am confused about the format of the training data I have been using. The training data is essentially a list of districts, cities, and provinces with their respective labels, but the model is not assigning the correct address tags to the tokens.
The format of the training data is as below:
This is just a sample of the training data in CSV format. There are three labels: PROVINCE, REGENCY, and DISTRICT.
Here is the output of the tagged tokens:
As you can see, all tokens have been tagged as DISTRICT, even though my labelled data contains REGENCY, DISTRICT, and PROVINCE.
I wanted to know whether my training data format is correct, or whether Stanford NER only works on contextual data at the sentence level, since I have seen it work well on sentences.
Since you're trying to build an address classifier, I would suggest training your model on actual (tagged) addresses rather than on a dictionary consisting of a list of regencies, districts, and provinces. The CRF will then be able to take contextual information into account when tagging, depending on the features you've configured.
You use CoNLL-style data to train a CRF:
-DOCSTART- O
5461 O
North O
Ave O
Miami DISTRICT
Florida PROVINCE
88754 O
8888 O
South O
Drive O
Miami DISTRICT
Florida PROVINCE
99965 O
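If your tagged addresses live in some other structure, a small script can produce this layout. The following is a minimal Python sketch, assuming the trainer is configured to read one token and one label per line separated by a tab, with a blank line between addresses. The function name `to_training_lines` and the in-memory list are hypothetical, and the sample reuses the two addresses above.

```python
from typing import Iterable, List, Tuple

# One tagged token: (token text, label). "O" marks tokens outside any entity.
TaggedToken = Tuple[str, str]


def to_training_lines(addresses: Iterable[List[TaggedToken]]) -> List[str]:
    """Render tagged addresses as token<TAB>label lines.

    Assumption: a blank line is used to separate one address from the next.
    """
    lines: List[str] = []
    for address in addresses:
        for token, label in address:
            lines.append(f"{token}\t{label}")
        lines.append("")  # blank line closes the current address
    return lines


# The two example addresses from the answer, as (token, label) pairs.
addresses = [
    [("5461", "O"), ("North", "O"), ("Ave", "O"),
     ("Miami", "DISTRICT"), ("Florida", "PROVINCE"), ("88754", "O")],
    [("8888", "O"), ("South", "O"), ("Drive", "O"),
     ("Miami", "DISTRICT"), ("Florida", "PROVINCE"), ("99965", "O")],
]

training_text = "\n".join(to_training_lines(addresses))
print(training_text)
```

The point of keeping whole addresses together is that the CRF sees each labelled token next to its neighbours (street words, postal codes), which a bare list of place names cannot provide.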
A more appropriate use of the list of districts and provinces would be as a gazette (gazetteer).
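As a rough illustration of the gazette idea, the Python sketch below turns a label-to-names dictionary into gazette lines and assembles a training properties text. It assumes the gazette file holds one entry per line as a class name followed by whitespace and the entity, and that the `useGazettes`, `gazette`, and `sloppyGazette` properties behave as described in the Stanford NER documentation; please verify both against the version you use. The function names and file paths are hypothetical.

```python
from typing import Dict, List


def gazette_lines(entries: Dict[str, List[str]]) -> List[str]:
    """Build gazette lines of the form 'LABEL entity name'."""
    lines: List[str] = []
    for label, names in entries.items():
        for name in names:
            lines.append(f"{label} {name}")
    return lines


def training_properties(train_file: str, model_file: str, gazette_file: str) -> str:
    """Assemble a small properties text that switches on gazette features.

    Assumption: these property names match your Stanford NER version.
    """
    props = {
        "trainFile": train_file,
        "serializeTo": model_file,
        "map": "word=0,answer=1",  # column 0 is the token, column 1 the label
        "useClassFeature": "true",
        "useWord": "true",
        "usePrev": "true",
        "useNext": "true",
        "useGazettes": "true",
        "gazette": gazette_file,
        "sloppyGazette": "true",  # fire when any token of an entry matches
    }
    return "\n".join(f"{key} = {value}" for key, value in props.items())


# Place names taken from the example addresses; paths are placeholders.
entries = {"DISTRICT": ["Miami"], "PROVINCE": ["Florida"]}
print("\n".join(gazette_lines(entries)))
print(training_properties("addresses.tsv", "address-model.ser.gz", "places.gaz.txt"))
```

With this split, the place-name list becomes an extra feature for the CRF, while the labelled addresses remain the actual training data.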