
How to prepare a training corpus for a CRF model using CRFsuite

I need data in the following format:

(u'Melbourne', u'NP', u'B-LOC'),
 (u'(', u'Fpa', u'O'),
 (u'Australia', u'NP', u'B-LOC'),
 (u')', u'Fpt', u'O'),
 (u',', u'Fc', u'O'),

All I have is a plain text file, and I need this data to train a CRF model for an NER task. I'm planning to use CRFsuite for Python, but I can't quite understand how to label the training data. I can POS-tag it, but how do I add the named entities? I need to label the training data with 2 custom labels.

If you want to train a CRF model, you need annotated data. For some tasks you can rely on existing corpora, but if your task is new you'll have to annotate the entities yourself. There are tools that can help, e.g. take a look at http://brat.nlplab.org/ . GATE also has a built-in annotation tool.

POS tags are often used as features, but they are not strictly required (and you should use many other features as well).
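As a rough illustration of using POS tags alongside other features, here is a minimal sketch that turns a sentence of (token, POS, label) triples, like the ones in the question, into per-token feature dicts plus a label sequence. The function names and the particular feature set are assumptions made for this example, not something CRFsuite prescribes.

```python
def token_features(sent, i):
    """Build a feature dict for the i-th token of a (word, pos, label) sentence."""
    word, pos = sent[i][0], sent[i][1]
    feats = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isupper": word.isupper(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
        "pos": pos,  # the POS tag is just one feature among several
    }
    if i > 0:
        # context from the previous token
        feats["-1:word.lower"] = sent[i - 1][0].lower()
        feats["-1:pos"] = sent[i - 1][1]
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        # context from the next token
        feats["+1:word.lower"] = sent[i + 1][0].lower()
        feats["+1:pos"] = sent[i + 1][1]
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sent_to_features(sent):
    return [token_features(sent, i) for i in range(len(sent))]


def sent_to_labels(sent):
    return [label for _, _, label in sent]


sent = [
    ("Melbourne", "NP", "B-LOC"),
    ("(", "Fpa", "O"),
    ("Australia", "NP", "B-LOC"),
    (")", "Fpt", "O"),
    (",", "Fc", "O"),
]
X = sent_to_features(sent)
y = sent_to_labels(sent)
```

One feature list and one label list per sentence is, as far as I know, the shape that the Python CRFsuite wrappers take for training (for example `pycrfsuite.Trainer.append(xseq, yseq)`), but check the documentation of the wrapper you actually use.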

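If you want to create your own training data with different entities (not just Location or Person entities), you can refer to my answer to: Is it possible to train Stanford NER system to recognize more named entities types?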

Brat is an excellent way to annotate your new dataset. After annotating, you need to convert the standoff format that Brat outputs into the format that Stanford NER accepts.
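The conversion could look roughly like the sketch below. It assumes the usual Brat text-bound annotation lines (`ID<TAB>TYPE START END<TAB>TEXT`), uses a simple regex tokenizer chosen only for this example, and skips discontinuous spans. The helper names `parse_ann` and `to_bio` are hypothetical, and the `LOC` label and sample text are made up for illustration.

```python
import re


def parse_ann(ann_text):
    """Return sorted (start, end, label) tuples for contiguous text-bound annotations."""
    spans = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, events, attributes, notes
        _id, info, _text = line.split("\t", 2)
        if ";" in info:
            continue  # discontinuous spans are ignored in this sketch
        label, start, end = info.split()
        spans.append((int(start), int(end), label))
    return sorted(spans)


def to_bio(text, spans):
    """Tokenize text and give each token a BIO tag based on character offsets."""
    rows = []
    # assumed tokenizer: runs of word characters, or single punctuation marks
    for m in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, label in spans:
            if m.start() >= start and m.end() <= end:
                prefix = "B-" if m.start() == start else "I-"
                tag = prefix + label
                break
        rows.append((m.group(), tag))
    return rows


text = "Melbourne (Australia), New York"
ann = (
    "T1\tLOC 0 9\tMelbourne\n"
    "T2\tLOC 11 20\tAustralia\n"
    "T3\tLOC 23 31\tNew York\n"
)
rows = to_bio(text, parse_ann(ann))
for token, tag in rows:
    print(f"{token}\t{tag}")
```

The resulting (token, tag) pairs can be written out one per line, tab-separated, for Stanford NER, or combined with POS tags to get the triples shown in the question. Tokens that only partly overlap an annotated span are tagged `O` here, so the tokenizer needs to agree with the annotation boundaries.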


 