How to prepare training corpus for CRF model using CRFSuite

Question

I need data in the following format:

(u'Melbourne', u'NP', u'B-LOC'),
 (u'(', u'Fpa', u'O'),
 (u'Australia', u'NP', u'B-LOC'),
 (u')', u'Fpt', u'O'),
 (u',', u'Fc', u'O'),

What I have is just a txt file. I need this data for a CRF model for an NER task. I'm planning to use CRFsuite for Python, but I can't quite understand how to label the training data. I can POS-tag it, but how do I add named entities? I need to label the training data with 2 custom labels.

Answer 1

If you want to train a CRF model then you need annotated data. For some tasks it is possible to rely on existing corpora, but if your task is new then you'll have to annotate the entities yourself. There are tools which can help, e.g. take a look at http://brat.nlplab.org/ . GATE also has a built-in annotation tool.

POS tags are often used as features, but they are not strictly required (and you should use many other features as well).
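To illustrate how POS tags sit alongside other features, here is a minimal sketch that turns one sentence of (token, POS, label) triples, in the format from the question, into per-token feature dicts and a label sequence. The function names (`word2features`, `sent2features`, `sent2labels`) and the particular feature set are illustrative assumptions, not something prescribed by CRFsuite.

```python
def word2features(sent, i):
    """Build a feature dict for the token at position i of one sentence.

    `sent` is a list of (token, pos_tag, label) triples, as in the question.
    The feature names below are arbitrary; pick whatever suits the task.
    """
    word, pos, _ = sent[i]
    features = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isupper": word.isupper(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
        "pos": pos,  # the POS tag is just one feature among several
    }
    if i > 0:
        prev_word, prev_pos, _ = sent[i - 1]
        features["-1:word.lower"] = prev_word.lower()
        features["-1:pos"] = prev_pos
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        next_word, next_pos, _ = sent[i + 1]
        features["+1:word.lower"] = next_word.lower()
        features["+1:pos"] = next_pos
    else:
        features["EOS"] = True  # end of sentence
    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]


def sent2labels(sent):
    return [label for _, _, label in sent]


sent = [
    ("Melbourne", "NP", "B-LOC"),
    ("(", "Fpa", "O"),
    ("Australia", "NP", "B-LOC"),
    (")", "Fpt", "O"),
    (",", "Fc", "O"),
]
X = sent2features(sent)
y = sent2labels(sent)
```

With python-crfsuite, each such pair of sequences would typically be handed to a `pycrfsuite.Trainer` via `trainer.append(X, y)` before calling `trainer.train(...)`; check the library documentation for the feature value types it accepts.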

Answer 2

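If you want to create your own training data with different entities (not just Location or Person entities), you can refer to my answer to: Is it possible to train the Stanford NER system to recognize more named entity types?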

Answer 3

Brat is an excellent way to annotate your new dataset. After annotating it, you need to convert the Standoff format that Brat outputs into the format that Stanford NER accepts.
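A rough sketch of such a conversion is shown below. It assumes the usual brat text-bound annotation lines (`ID<TAB>TYPE START END<TAB>TEXT`), uses a simple regex tokenizer, and skips discontinuous spans; the helper names `parse_brat_ann` and `to_bio` and the sample text are made up for illustration. The result is a list of (token, BIO label) pairs, which can be written out one token per line for a sequence tagger.

```python
import re


def parse_brat_ann(ann_text):
    """Return sorted (start, end, label) spans from brat text-bound annotations."""
    spans = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, events, attributes, notes
        _id, type_and_offsets, _text = line.split("\t", 2)
        if ";" in type_and_offsets:
            continue  # discontinuous spans are ignored in this sketch
        label, start, end = type_and_offsets.split()
        spans.append((int(start), int(end), label))
    return sorted(spans)


def to_bio(text, spans):
    """Tokenize `text` with a simple regex and give each token a BIO tag."""
    tagged = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, label in spans:
            # A token is tagged only if it lies fully inside an annotated span.
            if start <= match.start() and match.end() <= end:
                tag = ("B-" if match.start() == start else "I-") + label
                break
        tagged.append((match.group(), tag))
    return tagged


# Hypothetical document text and the matching .ann content.
text = "Melbourne (Australia), New Zealand"
ann = (
    "T1\tLOC 0 9\tMelbourne\n"
    "T2\tLOC 11 20\tAustralia\n"
    "T3\tLOC 23 34\tNew Zealand\n"
)
tagged = to_bio(text, parse_brat_ann(ann))

# One "token<TAB>label" line per token, a common layout for NER training files.
lines = ["{}\t{}".format(token, tag) for token, tag in tagged]
```

The same (token, label) pairs can be combined with POS tags to get the triples from the question if the target is CRFsuite instead of Stanford NER. A real corpus would also need sentence splitting and a tokenizer whose boundaries line up with the annotated offsets.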

3 answers

solution1
2 2016-12-05 13:32:03

solution2
1 2016-12-13 11:21:46

solution3
1 2017-07-28 20:15:16
