I am working on an NER application where I have data annotated in the following format:
[('The F15 aircraft uses a lot of fuel', {'entities': [(4, 7, 'aircraft')]}),
('did you see the F16 landing?', {'entities': [(16, 19, 'aircraft')]}),
('how many missiles can a F35 carry', {'entities': [(24, 27, 'aircraft')]}),
('is the F15 outdated', {'entities': [(7, 10, 'aircraft')]}),
('does the US still train pilots to dog fight?',{'entities': [(0, 0, 'aircraft')]}),
('how long does it take to train a F16 pilot',{'entities': [(33, 36, 'aircraft')]}),
('how much does a F35 cost', {'entities': [(16, 19, 'aircraft')]})]
Is there a way to convert this to CoNLL format?
Which CoNLL format do you mean?
You can get a simple CoNLL format by doing something like this:
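    import spacy

    data = ...  # your annotated data, in the format shown in the question
    nlp = spacy.blank("en")
    for text, labels in data:
        doc = nlp(text)
        ents = []
        for start, end, label in labels["entities"]:
            # char_span returns None when the offsets do not align with token boundaries
            span = doc.char_span(start, end, label)
            if span is not None:
                ents.append(span)
        doc.ents = ents
        for tok in doc:
            label = tok.ent_iob_
            if tok.ent_iob_ != "O":
                label += "-" + tok.ent_type_
            print(tok, label, sep="\t")
        print()  # blank line between sentences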
There is also a library, spacy_conll, that will do this for you.
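If spaCy is not available, the same character-offset-to-IOB idea can be sketched in plain Python. This is only an illustration, not the spaCy approach above: the function name `to_conll` and the regex tokenizer are assumptions made for this example, and the tokenization will differ from spaCy's. It assumes entity offsets line up with token boundaries and skips empty spans such as `(0, 0)`.

```python
import re


def to_conll(data):
    """Return token<TAB>tag lines, with a blank line after each sentence."""
    lines = []
    for text, annotations in data:
        # Drop empty spans such as (0, 0), which cover no characters.
        spans = [(s, e, lab) for s, e, lab in annotations["entities"] if e > s]
        # Hypothetical tokenizer: runs of word characters, or single punctuation marks.
        for match in re.finditer(r"\w+|[^\w\s]", text):
            tag = "O"
            for start, end, label in spans:
                # A token is tagged only if it lies entirely inside the span.
                if start <= match.start() and match.end() <= end:
                    prefix = "B" if match.start() == start else "I"
                    tag = f"{prefix}-{label}"
                    break
            lines.append(f"{match.group()}\t{tag}")
        lines.append("")  # sentence separator
    return "\n".join(lines)


sample = [
    ("The F15 aircraft uses a lot of fuel", {"entities": [(4, 7, "aircraft")]}),
    ("does the US still train pilots to dog fight?", {"entities": [(0, 0, "aircraft")]}),
]
print(to_conll(sample))
```

Returning a string rather than printing directly makes it easy to write the result to a file or compare it against the spaCy-based output.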