
Converting spaCy NER entity format to CoNLL format

I am working on an NER application where I have data annotated in the following format.

[('The F15 aircraft uses a lot of fuel', {'entities': [(4, 7, 'aircraft')]}),
 ('did you see the F16 landing?', {'entities': [(16, 19, 'aircraft')]}),
 ('how many missiles can a F35 carry', {'entities': [(24, 27, 'aircraft')]}),
 ('is the F15 outdated', {'entities': [(7, 10, 'aircraft')]}),
 ('does the US still train pilots to dog fight?',{'entities': [(0, 0, 'aircraft')]}),
 ('how long does it take to train a F16 pilot',{'entities': [(33, 36, 'aircraft')]}),
 ('how much does a F35 cost', {'entities': [(16, 19, 'aircraft')]})]

Is there a way to convert this to CoNLL format?
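
For readers unfamiliar with this annotation format: each (start, end, label) triple gives character offsets into the text, with end exclusive, so text[start:end] is the annotated string. The sketch below prints the covered text for two rows taken from the data above. check_offsets is a hypothetical helper name, not part of spaCy. A check like this makes it easy to spot rows such as (0, 0, 'aircraft'), which cover no text at all.

```python
def check_offsets(data):
    """Return (text, start, end, label, covered_text) for every annotated span."""
    rows = []
    for text, annotations in data:
        for start, end, label in annotations["entities"]:
            # Slicing with the character offsets yields the annotated substring.
            rows.append((text, start, end, label, text[start:end]))
    return rows


# Two rows copied from the data in the question.
sample = [
    ("The F15 aircraft uses a lot of fuel", {"entities": [(4, 7, "aircraft")]}),
    ("does the US still train pilots to dog fight?", {"entities": [(0, 0, "aircraft")]}),
]

for text, start, end, label, covered in check_offsets(sample):
    note = "" if covered else "  <- empty span, marks no text"
    print(f"{label}\t{start}-{end}\t{covered!r}{note}")
```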

Which CoNLL format do you mean?

You can get a simple CoNLL format by doing something like this:

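import spacy

# Replace with your own data, in the format shown in the question.
data = [
    ("The F15 aircraft uses a lot of fuel", {"entities": [(4, 7, "aircraft")]}),
]

# A blank English pipeline is enough here: only the tokenizer is needed.
nlp = spacy.blank("en")

for text, labels in data:
    doc = nlp(text)
    ents = []
    for start, end, label in labels["entities"]:
        span = doc.char_span(start, end, label=label)
        # char_span returns None when the offsets do not line up with
        # token boundaries; skip those spans so that doc.ents stays valid.
        if span is not None:
            ents.append(span)
    doc.ents = ents
    for tok in doc:
        # ent_iob_ is "B", "I" or "O"; append the entity type for B and I.
        label = tok.ent_iob_
        if tok.ent_iob_ != "O":
            label += "-" + tok.ent_type_
        print(tok, label, sep="\t")
    print()  # blank line between sentences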
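
The same idea can also be sketched without spaCy, which may help when checking the output by hand. This is a stand-in under stated assumptions, not the method above: it tokenizes with a simple regular expression instead of spaCy's tokenizer, so token boundaries can differ, and to_conll_lines is a hypothetical helper name. A token is tagged only if it lies entirely inside an annotated span.

```python
import re


def to_conll_lines(text, entities):
    """Return 'token<TAB>tag' lines for one sentence, using BIO tags."""
    lines = []
    opened = set()  # indices of spans that have already received a B- tag
    # Assumption: a token is a run of word characters or one punctuation mark.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for i, (start, end, label) in enumerate(entities):
            # Tag the token only if it lies fully inside the annotated span.
            if start <= match.start() and match.end() <= end:
                tag = ("I-" if i in opened else "B-") + label
                opened.add(i)
                break
        lines.append(f"{match.group()}\t{tag}")
    return lines


# Two rows copied from the data in the question.
sample = [
    ("The F15 aircraft uses a lot of fuel", {"entities": [(4, 7, "aircraft")]}),
    ("did you see the F16 landing?", {"entities": [(16, 19, "aircraft")]}),
]

for text, annotations in sample:
    print("\n".join(to_conll_lines(text, annotations["entities"])))
    print()  # blank line between sentences
```

Zero-length spans such as (0, 0, 'aircraft') contain no token, so every token in that sentence is tagged O.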

There is also a library, spacy_conll, that will do this for you.

