
How to import text from CoNLL format with named entities into spaCy, infer entities with my model and write them to the same dataset (with Python)?

I have a dataset in CoNLL NER format, which is basically a TSV file with two fields. The first field contains tokens from some text, one token per line (each punctuation symbol is also considered a token there), and the second field contains named entity tags for the tokens in BIO format.

I would like to load this dataset into spaCy, infer new named entity tags for the text with my model, and write these tags into the same TSV file as a new third column. All I know is that I can infer named entities with something like this:

nlp = spacy.load("some_spacy_ner_model")
text = "text from conll dataset"
doc = nlp(text)
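
The predicted entities are then available on the doc, e.g. via doc.ents or the per-token ent_iob_ / ent_type_ attributes, which is roughly the shape the third TSV column needs:

# entity spans predicted by the model
for ent in doc.ents:
    print(ent.text, ent.label_)

# per-token view; ent_type_ is empty for "O" tokens
for token in doc:
    print(token.text, token.ent_iob_, token.ent_type_)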

I also managed to convert the CoNLL dataset into spaCy's JSON format with this CLI command:

python -m spacy convert conll_dataset.tsv /Users/user/docs -t json -c ner

But I don't know where to go from here. I could not find how to load this JSON file into a spaCy Doc. I tried this piece of code (found in spaCy's documentation):

from spacy.tokens import Doc
from spacy.vocab import Vocab
doc = Doc(Vocab()).from_disk("sample.json")

but it throws an error saying ExtraData: unpack(b) received extra data.

I also don't know how to write the named entities from the doc object back into the same TSV file, aligning tokens and NER tags with the existing lines of the file.

Here's an extract from the TSV file as an example of the data I am dealing with:

The O
epidermal   B-Protein
growth  I-Protein
factor  I-Protein
precursor   O
.   O

There is a bit of a gap in the spaCy API here, since this format is usually only used for training models. It's possible, but it's not obvious. You have to load the corpus as it would be loaded for training, as a GoldCorpus, which gives you tokenized but otherwise unannotated Docs along with GoldParses that hold the annotation in a raw format.

Then you can convert the raw GoldParse annotations to the right format and add them to the Doc by hand. Here's a sketch for entities:

import spacy
from spacy.gold import GoldCorpus
nlp = spacy.load('en')
gc = GoldCorpus("file.json", "file.json")
for doc, gold in gc.dev_docs(nlp, gold_preproc=True):
    doc.ents = spacy.gold.spans_from_biluo_tags(doc, gold.ner)
    spacy.displacy.serve(doc, style='ent')

dev_docs() is used here because it loads the docs without any further shuffling, augmenting, etc., as might happen for training, and it loads the file passed as the second argument to GoldCorpus. GoldCorpus requires a training file and a dev file, so the first argument is necessary, but we're not doing anything further with the data loaded from it.

For now, use spacy 2.1.8 for this, since there's a bug with the gold_preproc option in 2.2.1. gold_preproc preserves your original tokenization rather than retokenizing with spaCy. If you don't care about preserving the tokenization, you can set gold_preproc=False, and then spaCy's provided models will work slightly better because the tokenization is identical to what they expect.
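
To then get the model's predictions into a third TSV column, here's a rough sketch of one way to do it (untested; the output filename conll_dataset_with_preds.tsv is just a placeholder). Unlike the sketch above, which attaches the gold tags to the doc, this runs only the NER component on each pre-tokenized doc and rebuilds per-token BIO tags from token.ent_iob_ / token.ent_type_. It assumes gold_preproc=True, so that the doc tokens line up one-to-one with the non-empty lines of the original TSV in file order, and that every non-empty line contains exactly one token:

import spacy
from spacy.gold import GoldCorpus

nlp = spacy.load("some_spacy_ner_model")
gc = GoldCorpus("file.json", "file.json")
ner = nlp.get_pipe("ner")

# One predicted tag per original token, in file order (dev_docs doesn't shuffle).
predicted = []
for doc, gold in gc.dev_docs(nlp, gold_preproc=True):
    doc = ner(doc)  # run only the NER component on the pre-tokenized doc
    for token in doc:
        if token.ent_type_:
            predicted.append(token.ent_iob_ + "-" + token.ent_type_)
        else:
            predicted.append("O")

# Append the predictions as a third column; blank lines (sentence breaks) are kept as-is.
with open("conll_dataset.tsv") as tsv_in, open("conll_dataset_with_preds.tsv", "w") as tsv_out:
    tags = iter(predicted)
    for line in tsv_in:
        if line.strip():
            tsv_out.write(line.rstrip("\n") + "\t" + next(tags) + "\n")
        else:
            tsv_out.write(line)

If the tokenization or the one-to-one line alignment doesn't hold for your data (e.g. -DOCSTART- lines or comments in the TSV), you'd need a proper alignment step instead of this simple zip.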
