Convert spaCy `Doc` into CoNLL 2003 sample
I plan to train a Spark NLP custom NER model, which uses the CoNLL 2003 format for training (that blog even provides some sample training data to speed things up). That sample data is of no use to me, because I have my own training data to train a model with; however, this data consists of a list of spaCy Doc objects, and honestly I don't know how to carry out this conversion. So far I have found three approaches, each with considerable weaknesses:
1. In spaCy's documentation, I found example code on how to build CoNLL output for a SINGLE Doc with the spacy_conll project, but note that it uses a blank spaCy model, so it is unclear where "my own labeled data" comes into play; besides, the conll_formatter component is "added at the end of the pipeline", so it seems that "no direct conversion from Doc to CoNLL is actually performed"... Is my understanding correct?
2. In the Prodigy forum (another product from the same designers as spaCy), I found this suggestion, but the "CoNLL" format (2003, I suppose?) seems incomplete: the POS tag appears to be missing (although it could easily be obtained via Token.pos_), and so does the "syntactic chunk" (for which no spaCy equivalent seems to exist). These four fields are mentioned in the official CoNLL 2003 documentation.
3. Speaking of a "direct conversion from Doc to CoNLL", I also found this implementation based on the textacy library, but it seems to have been deprecated in version 0.11.0, because "CONLL-U [...] wasn't enforced or guaranteed", so I'm not sure whether to use it (by the way, the latest textacy release at the time of writing is 0.12.0).
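To make those four columns concrete, a sentence in a CoNLL 2003 file looks like this (word, POS tag, syntactic chunk tag, NER tag; the fragment below is the classic example from the shared-task description, reproduced from memory, so treat the exact tags as illustrative):

```text
U.N.     NNP I-NP I-ORG
official NN  I-NP O
Ekeus    NNP I-NP I-PER
heads    VBZ I-VP O
for      IN  I-PP O
Baghdad  NNP I-NP I-LOC
.        .   O    O
```

Sentences are separated by a blank line, and each document starts with a `-DOCSTART- -X- -X- O` line.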
My current code looks like this:
import spacy
from spacy.training import offsets_to_biluo_tags
from spacy.tokens import Span

print("SPACY HELPER MODEL")
base_model = "en_core_web_sm"
nlp = spacy.load(base_model)
to_disable = ['parser', 'lemmatizer', 'ner']
_ = [nlp.remove_pipe(item) for item in to_disable]
print("Base model used: ", base_model)
print("Removed components: ", to_disable)
print("Enabled components: ", nlp.pipe_names)

# Assume text is already available as sentences...
# so no need for spaCy `sentencizer` or similar
print("\nDEMO SPACY DOC LIST BUILDING...", end="")
doc1 = nlp("iPhone X is coming.")
doc1.ents = [Span(doc1, 0, 2, label="GADGET")]
doc2 = nlp("Space X is nice.")
doc2.ents = [Span(doc2, 0, 2, label="BRAND")]  # Span must be built on doc2, not doc1
docs = [doc1, doc2]
print("DONE!")

print("\nCoNLL 2003 CONVERSION:\n")
results = []
for doc in docs:
    # Preliminary: whole sentence
    whole_sentence = doc.text
    # 1st item (CoNLL 2003): word
    words = [token.text for token in doc]
    # 2nd item (CoNLL 2003): POS
    pos = [token.tag_ for token in doc]
    # 3rd item (CoNLL 2003): syntactic chunk tag
    sct = ["[UNKNOWN]" for token in doc]
    # 4th item (CoNLL 2003): named entities
    spacy_entities = [
        (ent.start_char, ent.end_char, ent.label_)
        for ent in doc.ents
    ]
    biluo_entities = offsets_to_biluo_tags(doc, spacy_entities)
    results.append((whole_sentence, words, pos, sct, biluo_entities))

for result in results:
    print(
        "\nDOC TEXT (NOT included in CoNLL 2003, just for demo): ",
        result[0], "\n"
    )
    print("-DOCSTART- -X- -X- O")
    for w, x, y, z in zip(result[1], result[2], result[3], result[4]):
        print(w, x, y, z)

# Pending: write to a file, but that's easy, and out of topic.
Which gives the following output:
DOC TEXT (NOT included in CoNLL 2003, just for demo): iPhone X is coming.
-DOCSTART- -X- -X- O
iPhone NNP [UNKNOWN] B-GADGET
X NNP [UNKNOWN] L-GADGET
is VBZ [UNKNOWN] O
coming VBG [UNKNOWN] O
. . [UNKNOWN] O
DOC TEXT (NOT included in CoNLL 2003, just for demo): Space X is nice.
-DOCSTART- -X- -X- O
Space NNP [UNKNOWN] B-BRAND
X NNP [UNKNOWN] L-BRAND
is VBZ [UNKNOWN] O
nice JJ [UNKNOWN] O
. . [UNKNOWN] O
Have you done something like this before?
Thanks!
If you look at sample CoNLL files, you'll see that they just separate entries with a blank line. So you just need a for loop:
for doc in docs:
    for sent in doc.sents:
        print("#", doc)  # optional but makes it easier to read
        print(sent._.conll_str)
        print()
CoNLL files are split by sentence, not by spaCy Doc, but if you don't have sentence boundaries you can just loop over the docs. There also seems to be an option to turn on headers directly in the component; see their README.
Not sure if that helps, but it's what I can add.
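To expand slightly on the answer above: if the goal is a file rather than stdout, the per-sentence blocks (e.g. the `sent._.conll_str` strings) can be joined with blank lines before writing. A minimal stdlib-only sketch, with helper names of my own invention:

```python
def join_conll(blocks):
    """Join per-sentence CoNLL blocks with one blank line between
    them, and end the result with a single trailing newline."""
    return "\n\n".join(b.rstrip("\n") for b in blocks) + "\n"


def write_conll(blocks, path):
    """Write the joined CoNLL blocks to `path`."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(join_conll(blocks))
```

This could be called as e.g. `write_conll([sent._.conll_str for doc in docs for sent in doc.sents], "train.conll")`, assuming the conll_formatter component is in the pipeline.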
With @AlbertoAndreotti's help, I managed to put together a working workaround:
import spacy
from spacy.training import offsets_to_biluo_tags
from spacy.tokens import Span

print("SPACY HELPER MODEL")
base_model = "en_core_web_sm"
nlp = spacy.load(base_model)
to_disable = ['parser', 'lemmatizer', 'ner']
_ = [nlp.remove_pipe(item) for item in to_disable]
print("Base model used: ", base_model)
print("Removed components: ", to_disable)
print("Enabled components: ", nlp.pipe_names)

# Assume text is already available as sentences...
# so no need for spaCy `sentencizer` or similar
print("\nDEMO SPACY DOC LIST BUILDING...", end="")
doc1 = nlp("iPhone X is coming.")
doc1.ents = [Span(doc1, 0, 2, label="GADGET")]
doc2 = nlp("Space X is nice.")
doc2.ents = [Span(doc2, 0, 2, label="BRAND")]  # Span must be built on doc2, not doc1
docs = [doc1, doc2]
print("DONE!")

print("\nCoNLL 2003 CONVERSION:\n")
results = []
for doc in docs:
    # Preliminary: whole sentence
    whole_sentence = doc.text
    # 1st item (CoNLL 2003): word
    words = [token.text for token in doc]
    # 2nd item (CoNLL 2003): POS
    pos = [token.tag_ for token in doc]
    # 3rd item (CoNLL 2003): syntactic chunk tag
    # sct = pos  # Redundant, so left out; POS is reused below
    # 4th item (CoNLL 2003): named entities
    spacy_entities = [
        (ent.start_char, ent.end_char, ent.label_)
        for ent in doc.ents
    ]
    biluo_entities = offsets_to_biluo_tags(doc, spacy_entities)
    results.append((whole_sentence, words, pos, biluo_entities))

for result in results:
    print(
        "\nDOC TEXT (NOT included in CoNLL 2003, just for demo): ",
        result[0], "\n"
    )
    print("-DOCSTART- -X- -X- O")
    # POS (result[2]) doubles as the chunk tag column
    for w, x, y, z in zip(result[1], result[2], result[2], result[3]):
        print(w, x, y, z)
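One caveat about the tags above: offsets_to_biluo_tags produces BILUO tags (note the L-GADGET / L-BRAND entries in the demo output), while the original CoNLL 2003 data uses IOB-style tags. If the downstream consumer is strict about the tag scheme, L- can be mapped to I- and U- to B-. A stdlib-only sketch (the helper names are my own; spaCy itself also ships a biluo_to_iob utility in spacy.training, if memory serves):

```python
def biluo_to_iob(tags):
    """Map BILUO NER tags (B-/I-/L-/U-/O) to IOB2 (B-/I-/O):
    L- (last token inside) becomes I-, U- (unit-length entity) becomes B-."""
    out = []
    for tag in tags:
        if tag.startswith("L-"):
            out.append("I-" + tag[2:])
        elif tag.startswith("U-"):
            out.append("B-" + tag[2:])
        else:
            out.append(tag)
    return out


def to_conll2003_block(words, pos, biluo):
    """Format one sentence as CoNLL-2003-style lines, reusing POS as
    the chunk tag column, as in the workaround above."""
    iob = biluo_to_iob(biluo)
    return "\n".join(f"{w} {p} {p} {t}" for w, p, t in zip(words, pos, iob))
```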
As additional information, I found out that the missing third item, the "syntactic chunk tag", is related to a broader problem called "phrase chunking", which happens to be an open problem in computer science with only approximate solutions, so regardless of the library used, the conversion of that 3rd item specifically to CoNLL 2003 may be inaccurate. However, it seems that Spark NLP simply does not care about the 2nd and 3rd items, so the workaround suggested here is acceptable.
For more details, you may want to follow this thread.