Convert spaCy `Doc` into CoNLL 2003 sample
I plan to train a Spark NLP custom NER model, which uses the CoNLL 2003 format for training (that blog even provides some sample training data to speed things up). That sample data is of no use to me, because I have my own training data to train a model with; however, this data consists of a list of spaCy Doc objects, and honestly I don't know how to carry out this conversion. So far I have found three approaches, each with considerable weaknesses:
1. In spaCy's documentation, I found example code on how to build CoNLL output for a SINGLE Doc with the spacy_conll project, but note that it uses a blank spaCy model, so it is unclear where "my own labeled data" comes into play; besides, the conll_formatter component is "added at the end of the pipeline", so it seems that "no direct conversion from Doc to CoNLL is actually performed"... Is my understanding correct?
2. In the Prodigy forum (another product from the same designers as spaCy), I found this suggestion, but the "CoNLL" format (2003, I suppose?) seems incomplete: the POS tag appears to be missing (although it could easily be obtained via Token.pos_), and so does the "syntactic chunk" (for which no spaCy equivalent seems to exist). These four fields are mentioned in the official CoNLL 2003 documentation.
3. Speaking of a "direct conversion from Doc to CoNLL", I also found this implementation based on the textacy library, but it seems to have been deprecated in version 0.11.0, because "CONLL-U [...] wasn't enforced or guaranteed", so I'm not sure whether to use it (by the way, the latest textacy release at the time of writing is 0.12.0).
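To make those four columns concrete, a sentence in a CoNLL 2003 file looks like this (word, POS tag, syntactic chunk tag, NER tag; the fragment below is the classic example from the shared-task description, reproduced from memory, so treat the exact tags as illustrative):

```text
U.N.     NNP I-NP I-ORG
official NN  I-NP O
Ekeus    NNP I-NP I-PER
heads    VBZ I-VP O
for      IN  I-PP O
Baghdad  NNP I-NP I-LOC
.        .   O    O
```

Sentences are separated by a blank line, and each document starts with a `-DOCSTART- -X- -X- O` line.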
My current code looks like this:
import spacy
from spacy.training import offsets_to_biluo_tags
from spacy.tokens import Span

print("SPACY HELPER MODEL")
base_model = "en_core_web_sm"
nlp = spacy.load(base_model)
to_disable = ['parser', 'lemmatizer', 'ner']
_ = [nlp.remove_pipe(item) for item in to_disable]
print("Base model used: ", base_model)
print("Removed components: ", to_disable)
print("Enabled components: ", nlp.pipe_names)

# Assume text is already available as sentences...
# so no need for spaCy `sentencizer` or similar
print("\nDEMO SPACY DOC LIST BUILDING...", end="")
doc1 = nlp("iPhone X is coming.")
doc1.ents = [Span(doc1, 0, 2, label="GADGET")]
doc2 = nlp("Space X is nice.")
doc2.ents = [Span(doc2, 0, 2, label="BRAND")]  # Span must be built on doc2, not doc1
docs = [doc1, doc2]
print("DONE!")

print("\nCoNLL 2003 CONVERSION:\n")
results = []
for doc in docs:
    # Preliminary: whole sentence
    whole_sentence = doc.text
    # 1st item (CoNLL 2003): word
    words = [token.text for token in doc]
    # 2nd item (CoNLL 2003): POS
    pos = [token.tag_ for token in doc]
    # 3rd item (CoNLL 2003): syntactic chunk tag
    sct = ["[UNKNOWN]" for token in doc]
    # 4th item (CoNLL 2003): named entities
    spacy_entities = [
        (ent.start_char, ent.end_char, ent.label_)
        for ent in doc.ents
    ]
    biluo_entities = offsets_to_biluo_tags(doc, spacy_entities)
    results.append((whole_sentence, words, pos, sct, biluo_entities))

for result in results:
    print(
        "\nDOC TEXT (NOT included in CoNLL 2003, just for demo): ",
        result[0], "\n"
    )
    print("-DOCSTART- -X- -X- O")
    for w, x, y, z in zip(result[1], result[2], result[3], result[4]):
        print(w, x, y, z)

# Pending: write to a file, but that's easy, and out of topic.
Which gives the following output:
DOC TEXT (NOT included in CoNLL 2003, just for demo): iPhone X is coming.
-DOCSTART- -X- -X- O
iPhone NNP [UNKNOWN] B-GADGET
X NNP [UNKNOWN] L-GADGET
is VBZ [UNKNOWN] O
coming VBG [UNKNOWN] O
. . [UNKNOWN] O
DOC TEXT (NOT included in CoNLL 2003, just for demo): Space X is nice.
-DOCSTART- -X- -X- O
Space NNP [UNKNOWN] B-BRAND
X NNP [UNKNOWN] L-BRAND
is VBZ [UNKNOWN] O
nice JJ [UNKNOWN] O
. . [UNKNOWN] O
Have you done something like this before?
Thanks!
If you look at sample CoNLL files, you'll see that they just separate entries with a blank line. So you just need a for loop:
for doc in docs:
    for sent in doc.sents:
        print("#", doc)  # optional but makes it easier to read
        print(sent._.conll_str)
        print()
CoNLL files are split by sentence, not by spaCy Doc, but if you don't have sentence boundaries you can just loop over the docs. There also seems to be an option to turn on headers directly in the component; see their README.
Not sure if that helps, but it's what I can add.
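To expand slightly on the answer above: if the goal is a file rather than stdout, the per-sentence blocks (e.g. the `sent._.conll_str` strings) can be joined with blank lines before writing. A minimal stdlib-only sketch, with helper names of my own invention:

```python
def join_conll(blocks):
    """Join per-sentence CoNLL blocks with one blank line between
    them, and end the result with a single trailing newline."""
    return "\n\n".join(b.rstrip("\n") for b in blocks) + "\n"


def write_conll(blocks, path):
    """Write the joined CoNLL blocks to `path`."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(join_conll(blocks))
```

This could be called as e.g. `write_conll([sent._.conll_str for doc in docs for sent in doc.sents], "train.conll")`, assuming the conll_formatter component is in the pipeline.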
With @AlbertoAndreotti's help, I managed to put together a working workaround:
import spacy
from spacy.training import offsets_to_biluo_tags
from spacy.tokens import Span

print("SPACY HELPER MODEL")
base_model = "en_core_web_sm"
nlp = spacy.load(base_model)
to_disable = ['parser', 'lemmatizer', 'ner']
_ = [nlp.remove_pipe(item) for item in to_disable]
print("Base model used: ", base_model)
print("Removed components: ", to_disable)
print("Enabled components: ", nlp.pipe_names)

# Assume text is already available as sentences...
# so no need for spaCy `sentencizer` or similar
print("\nDEMO SPACY DOC LIST BUILDING...", end="")
doc1 = nlp("iPhone X is coming.")
doc1.ents = [Span(doc1, 0, 2, label="GADGET")]
doc2 = nlp("Space X is nice.")
doc2.ents = [Span(doc2, 0, 2, label="BRAND")]  # Span must be built on doc2, not doc1
docs = [doc1, doc2]
print("DONE!")

print("\nCoNLL 2003 CONVERSION:\n")
results = []
for doc in docs:
    # Preliminary: whole sentence
    whole_sentence = doc.text
    # 1st item (CoNLL 2003): word
    words = [token.text for token in doc]
    # 2nd item (CoNLL 2003): POS
    pos = [token.tag_ for token in doc]
    # 3rd item (CoNLL 2003): syntactic chunk tag
    # sct = pos  # Redundant, so left out; POS is reused below
    # 4th item (CoNLL 2003): named entities
    spacy_entities = [
        (ent.start_char, ent.end_char, ent.label_)
        for ent in doc.ents
    ]
    biluo_entities = offsets_to_biluo_tags(doc, spacy_entities)
    results.append((whole_sentence, words, pos, biluo_entities))

for result in results:
    print(
        "\nDOC TEXT (NOT included in CoNLL 2003, just for demo): ",
        result[0], "\n"
    )
    print("-DOCSTART- -X- -X- O")
    # POS (result[2]) doubles as the chunk tag column
    for w, x, y, z in zip(result[1], result[2], result[2], result[3]):
        print(w, x, y, z)
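One caveat about the tags above: offsets_to_biluo_tags produces BILUO tags (note the L-GADGET / L-BRAND entries in the demo output), while the original CoNLL 2003 data uses IOB-style tags. If the downstream consumer is strict about the tag scheme, L- can be mapped to I- and U- to B-. A stdlib-only sketch (the helper names are my own; spaCy itself also ships a biluo_to_iob utility in spacy.training, if memory serves):

```python
def biluo_to_iob(tags):
    """Map BILUO NER tags (B-/I-/L-/U-/O) to IOB2 (B-/I-/O):
    L- (last token inside) becomes I-, U- (unit-length entity) becomes B-."""
    out = []
    for tag in tags:
        if tag.startswith("L-"):
            out.append("I-" + tag[2:])
        elif tag.startswith("U-"):
            out.append("B-" + tag[2:])
        else:
            out.append(tag)
    return out


def to_conll2003_block(words, pos, biluo):
    """Format one sentence as CoNLL-2003-style lines, reusing POS as
    the chunk tag column, as in the workaround above."""
    iob = biluo_to_iob(biluo)
    return "\n".join(f"{w} {p} {p} {t}" for w, p, t in zip(words, pos, iob))
```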
As additional information, I found out that the missing third item, the "syntactic chunk tag", is related to a broader problem called "phrase chunking", which happens to be an open problem in computer science with only approximate solutions, so regardless of the library used, the conversion of that 3rd item specifically to CoNLL 2003 may be inaccurate. However, it seems that Spark NLP simply does not care about the 2nd and 3rd items, so the workaround suggested here is acceptable.
For more details, you may want to follow this thread.