How to use spaCy's convert to keep paragraph information from conllu files?

I'm trying to convert conllu files to spaCy's jsonl format. These conllu files contain paragraph information as specified on the Universal Dependencies website. The problem is that the paragraph information does not carry over to the converted jsonl file, where each paragraph contains a single sentence.

I'm running spaCy version 2.1.3 and using only the obligatory arguments of the spacy convert command, basically python -m spacy convert input.conllu output_dir

Here are the first few sentences from one of my conllu files (maybe they are not to specification?). For the sake of readability, I'm only pasting the first few tokens of each sentence.

# sent_id = tp2-p1-s1
# O cansaço começou a afetar os vestibulandos no terceiro dia de exame da Fuvest.
1   O   O   DET DET gender=MASCULINE|number=SINGULAR|proper=NOT_PROPER  2   DET _   _
2   cansaço cansaço NOUN    NOUN    gender=MASCULINE|number=SINGULAR|proper=NOT_PROPER  5   NSUBJ   _   _
3   começou começar VERB    VERB    aspect=PERFECTIVE|mood=INDICATIVE|number=SINGULAR|person=THIRD|proper=NOT_PROPER|tense=PAST 5   AUX _   _

# sent_id = tp2-p1-s2
# "Estou meio cheia, mesmo", afirmou a candidata a filosofia Scyla Pereira Gouveia, 19, que fez as provas de biologia e química, de ontem, no colégio Pueri Domus.
1   "   "   PUNCT   PUNCT   proper=NOT_PROPER   2   P   _   _
2   Estou   Estar   VERB    VERB    aspect=IMPERFECTIVE|mood=INDICATIVE|number=SINGULAR|person=FIRST|proper=NOT_PROPER|tense=PRESENT    0   ROOT    _   _
3   meio    meio    NOUN    NOUN    gender=MASCULINE|number=SINGULAR|proper=NOT_PROPER  2   DOBJ    _   _
4   cheia   cheio   ADJ ADJ gender=MASCULINE|number=SINGULAR|proper=NOT_PROPER  3   AMOD    _   _

# sent_id = tp2-p1-s3
# Seu namorado, Guilherme Schneider, 18, que presta engenharia, faz exame no mesmo local.
1   Seu Seu PRON    PRON    gender=MASCULINE|number=SINGULAR|person=THIRD|proper=NOT_PROPER 2   DET _   _
2   namorado    namorado    NOUN    NOUN    gender=MASCULINE|number=SINGULAR|proper=NOT_PROPER  13  NSUBJ   _   _

# newpar id = tp2-p2
# sent_id = tp2-p2-s1
# Pelo menos um dos 38.454 convocados para a segunda fase da Fuvest tem fortes motivos para não concluir hoje as provas.
1   Pelo    Pelo    ADP ADP gender=MASCULINE|number=SINGULAR|proper=NOT_PROPER  3   ADVMOD  _   _
2   menos   menos   NOUN    NOUN    gender=MASCULINE|number=SINGULAR|proper=NOT_PROPER  1   MWE _   _
3   um  um  NUM NUM gender=MASCULINE|proper=NOT_PROPER  13  NSUBJ   _   _

I expected the output of convert to be one file containing 2 lines, one for each paragraph. Instead I'm getting 4 lines, one for each sentence.

I would really like to avoid building a converter of my own, if at all possible.

Thanks in advance

As it turns out, spaCy's format has room for paragraph information, but, as of the writing of this answer, that information goes unused.
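If the paragraph grouping is needed in the meantime, the # newpar comments can be read directly from the conllu file. Below is a minimal sketch of that idea, not spaCy's converter: group_sentences_by_paragraph is a hypothetical helper, the token lines in SAMPLE are shortened placeholders, and it assumes that sentences are separated by blank lines and that a file with no leading # newpar comment still opens a paragraph with its first sentence.

```python
import re

# Matches a "# newpar" comment, with or without an "id = ..." part.
NEWPAR = re.compile(r"#\s*newpar\b")


def group_sentences_by_paragraph(conllu_text):
    """Return a list of paragraphs; each paragraph is a list of sentences,
    and each sentence is the list of its raw lines (comments and tokens)."""
    paragraphs = []
    sentence = []
    new_paragraph = True  # assumption: the first sentence opens a paragraph

    def flush():
        nonlocal sentence, new_paragraph
        if not sentence:
            return
        if new_paragraph or not paragraphs:
            paragraphs.append([])
        paragraphs[-1].append(sentence)
        sentence = []
        new_paragraph = False

    for line in conllu_text.splitlines():
        line = line.rstrip()
        if not line:
            flush()  # a blank line ends the current sentence
            continue
        if NEWPAR.match(line):
            new_paragraph = True
        sentence.append(line)
    flush()  # the last sentence may not be followed by a blank line
    return paragraphs


# Shortened placeholder data modelled on the excerpt in the question.
SAMPLE = """\
# sent_id = tp2-p1-s1
1\tO\tO\tDET

# sent_id = tp2-p1-s2
2\tEstou\tEstar\tVERB

# sent_id = tp2-p1-s3
1\tSeu\tSeu\tPRON

# newpar id = tp2-p2
# sent_id = tp2-p2-s1
1\tPelo\tPelo\tADP
"""

paragraphs = group_sentences_by_paragraph(SAMPLE)
for number, paragraph in enumerate(paragraphs, start=1):
    print("paragraph", number, "has", len(paragraph), "sentence(s)")
```

On the sample this yields two paragraphs, with three sentences and one sentence respectively, which matches the grouping the question expected.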

For now, when training models that are supposed to learn sentence segmentation, it's necessary to pass the --n-sents option to the converter, which sets the number of sentences grouped into each document.
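As a rough illustration, the call could be assembled from Python like this. build_convert_command is a hypothetical helper, the file names are the ones from the question, and the value 10 is an arbitrary choice rather than a recommendation; note that this groups a fixed number of sentences per document and does not follow the paragraph boundaries in the file.

```python
import shlex  # shlex.join requires Python 3.8+
import sys


def build_convert_command(input_path, output_dir, n_sents):
    """Build the argument list for `python -m spacy convert` (hypothetical helper)."""
    return [
        sys.executable, "-m", "spacy", "convert",
        input_path, output_dir,
        "--n-sents", str(n_sents),  # number of sentences per document
    ]


command = build_convert_command("input.conllu", "output_dir", n_sents=10)
print(shlex.join(command))
# To actually run the conversion, pass the list to
# subprocess.run(command, check=True) in an environment with spaCy installed.
```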
