How to use spaCy's convert to keep paragraph information from conllu files?
I'm trying to convert conllu files to spaCy's jsonl format. These conllu files contain paragraph information as specified on the Universal Dependencies website. The problem is that the paragraph information is not carried over to the converted jsonl file, where each paragraph contains a single sentence.
I'm running spaCy version 2.1.3 and using only the required arguments of the spacy convert command, basically:
python -m spacy convert input.conllu output_dir
Here are the first few sentences from one of my conllu files (maybe they are not to specification?). For the sake of readability, I'm only pasting the first few tokens of each sentence.
# sent_id = tp2-p1-s1
# O cansaço começou a afetar os vestibulandos no terceiro dia de exame da Fuvest.
1 O O DET DET gender=MASCULINE|number=SINGULAR|proper=NOT_PROPER 2 DET _ _
2 cansaço cansaço NOUN NOUN gender=MASCULINE|number=SINGULAR|proper=NOT_PROPER 5 NSUBJ _ _
3 começou começar VERB VERB aspect=PERFECTIVE|mood=INDICATIVE|number=SINGULAR|person=THIRD|proper=NOT_PROPER|tense=PAST 5 AUX _ _
# sent_id = tp2-p1-s2
# "Estou meio cheia, mesmo", afirmou a candidata a filosofia Scyla Pereira Gouveia, 19, que fez as provas de biologia e química, de ontem, no colégio Pueri Domus.
1 " " PUNCT PUNCT proper=NOT_PROPER 2 P _ _
2 Estou Estar VERB VERB aspect=IMPERFECTIVE|mood=INDICATIVE|number=SINGULAR|person=FIRST|proper=NOT_PROPER|tense=PRESENT 0 ROOT _ _
3 meio meio NOUN NOUN gender=MASCULINE|number=SINGULAR|proper=NOT_PROPER 2 DOBJ _ _
4 cheia cheio ADJ ADJ gender=MASCULINE|number=SINGULAR|proper=NOT_PROPER 3 AMOD _ _
# sent_id = tp2-p1-s3
# Seu namorado, Guilherme Schneider, 18, que presta engenharia, faz exame no mesmo local.
1 Seu Seu PRON PRON gender=MASCULINE|number=SINGULAR|person=THIRD|proper=NOT_PROPER 2 DET _ _
2 namorado namorado NOUN NOUN gender=MASCULINE|number=SINGULAR|proper=NOT_PROPER 13 NSUBJ _ _
# newpar id = tp2-p2
# sent_id = tp2-p2-s1
# Pelo menos um dos 38.454 convocados para a segunda fase da Fuvest tem fortes motivos para não concluir hoje as provas.
1 Pelo Pelo ADP ADP gender=MASCULINE|number=SINGULAR|proper=NOT_PROPER 3 ADVMOD _ _
2 menos menos NOUN NOUN gender=MASCULINE|number=SINGULAR|proper=NOT_PROPER 1 MWE _ _
3 um um NUM NUM gender=MASCULINE|proper=NOT_PROPER 13 NSUBJ _ _
I expected the output of convert to be one file containing 2 lines, one for each paragraph. Instead, I'm getting 4 lines, one for each sentence.
I would really like to avoid building a converter of my own, if at all possible.
Thanks in advance.
As it turns out, spaCy is prepared to hold paragraph information but, as of the writing of this answer, that information is unused.
For now, when training models that are supposed to learn sentence segmentation, it's necessary to pass the --n-sents option to the converter.
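To see which paragraph grouping the conllu file itself describes, the `# newpar` comments can be read directly. The following is only a minimal sketch and not part of spaCy: the function name `group_sentences_by_paragraph` is made up for illustration, and it assumes that every sentence has a `# sent_id` comment and that sentences appearing before the first `# newpar` belong to an implicit first paragraph.

```python
def group_sentences_by_paragraph(conllu_text):
    """Group sentence ids by paragraph, using '# newpar' comments as boundaries.

    Hypothetical helper, not a spaCy API. Returns a list of paragraphs,
    each of which is a list of sent_id strings.
    """
    paragraphs = []
    for raw_line in conllu_text.splitlines():
        line = raw_line.strip()
        if line.startswith("# newpar"):
            # A new paragraph starts here.
            paragraphs.append([])
        elif line.startswith("# sent_id"):
            # Take everything after the first '=' as the sentence id.
            sent_id = line.partition("=")[2].strip()
            if not paragraphs:
                # Assumption: sentences before any '# newpar' form paragraph 1.
                paragraphs.append([])
            paragraphs[-1].append(sent_id)
    return paragraphs


# Comment lines taken from the excerpt in the question (token lines omitted).
sample = """\
# sent_id = tp2-p1-s1
# sent_id = tp2-p1-s2
# sent_id = tp2-p1-s3
# newpar id = tp2-p2
# sent_id = tp2-p2-s1
"""

paragraphs = group_sentences_by_paragraph(sample)
print([len(p) for p in paragraphs])  # [3, 1]
```

For the excerpt in the question this gives 2 paragraphs covering 4 sentences, which matches the grouping the asker expected from the converter.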