Unix command & regex to remove XML tag from Corpus
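I am working with a corpus text and wrote the Unix command and regex below.
I only want to extract the text of the English segments, without the XML, and put it into a file named "file.txt".
The code below removes only the opening <seg> tag but keeps the closing </seg> tag. See the input and output to understand my problem.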
cat uncorpora_plain.txt |grep -a1 '<tuv xml:lang="EN">' |grep '<seg>' |perl -pe 's/\<seg>\b/''/'
Part of the text before extraction:
<tuv xml:lang="EN">
<seg>Adopted at the 81st plenary meeting, on 4 December 2000, on
the recommendation of the Committee (A/55/602/Add.2 and Corr.1,
para. 94), by a recorded vote of 106 to 1, with 67 abstentions, as
follows:</seg>
Output after running the Unix command:
Adopted at the 81st plenary meeting, on 4 December 2000, on the
recommendation of the Committee (A/55/602/Add.2 and Corr.1, para. 94),
by a recorded vote of 106 to 1, with 67 abstentions, as follows:</seg>
Your help would be greatly appreciated!
sed -e 's/<[^>]*>//g' file.xml
This should work; it deletes everything between a < and the next > on each line.
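As a rough sketch of how that substitution might slot into the pipeline from the question: the helper name extract_en and the sample file are made up for illustration, and the sketch assumes each <seg>...</seg> pair sits on a single line directly after its <tuv> line (the joined output in the question suggests this, but it is not confirmed). It uses grep -A1 (one line of trailing context) rather than the question's -1.

```shell
# Hypothetical helper: print the text of English segments, one per line.
# Assumption: each <seg>...</seg> pair is on one line, right after its <tuv> line.
extract_en() {
    grep -a -A1 '<tuv xml:lang="EN">' "$1" |  # each EN <tuv> line plus the line after it
        grep -a '<seg>' |                       # keep only the segment lines
        sed -e 's/<[^>]*>//g'                   # strip every tag, opening and closing
}

# Small made-up sample in the assumed one-line-per-segment layout.
cat > sample.xml <<'EOF'
<tuv xml:lang="EN">
<seg>Adopted at the 81st plenary meeting, as follows:</seg>
</tuv>
<tuv xml:lang="FR">
<seg>Bonjour le monde</seg>
</tuv>
EOF

extract_en sample.xml > file.txt
```

If only the <seg> tags should go and other markup should stay, swapping the sed step for perl -pe 's{</?seg>}{}g' would be the narrower variant: the optional slash covers the closing tag and the g flag covers every occurrence on the line.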
I'll repeat the well-worn rule: don't parse XML/HTML with awk/sed/grep; use a proper parser.
xmlstarlet is one of them.
A valid XML sample:
<root>
<tuv xml:lang="EN">
<seg>Adopted at the 81st plenary meeting, on 4 December 2000, on
the recommendation of the Committee (A/55/602/Add.2 and Corr.1,
para. 94), by a recorded vote of 106 to 1, with 67 abstentions, as
follows:</seg>
</tuv>
<tuv xml:lang="UA">
<seg>Україна - унікальна країна,
багата талановитими людьми ...</seg>
</tuv>
</root>
Command:
xmlstarlet sel -t -v "//tuv[@xml:lang='EN']//seg" -n input.xml > uncorpus.eng.txt
Contents of uncorpus.eng.txt:
Adopted at the 81st plenary meeting, on 4 December 2000, on
the recommendation of the Committee (A/55/602/Add.2 and Corr.1,
para. 94), by a recorded vote of 106 to 1, with 67 abstentions, as
follows:
It sounds like this is what you want (it uses GNU awk for the multi-character RS):
awk -v RS='</seg>' 'sub(/.*<tuv\s+xml:lang="EN">\s*<seg>/,"")'
But without testable sample input/output, it's a guess. Here it is, FWIW, run on the input that @RomanPerekhrest made up:
$ cat file
<root>
<tuv xml:lang="EN">
<seg>Adopted at the 81st plenary meeting, on 4 December 2000, on
the recommendation of the Committee (A/55/602/Add.2 and Corr.1,
para. 94), by a recorded vote of 106 to 1, with 67 abstentions, as
follows:</seg>
</tuv>
<tuv xml:lang="UA">
<seg>Україна - унікальна країна,
багата талановитими людьми ...</seg>
</tuv>
</root>
$ awk -v RS='</seg>' 'sub(/.*<tuv\s+xml:lang="EN">\s*<seg>/,"")' file
Adopted at the 81st plenary meeting, on 4 December 2000, on
the recommendation of the Committee (A/55/602/Add.2 and Corr.1,
para. 94), by a recorded vote of 106 to 1, with 67 abstentions, as
follows:
If you want to get rid of the leading whitespace on each line:
$ awk -v RS='</seg>' 'sub(/.*<tuv\s+xml:lang="EN">\s*<seg>/,""){ gsub(/\n[[:blank:]]*/,"\n"); print}' file
Adopted at the 81st plenary meeting, on 4 December 2000, on
the recommendation of the Committee (A/55/602/Add.2 and Corr.1,
para. 94), by a recorded vote of 106 to 1, with 67 abstentions, as
follows: