Unix command and regular expression to remove XML tags from a corpus

Question

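I am working with a corpus text and wrote the Unix command and regular expression below.

I only want to extract the text of the English segments, without the XML, and put it in a file named "file.txt".

The code below removes only <seg> and leaves the closing XML tag </seg> in place. See the input and output to understand my problem.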

cat uncorpora_plain.txt |grep -a1 '<tuv xml:lang="EN">' |grep '<seg>' |perl -pe 's/\<seg>\b/''/'

Part of the text before extraction:

  <tuv xml:lang="EN">
    <seg>Adopted at the 81st plenary meeting, on 4 December 2000, on 
   the recommendation of the Committee (A/55/602/Add.2 and Corr.1, 
   para. 94), by a recorded vote of 106 to 1, with 67 abstentions, as 
   follows:</seg>

Output after running the Unix command:

    Adopted at the 81st plenary meeting, on 4 December 2000, on the 
 recommendation of the Committee (A/55/602/Add.2 and Corr.1, para. 94), 
 by a recorded vote of 106 to 1, with 67 abstentions, as follows:</seg>

Any help would be greatly appreciated!

Answer 1

sed -e 's/<[^>]*>//g' file.xml

This should work: on each line it deletes everything from a "<" up to the next ">".
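
As a rough illustration of what this substitution does and does not do, the sketch below wraps it in a helper named strip_tags (a name made up for this example) and feeds it two one-line segments. The FR line is invented sample data. Because the pattern never inspects xml:lang, text from both languages comes through, so a separate filter for the English segments would still be needed.

```shell
# Hypothetical helper wrapping Answer 1's substitution; reads from stdin.
strip_tags() {
    # Delete every "<...>" run on each line, opening and closing tags alike.
    sed -e 's/<[^>]*>//g'
}

# Invented two-line sample: one English segment, one French segment.
# Both lines of text are printed, because the regex ignores xml:lang.
printf '%s\n' '<tuv xml:lang="EN"><seg>Hello</seg></tuv>' \
              '<tuv xml:lang="FR"><seg>Bonjour</seg></tuv>' | strip_tags
```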

Answer 2

I'll repeat a well-worn rule: don't parse XML/HTML with awk/sed/grep; use a proper parser.

xmlstarlet is one such tool.

A valid XML sample:

<root>
 <tuv xml:lang="EN">
    <seg>Adopted at the 81st plenary meeting, on 4 December 2000, on 
   the recommendation of the Committee (A/55/602/Add.2 and Corr.1, 
   para. 94), by a recorded vote of 106 to 1, with 67 abstentions, as 
   follows:</seg>
</tuv>
 <tuv xml:lang="UA">
    <seg>УкраÏна - унікальна країна,
     багата талановитими людьми ...</seg>
</tuv>
</root>

Command:

xmlstarlet sel -t -v "//tuv[@xml:lang='EN']//seg" -n input.xml > uncorpus.eng.txt

Contents of uncorpus.eng.txt:

Adopted at the 81st plenary meeting, on 4 December 2000, on 
   the recommendation of the Committee (A/55/602/Add.2 and Corr.1, 
   para. 94), by a recorded vote of 106 to 1, with 67 abstentions, as 
   follows:
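
One way to see why the rule at the top of this answer matters: a regex can delete the tags, but it knows nothing about XML itself. The sketch below uses a made-up helper name (strip_tags_naive) and an invented one-line segment to show that an entity reference is left as-is by a tag-stripping regex, whereas an XML parser treats the text content of that element as "A & B".

```shell
# Hypothetical helper: regex-based tag removal, not a real XML parser.
strip_tags_naive() {
    sed -e 's/<[^>]*>//g'
}

# The tags disappear, but "&amp;" is left undecoded in the output,
# because sed only sees characters, not XML entity references.
printf '%s\n' '<seg>A &amp; B</seg>' | strip_tags_naive
```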

Answer 3

It sounds like this is what you want (it relies on GNU awk for the multi-character RS and for \s):

awk -v RS='</seg>' 'sub(/.*<tuv\s+xml:lang="EN">\s*<seg>/,"")'

but without testable sample input/output, that is a guess. FWIW, here it is running on the input that @RomanPerekhrest made up:

$ cat file
<root>
 <tuv xml:lang="EN">
    <seg>Adopted at the 81st plenary meeting, on 4 December 2000, on
   the recommendation of the Committee (A/55/602/Add.2 and Corr.1,
   para. 94), by a recorded vote of 106 to 1, with 67 abstentions, as
   follows:</seg>
</tuv>
 <tuv xml:lang="UA">
    <seg>УкраÏна - унікальна країна,
     багата талановитими людьми ...</seg>
</tuv>
</root>

$ awk -v RS='</seg>' 'sub(/.*<tuv\s+xml:lang="EN">\s*<seg>/,"")' file
Adopted at the 81st plenary meeting, on 4 December 2000, on
   the recommendation of the Committee (A/55/602/Add.2 and Corr.1,
   para. 94), by a recorded vote of 106 to 1, with 67 abstentions, as
   follows:

If you also want to get rid of the whitespace at the start of each line:

$ awk -v RS='</seg>' 'sub(/.*<tuv\s+xml:lang="EN">\s*<seg>/,""){ gsub(/\n[[:blank:]]*/,"\n"); print}' file
Adopted at the 81st plenary meeting, on 4 December 2000, on
the recommendation of the Committee (A/55/602/Add.2 and Corr.1,
para. 94), by a recorded vote of 106 to 1, with 67 abstentions, as
follows:

3 solutions

Solution 1 1 2017-09-16 02:17:20

Solution 2 1 2017-09-16 08:03:31

Solution 3 0 2017-09-16 17:10:24
