Bash extract all lines between certain tags

I'm trying to find a command that extracts several strings, depending on the number of tags in an XML file. My file is structured like this:

<task id="0">
some stuff
</task>

<task id="1">
some other stuff
</task>
  1. How can I get all the text between the opening and closing tags? I've tried awk and sed, but with no success.
  2. Will I be able to create multiple strings depending on the number of <task> tags? I mean, when I start from id="0", will the match end at the correct </task> tag or at the last one in the file?

I advise against handling XML content with line-oriented tools such as grep, sed, or awk. XML is not a line-oriented format, so the way its elements happen to be distributed across lines in the text is incidental. (Your example could be written on a single line and would still be equally correct XML.)

For parsing well-formed XML content in shell scripts, I suggest the xmlstarlet tool. It's a sort of Swiss Army knife for dealing with XML in a scriptable way.

First, make sure your XML content is well formed. The following is a well-formed XML document containing the data from your example:

<?xml version="1.0" encoding="UTF-8"?>
<tasks>
<task id="0">some stuff</task>
<task id="1">some other stuff</task>
<task id="2">yet another stuff</task>
</tasks>

(The well-formedness of an XML file can be checked with xmlstarlet val.)

To extract content from the XML, use xmlstarlet sel. This tool takes XPath expressions that select which content to extract. (In most ways, xmlstarlet sel and XPath are to XML what grep and regular expressions are to line-oriented content.)

Examples using the above XML sample, saved in the file tasks.xml:

Extract the content of all tasks

$ xmlstarlet sel -T -t -m '/tasks/task' -v '.' -n tasks.xml 
some stuff
some other stuff
yet another stuff

Get all task ids

$ xmlstarlet sel -T -t -m '/tasks/task' -v '@id' -n tasks.xml 
0
1
2

Extract the content of task 0

$ xmlstarlet sel -T -t -m '/tasks/task[@id="0"]' -v '.' -n tasks.xml 
some stuff

Extract the content of all tasks whose id is greater than or equal to 1

$ xmlstarlet sel -T -t -m '/tasks/task[@id>="1"]' -v '.' -n tasks.xml
some other stuff
yet another stuff

Naive conversion to CSV format

$ xmlstarlet sel -T -t -m '/tasks/task' -v '@id' -o ',' -v '.' -n tasks.xml 
0,some stuff
1,some other stuff
2,yet another stuff

With GNU sed:

sed -n '/<task id=/{n;:a;p;n;/<\/task>/!ba;s/.*/---/p;}' filename

This will output:

some stuff
---
some other stuff
---

This searches for each <task id= in the file and loops until the next </task>. The s/.*/---/p; part converts the closing tag into a separator; you can remove it to get all the strings concatenated.
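As a rough sketch of that last remark, assuming GNU sed and a throwaway sample file (the demo variable and its temporary path are made up for this example), dropping the substitution leaves only the task bodies, one per line. Joining them with paste is one possible way to end up with a single string:

```shell
# Hypothetical sample file with the layout from the question.
demo=$(mktemp)
cat > "$demo" <<'EOF'
<task id="0">
some stuff
</task>

<task id="1">
some other stuff
</task>
EOF

# Same loop as the command above, minus the s/.*/---/p; separator step.
bodies=$(sed -n '/<task id=/{n;:a;p;n;/<\/task>/!ba;}' "$demo")
printf '%s\n' "$bodies"

# Optionally join the bodies into one space-separated string.
joined=$(printf '%s\n' "$bodies" | paste -sd' ' -)
printf '%s\n' "$joined"

rm -f "$demo"
```

Like the original command, this relies on each tag sitting on its own line.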

I made an HTML/XML pattern matcher, xidel, for something like this.

For example, for the first task you can do:

$ xidel /tmp/xxx.xml -e '<task id="0">{.}</task>'
some stuff

Or for all tasks:

$ xidel /tmp/xxx.xml -e '<task>{.}</task>+'
some stuff
some other stuff

Although in your case, with only a single element, it is simpler to use XPath:

Get the first task:

$ xidel /tmp/xxx.xml -e '//task[@id=0]'
some stuff

Get all the task content:

$ xidel /tmp/xxx.xml -e //task
some stuff
some other stuff

This can be done in many ways. The easiest, in my opinion, is awk. Put this in a file called task.awk:

BEGIN{x=0;}
/^<\/task>/{x=0;}
{if(x==1)print $0;}
/^<task [^>]*>/{x=1;}

Then, if your XML is in task.xml, you can run:

awk -f task.awk < task.xml

How it works:

  1. At the beginning, set the flag to false.
  2. Then first check whether the flag should be turned off because the line is a closing tag.
    • Doing this first prevents the closing tag from being printed.
  3. Then print the line only if the flag is on.
  4. Finally, check whether the flag should be turned on because the line is an opening tag.
    • Doing this last prevents the opening tag from being printed.
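The same flag logic can be narrowed to a single task, which also touches on the second part of the question: the flag is cleared at the first closing tag after the match, not at the last one in the file. The following is only a sketch; the function name task_body, the awk variable want, and the temporary sample file are invented for this example, and it assumes one tag per line and ids that contain no regex metacharacters.

```shell
# Hypothetical sample file with the layout from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<task id="0">
some stuff
</task>

<task id="1">
some other stuff
</task>
EOF

# Print the body of one task: $1 is the id, $2 is the file.
task_body() {
    awk -v want="$1" '
        /^<\/task>/ { inside = 0 }   # closing tag: lower the flag first
        inside      { print }        # print only while the flag is up
        $0 ~ ("^<task id=\"" want "\">") { inside = 1 }   # raise it last
    ' "$2"
}

task_body 0 "$sample"
task_body 1 "$sample"
```

Each call prints only the lines of the requested task, so the results can be captured into separate shell variables with command substitution.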

Given this file as the source in /tmp/data.xml:

<task id="0">
some1 stuff for id 0
some2 stuff for id 0
</task>

<task id="1">
some1 stuff for id 1
some2 stuff for id 1
</task>

This code:

awk '
/<task id=/{tag_data=""}              # opening tag: start a fresh record
{tag_data=tag_data $0 " "}            # append every line, tags included
/<\/task>/{print tag_data}' < /tmp/data.xml

produces the needed result:

<task id="0"> some1 stuff for id 0 some2 stuff for id 0 </task> 
<task id="1"> some1 stuff for id 1 some2 stuff for id 1 </task> 

It does the following: it searches for the opening tag and starts accumulating data in the variable tag_data until it meets the closing tag. At the closing tag, tag_data holds all the data between the opening and closing tags. You can easily modify the code to not store the tags, or even to parse the id and store it in a separate variable.
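One possible shape for that modification is sketched below. It is not the original author's code: the function name summarize_tasks, the awk variables id, body and inside, and the temporary file are all invented here, and the id parsing assumes purely numeric ids with one tag per line.

```shell
# Hypothetical copy of the sample data shown above.
data=$(mktemp)
cat > "$data" <<'EOF'
<task id="0">
some1 stuff for id 0
some2 stuff for id 0
</task>

<task id="1">
some1 stuff for id 1
some2 stuff for id 1
</task>
EOF

# Print one line per task in the form "id: body", without the tags.
summarize_tasks() {
    awk '
    /<task id=/ {
        id = $0
        gsub(/[^0-9]/, "", id)   # keep only the digits of the id
        body = ""
        inside = 1
        next                     # do not store the opening tag
    }
    /<\/task>/ {
        print id ": " body
        inside = 0
        next                     # do not store the closing tag
    }
    inside { body = (body == "" ? $0 : body " " $0) }
    ' "$1"
}

summarize_tasks "$data"
```

Keeping the id in its own variable means each output line can later be split on the first colon to recover the id and the body separately.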
