How to match and replace in a multiline HTML file with sed
I have a text file that looks something like this:
<tbody>
<tr>
<td>
String1
</td>
<td>
String2
</td>
<td>
String3
</td>
...
...
<td>
StringN
</td>
</tr>
</tbody>
This is the output that I want:
<tbody>
<tr>
String1;String2;String3;... ...;StringN
</tr>
</tbody>
Here is my buggy code:
sed '{
:a
N
$!ba
s|<td.*>\(.*\)</td>|\1|
}'
I wanted to remove all <td> and </td> tags and get all the strings delimited by some string (I can filter those strings later using that as the delimiter character). I used the solution given in this URL.
This is the actual code:
<tbody>
<tr>
<td>
<a href="/120.52.72.58/80">120.52.72.58:80</a>
</td>
<td>
HTTP
</td>
<td>
<span class="text-danger">Transparent</span>
</td>
<td>
<abbr title="2016-12-15 00:07:46">12h ago</abbr>
</td>
<td class="small">
<span class="text-muted">—</span>
</td>
<td>
<img src="/flags/png/cn.png" alt="China (CN)" title="China (CN)" onerror="this.style.display='none'"> <abbr title="China">CN</abbr>
</td>
<td class="small">
Beijing
</td>
<td class="small">
Beijing
</td>
<td class="small">
China Unicom IP network
</td>
<td class="small">
<span class="text-muted">—</span>
</td>
</tr>
</tbody>
The output does not come out as I expected.
Your sed code does not work because <td.*>\(.*\)</td> matches the part of the pattern space from the first <td up to the last </td>, due to the greediness of the * quantifier. Unfortunately, sed doesn't support a more modern regex flavor with non-greedy quantifiers; thus, some other tool would be more appropriate.
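To make the greediness concrete, here is a small sketch on a hypothetical one-line input (the variable names `line`, `greedy` and `bounded` are made up for illustration). The first command uses the pattern from the question; the second swaps in negated bracket expressions, a common workaround in sed that only holds up when a cell contains no nested tags.

```shell
# Hypothetical one-line input with two cells.
line='<td>a</td><td>b</td>'

# Pattern from the question: the first .* runs to the last possible '>',
# so everything before the final cell's content is swallowed.
greedy=$(printf '%s\n' "$line" | sed 's|<td.*>\(.*\)</td>|\1|')
printf '%s\n' "$greedy"     # prints: b

# Workaround sketch: [^>]* and [^<]* cannot cross a tag boundary,
# so each cell is matched on its own.
bounded=$(printf '%s\n' "$line" | sed 's|<td[^>]*>\([^<]*\)</td>|\1;|g')
printf '%s\n' "$bounded"    # prints: a;b;
```

With cells that contain markup of their own, as in the actual data above, `[^<]*` stops at the first inner tag, which is one reason another tool may be the better choice.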
"I wanted to remove all <td> and </td> tags and get all the strings delimited by some string …"
If those tags are always (as in your examples) on a separate line, we can make do with a simple sed command:
sed '/<\/*td.*>/d'
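As a quick check, the command can be run on a shortened, hypothetical version of the sample (the `sample` and `stripped` variables are only for illustration). It deletes every line that contains an opening or closing td tag, with or without attributes, and leaves all other lines alone.

```shell
# Shortened, hypothetical sample: every td tag sits on its own line.
sample='<tbody>
<tr>
<td>
String1
</td>
<td class="small">
String2
</td>
</tr>
</tbody>'

# Delete each line that contains <td ...> or </td>.
stripped=$(printf '%s\n' "$sample" | sed '/<\/*td.*>/d')
printf '%s\n' "$stripped"
```

This should leave the tbody and tr lines plus String1 and String2, one per line. If a tag ever shares a line with its content, that whole line is deleted as well, hence the condition stated above.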
All the strings are thereafter delimited by a string consisting of \n followed by spaces.
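If a delimiter such as ; is preferred, as in the desired output, one possible sketch is shown below. It is a variant, not the command above: it reads the whole input into the pattern space as the question's code does, then rewrites the tag boundaries. It assumes the cells contain no nested td elements, and the `sample` and `joined` variables are hypothetical.

```shell
# Hypothetical sample with indented cell contents.
sample='<tbody>
<tr>
<td>
    String1
</td>
<td>
    String2
</td>
<td>
    String3
</td>
</tr>
</tbody>'

joined=$(printf '%s\n' "$sample" | sed -e ':a' -e 'N' -e '$!ba' \
    -e 's|[[:space:]]*</td>[[:space:]]*<td[^>]*>[[:space:]]*|;|g' \
    -e 's|<td[^>]*>[[:space:]]*||g' \
    -e 's|[[:space:]]*</td>||g')
printf '%s\n' "$joined"
```

The first substitution turns each boundary between two adjacent cells into a single ; and the other two drop the remaining opening and closing tags together with the whitespace around them. On this sample the result should be the tbody and tr lines around one line reading String1;String2;String3.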