
How to match and replace across multiple lines of an HTML file with sed

I have a text file that looks something like this.

<tbody>
            <tr>
                <td>
                    String1
                </td>
                <td>
                    String2
                </td>
                <td>
                    String3
                </td>
                    ...
                    ...
                <td>
                    StringN
                </td>
            </tr>
</tbody>

This is the output that I want.

<tbody>
            <tr>
                    String1;String2;String3;... ...;StringN
            </tr>
</tbody>

Here is my BUGGY code.

sed '{
:a
N
$!ba
s|<td.*>\(.*\)</td>|\1|
}'

I want to remove all <td> and </td> tags and get all the strings separated by some delimiter string (I can split on that delimiter later). I used the solution given in this URL, but the output is not what I expected.

This is the actual code:

<tbody>
            <tr>
                <td>
                    <a href="/120.52.72.58/80">120.52.72.58:80</a>
                </td>
                <td>
                    HTTP
                </td>
                <td>
                    <span class="text-danger">Transparent</span>
                </td>
                <td>
                    <abbr title="2016-12-15 00:07:46">12h ago</abbr>
                </td>
                <td class="small">
                    <span class="text-muted">&mdash;</span>
                </td>
                <td>
                    <img src="/flags/png/cn.png" alt="China (CN)" title="China (CN)" onerror="this.style.display='none'"> <abbr title="China">CN</abbr>
                </td>
                    <td class="small">
                            Beijing
                    </td>
                    <td class="small">
                            Beijing
                    </td>
                    <td class="small">
                            China Unicom IP network
                    </td>
                <td class="small">
                        <span class="text-muted">&mdash;</span>
                </td>
            </tr>
</tbody>


Your sed code does not work because <td.*>\(.*\)</td> matches the part of the pattern space from the first <td up to the last </td>, due to the greediness of the * quantifier. Unfortunately, sed does not support a more modern regex flavor with non-greedy quantifiers, so some other tool would be more appropriate.
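To make the greediness concrete, here is a small sketch on a made-up one-line input (the variable names and the sample line are illustrative, not from the question). It also shows a common workaround that is not part of the original answer: replacing .* with negated bracket expressions such as [^>]* and [^<]*. That only helps when a cell contains no nested tags, so it would not handle the real table above as-is.

```shell
# Hypothetical one-line sample with two cells.
line='<td>A</td><td>B</td>'

# Greedy: the first .* swallows everything up to the last '>' that still
# lets the rest of the pattern match, so only the last cell is captured.
greedy=$(printf '%s\n' "$line" | sed 's|<td.*>\(.*\)</td>|\1|')

# Bounded: [^>]* cannot run past the end of the opening tag and [^<]*
# cannot run past the start of the closing tag, so each cell matches
# on its own. A ';' is appended after every cell.
bounded=$(printf '%s\n' "$line" | sed 's|<td[^>]*>\([^<]*\)</td>|\1;|g')

printf 'greedy:  %s\n' "$greedy"
printf 'bounded: %s\n' "$bounded"
```

The trailing ; left by the bounded version can be trimmed with an extra s/;$// if it matters.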

I wanted to remove all <td> and </td> tags and get all the strings delimited by some string …

If those tags are always on a separate line (as in your examples), a simple sed command is enough:

sed '/<\/*td.*>/d'

After that, all the strings are delimited by \n followed by spaces.
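If a single delimiter such as ; is preferred over newline plus spaces, one possible follow-up is sketched below. It rests on the same assumption (every <td> and </td> tag sits on its own line) and uses paste, which is not mentioned in the question. The sample input and the variable names are made up for illustration. Note that it prints only the joined cell contents, not the surrounding <tbody> and <tr> lines.

```shell
# Hypothetical sample in the same shape as the question's simplified input.
html='<tbody>
            <tr>
                <td>
                    String1
                </td>
                <td>
                    String2
                </td>
                <td>
                    String3
                </td>
            </tr>
</tbody>'

# 1. Print only the lines between a <td ...> line and its </td> line,
#    skipping the tag lines themselves.
# 2. Strip leading and trailing whitespace from each remaining line.
# 3. Join all lines into one, separated by ';'.
cells=$(printf '%s\n' "$html" |
  sed -n '/<td/,/<\/td>/{
    /<\/*td/!p
  }' |
  sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
  paste -s -d ';' -)

printf '%s\n' "$cells"
```

On the real table, cells that contain markup (for example the <a> or <span> elements) would keep that markup in the joined line, so a further cleanup step would still be needed there.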
