Extract text with sed

I have this text file (it's really part of an HTML page):

<tr>
              <td width="10%" valign="top"><P>Name:</P></td>
              <td colspan="2"><P>
                XXXXX
              </P></td>
            </tr>
            <tr>
              <td width="10%" valign="top"><p>City:</p></td>
              <td colspan="2"><p>
                Mycity
              </p></td>
            </tr>
            <tr>
              <td width="10%" valign="top"><p>County:</p></td>
              <td colspan="2"><p>
                YYYYYY
              </p></td>
            </tr>
            <tr>
              <td width="10%" valign="top"><p>Map:</p></td>
              <td colspan="2"><p>
                ZZZZZZZZ

I've used this sed command to extract "Mycity":

$ tr -d '\n' < file.html | sed -n 's/.*City:<\/p><\/td>.*<p>\(.*\)<\/p><\/td>.*/\1/p'

As far as I know the regular expression works, but I get

Map:

instead of Mycity.

I've tested the regex with Rubular and it works there, but not with sed. Is sed not the right tool? What am I doing wrong?

PS: I'm using Linux

The problem that you have right now is that regex quantifiers are greedy by default:

's/.*City:<\/p><\/td>.*<p>\(.*\)<\/p><\/td>.*/\1/p'
                     ^ // here!

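So it's matching everything up to the last section. In engines that support lazy quantifiers (Perl, or Ruby, which is what Rubular uses) you make it non-greedy with a ?, as shown below. Note that sed's POSIX regular expressions have no lazy quantifiers, so in sed itself the ? will not make the match lazy: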

's/.*City:<\/p><\/td>.*?<p>\(.*\)<\/p><\/td>.*/\1/p'
                       ^
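
Because sed's POSIX regular expressions do not implement lazy quantifiers, a common workaround is to replace .* with a negated bracket expression such as [^<]*, which cannot run past the next tag. The following is only a sketch under assumptions: the sample file is a trimmed copy of the table from the question, the variable names tmp, greedy and city are made up for illustration, and the pattern assumes the value cell directly follows the label cell and contains no nested tags.

```shell
#!/bin/sh
# Sample input: a trimmed, hypothetical copy of the table from the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<tr>
  <td width="10%" valign="top"><p>City:</p></td>
  <td colspan="2"><p>
    Mycity
  </p></td>
</tr>
<tr>
  <td width="10%" valign="top"><p>Map:</p></td>
  <td colspan="2"><p>
    ZZZZZZZZ
EOF

# Original pattern: the second .* is greedy, so it skips ahead to the
# last <p>...</p></td> on the joined line.
greedy=$(tr -d '\n' < "$tmp" |
  sed -n 's/.*City:<\/p><\/td>.*<p>\(.*\)<\/p><\/td>.*/\1/p')

# Workaround: [^<]* and [^>]* cannot cross a tag boundary, so the match
# stops at the first <p>...</p> after the City cell. The " *" on both
# sides of the group trims the surrounding spaces.
city=$(tr -d '\n' < "$tmp" |
  sed -n 's/.*City:<\/p><\/td>[^<]*<td[^>]*><p> *\([^<]*[^< ]\) *<\/p>.*/\1/p')

echo "greedy pattern:  $greedy"
echo "bounded pattern: $city"
rm -f "$tmp"
```

On this sample the greedy pattern should print Map: and the bounded one Mycity. The bounded pattern is tied to this exact cell layout, so treat it as an illustration of the idea rather than a general HTML extractor.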

sed is always the wrong tool for anything that involves processing multiple lines. Just use awk; it's what it was invented to do:

$ awk 'c&&!--c; /City:/{c=2}' file.html
                Mycity

See Printing with sed or awk a line following a matching pattern
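
The one-liner above relies on a countdown: matching City: sets c to 2, each later line decrements it, and the line on which it reaches zero is printed. A more spelled-out sketch of the same idiom follows. The label variable, the whitespace trimming, and the early exit are additions for illustration and are not part of the original answer; the sample file is again a trimmed, hypothetical copy of the question's table.

```shell
#!/bin/sh
# Sample input: a trimmed, hypothetical copy of the table from the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<tr>
  <td width="10%" valign="top"><p>City:</p></td>
  <td colspan="2"><p>
    Mycity
  </p></td>
</tr>
<tr>
  <td width="10%" valign="top"><p>County:</p></td>
  <td colspan="2"><p>
    YYYYYY
  </p></td>
</tr>
EOF

# c holds the number of lines left until the wanted one.
value=$(awk -v label='City:' '
  c && !--c {                      # countdown is active and just hit zero
    gsub(/^[ \t]+|[ \t]+$/, "")    # trim leading and trailing whitespace
    print
    exit                           # stop after the first match
  }
  index($0, label) { c = 2 }       # the value sits two lines below the label
' "$tmp")

echo "$value"
rm -f "$tmp"
```

Passing a different label (for example County:) selects another row, as long as the value still sits exactly two lines below its label.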
