
How do you use 'grep' on this line? Linux

<span class="cur_wind">with 3km/h SSW winds</span><hr class="hr_sm" /></td>

I want to extract the words "with 3km/h SSW winds" (note that this string will change, so hardcoding it won't work) from the line above using the 'grep' command. I have been trying for a long time and am completely lost. Any help would be appreciated.

Here's a GNU grep solution that uses -P to activate support for PCREs (Perl-Compatible Regular Expressions):

grep -Po '"cur_wind">\K[^<]+' \
  <<<'<span class="cur_wind">with 3km/h SSW winds</span><hr class="hr_sm" /></td>'

  • -o specifies that only the matching part of each line be output.
  • \K is a PCRE feature that drops everything matched so far; this lets you provide context for more specific matching without including that context in the match.

Another option is to use a look-behind assertion in lieu of \K:

grep -Po '(?<="cur_wind">)[^<]+' \
  <<<'<span class="cur_wind">with 3km/h SSW winds</span><hr class="hr_sm" /></td>'

Of course, this kind of matching relies on the specific formatting of the input string (whitespace, single- vs. double-quoting, ordering of attributes, and so on) and is thus fragile. That comes on top of the fundamental problem that grep does not understand the structure of the data.

Thus, in general, as others have noted, grep is the wrong tool for the job.
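To make the fragility concrete, here is a small sketch that runs the \K pattern from above against three variants of the line that an HTML parser would treat as equivalent. The helper name extract_wind and the two altered input strings are invented for illustration, and GNU grep with -P support is assumed:

```shell
# Hypothetical helper wrapping the \K pattern (assumes GNU grep built with -P).
extract_wind() {
  printf '%s\n' "$1" | grep -Po '"cur_wind">\K[^<]+'
}

original='<span class="cur_wind">with 3km/h SSW winds</span>'
# Made-up variants: same meaning to a parser, different bytes for grep.
single_quoted="<span class='cur_wind'>with 3km/h SSW winds</span>"
extra_attr='<span class="cur_wind" id="w">with 3km/h SSW winds</span>'

r_original=$(extract_wind "$original")
r_single=$(extract_wind "$single_quoted")
r_extra=$(extract_wind "$extra_attr")

printf 'original:      [%s]\n' "$r_original"
printf 'single quotes: [%s]\n' "$r_single"
printf 'extra attr:    [%s]\n' "$r_extra"
```

Only the first call prints the wind text; the other two print empty brackets, because the literal "cur_wind"> no longer appears in the input.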

On OS X (macOS), assuming the input is XML (or XHTML), you can parse it robustly with the stock xmllint utility and an XPath expression (xmllint is part of libxml2, so it is typically available on Linux as well):

xmllint --xpath '//span[@class="cur_wind"]/text()' - <<<\
 '<td><span class="cur_wind">with 3km/h SSW winds</span><hr class="hr_sm" /></td>'

Here's a similar solution using a third-party utility, the multi-platform web-scraping tool xidel (which handles both HTML and XML):

xidel -q -e '//span[@class="cur_wind"]' - <<<\
 '<td><span class="cur_wind">with 3km/h SSW winds</span><hr class="hr_sm" /></td>'

Try sed:

echo '<span class="cur_wind">with 3km/h SSW winds</span><hr class="hr_sm" /></td>' | sed -e 's/<[^>]*>//g'

Output

with 3km/h SSW winds

Explanation

  • echo 'whatever' prints the word whatever to the screen (standard output, aka stdout).
  • The | symbol is a pipe. The command to its right takes the output of echo as its input and does something with it.
  • sed is a stream editor. Its -e switch tells sed to evaluate the script or expression that follows.
  • The s/xyz/abc/g format is simple: s/ means substitute and /g means globally, so it replaces all occurrences of xyz with abc.
  • s/<[^>]*>//g gets interesting. Let's focus on <[^>]*>. It matches a <, followed by any number of characters that are not >, followed by a >; in other words, one complete tag. Because the replacement is empty, every such match is deleted.
  • Take your <span class="cur_wind"> as an example. That tag starts with <, contains only characters other than >, and ends with >. When sed finds such text, it chops it off (replaces it with nothing).
  • The same technique applies to the remaining tags (</span>, the <hr> tag and </td>). What remains is the text you want.

This is a somewhat simplified explanation.
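One consequence of deleting every tag is that sed then prints whatever text is left on every input line, not just the wind text. A narrower sketch, assuming the span sits on a single line and is written exactly as in the question, captures only the span's content and prints only the lines where that worked. The sample variable and its first line are made up for illustration:

```shell
# Two sample lines; only the second contains the span we want (made-up input).
sample='<td><b>Humidity 40%</b></td>
<span class="cur_wind">with 3km/h SSW winds</span><hr class="hr_sm" /></td>'

# Strip-all-tags approach from the answer: keeps the text of both lines.
stripped=$(printf '%s\n' "$sample" | sed -e 's/<[^>]*>//g')

# Targeted approach: -n suppresses the default output, \(...\) captures the
# span's text, and the p flag prints only lines where the substitution matched.
wind=$(printf '%s\n' "$sample" |
  sed -n 's/.*<span class="cur_wind">\([^<]*\)<\/span>.*/\1/p')

printf 'stripped:\n%s\n' "$stripped"
printf 'targeted:\n%s\n' "$wind"
```

The targeted form is still a regex over markup, so it shares the formatting assumptions discussed in the grep answer; it only avoids picking up unrelated text.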

grep doesn't know XML, and thus is the wrong tool for the job; use a real XML parser. One of the better ones easily accessible from bash is XMLStarlet.

xmlstarlet sel -t -m "//span[@class='cur_wind']/text()" -v . -n <input.xml

This extracts all text directly contained within a span of the class cur_wind.

If that is all you want, then cat | grep ". with 3km/h SSW winds. " should do it, but I suspect you need more than that.
