Extract Text between HTML tags with sed or grep

Question

I have a problem. I want to extract two parts of this HTML into variables using sed or grep. How can I extract both of them?

test.html:

<html>
 <body>
  <div id="foo" class="foo">
   Some Text.
    <p id="author" class="author">
     <br>
     <a href="example.com">bar</a>
    </p>
  </div>
 </body>
</html>

script.sh

#!/bin/bash

author=$(sed 's/.*<p id="author" class="author"><br><a href="*">\(.*\)<\/a><\/p>.*/\1/p' test.html)
quote=$(sed 's/.*<div id="foo" class="foo">\(.*\)<\/div>.*/\1/p' test.html)

In the end I want only the text in the variables, without the HTML tags. But my script doesn't work.

Answer 1

The code:

text="$(sed 's:^ *::g' < test.html | tr -d \\n)"
author=$(sed 's:.*<p id="author" class="author"><br><a href="[^"]*">\([^<]*\)<.*:\1:' <<<"$text")
quote=$(sed 's:.*<div id="foo" class="foo">\([^<]*\)<.*:\1:' <<<"$text")
echo "'$author' '$quote'"

How it works:

$text is assigned an unindented, single-line representation of test.html. Note that : is used as the sed delimiter instead of /: any character can serve as the delimiter, and since the text being parsed contains slashes, choosing : means they don't have to be escaped with backslashes in the regex.
$author is assumed to lie between <p id="author" class="author"><br><a href="[^"]*"> (where [^"]* means "any character except ", repeated zero or more times") and whatever tag comes next.
$quote is assumed to lie between <div id="foo" class="foo"> and whatever tag comes next.
The rather obscure construct <<<"$text" is a so-called here-string, which is almost equivalent to putting echo "$text" | at the beginning of the command.
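The points above can be tried on a small scale. The following is only a sketch: the html variable is a hypothetical snippet shaped like the <p> element in test.html, and the variable names are made up for illustration. It shows that sed matches nothing while the markup is spread over several lines, that the match succeeds once the lines are joined, and that the here-string and the pipe give the same result.

```shell
#!/bin/bash
# Hypothetical multi-line snippet, shaped like the <p> element in test.html.
html='<p id="author" class="author">
 <br>
 <a href="example.com">bar</a>
</p>'

# sed works on one line at a time, so a pattern that needs <br> and <a>
# on the same line matches nothing here (-n ... p prints only on a match).
multi=$(sed -n 's:.*<br><a href="[^"]*">\([^<]*\)<.*:\1:p' <<<"$html")

# Strip the indentation and join the lines, as done for $text in the answer.
text=$(sed 's:^ *::' <<<"$html" | tr -d '\n')

# The here-string and the pipe feed sed the same input.
via_here_string=$(sed 's:.*<br><a href="[^"]*">\([^<]*\)<.*:\1:' <<<"$text")
via_pipe=$(printf '%s\n' "$text" | sed 's:.*<br><a href="[^"]*">\([^<]*\)<.*:\1:')

# With / as the delimiter the slash in </a> must be escaped; with : it need not be.
with_slash=$(sed 's/.*<a href="[^"]*">\([^<]*\)<\/a>.*/\1/' <<<"$text")
with_colon=$(sed 's:.*<a href="[^"]*">\([^<]*\)</a>.*:\1:' <<<"$text")

printf "multi='%s' here-string='%s' pipe='%s' slash='%s' colon='%s'\n" \
  "$multi" "$via_here_string" "$via_pipe" "$with_slash" "$with_colon"
```

printf is used for the pipe variant instead of echo only because echo may interpret options or escapes in some shells; for this input either would do.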

Answer 2

You can use html2text

# cat test.html | html2text
Some Text.


[bar](example.com)

I often use it together with curl:

# curl -s http://www.example.com/ | html2text

# Example Domain

This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.

[More information...](https://www.iana.org/domains/example)

#

Answer 3

You can use xmllint to parse HTML/XML text and extract the values matched by a given XPath expression.

Here is a working example:

#!/bin/bash

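# The author name is the text of the <a> element.
author=$(xmllint --html --xpath '//a/text()' test.html | sed -e 's/^ *//' -e 's/ *$//')
# The quote is the text directly inside the div; join its lines, then trim both ends.
quote=$(xmllint --html --xpath '//div[@class="foo"]/text()' test.html | tr -d '\n' | sed -e 's/^ *//' -e 's/ *$//')
echo "Author:'$author'"
echo "Quote:'$quote'"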

The XPath expression selects the node whose text is to be extracted.
tr is used to remove newline characters.
sed is used to strip unwanted spaces from the extracted text.
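To see how the tr and sed steps act on whitespace, here is a small sketch. The raw value is hypothetical, merely shaped like text pulled out of an indented HTML file, and the variable names are made up for illustration. It contrasts deleting every space with trimming only the ends.

```shell
#!/bin/bash
# Hypothetical raw value: line breaks and indentation around the text.
raw=$(printf '\n   Some Text.\n    \n  ')

# Step 1: tr deletes every newline, leaving a single line.
one_line=$(printf '%s' "$raw" | tr -d '\n')

# Step 2a: deleting every space also removes the space between the words.
no_spaces=$(printf '%s\n' "$one_line" | sed -e 's/ //g')

# Step 2b: anchored patterns remove only the leading and trailing spaces.
trimmed=$(printf '%s\n' "$one_line" | sed -e 's/^ *//' -e 's/ *$//')

printf "'%s'\n'%s'\n" "$no_spaces" "$trimmed"
```

This prints 'SomeText.' followed by 'Some Text.', so the anchored form is the one to use when the value itself contains spaces.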

Answer 4

Please don't use regex to parse HTML/XML; use a dedicated parser like xidel instead:

$ xidel -s test.html -e '//p/a,//div/normalize-space(text())'
bar
Some Text.

$ eval $(xidel test.html -se 'author:=//p/a,quote:=//div/normalize-space(text())' --output-format=bash)

$ printf '%s\n' "$author" "$quote"
bar
Some Text.

Answer details:

Answer 1: 5, accepted, 2017-07-09 10:20:11
Answer 2: 2, 2021-01-14 21:08:53
Answer 3: 0, 2017-07-09 11:45:17
Answer 4: 0, 2021-01-17 12:12:10
