Grep a title from inside HTML source code

Question

Below is an example from one of my webpages; I have many webpages in the same format. My task is to extract the title. For the example source code below, I need to extract the title, which is CA-A CANCER JOURNAL FOR CLINICIANS. The title appears in two places:

<span class="pageHeaderName">CA-A CANCER JOURNAL FOR CLINICIANS</span></h3>

<td valign="top">CA-A CANCER JOURNAL FOR CLINICIANS&nbsp;

I want to use grep to locate this title and store it in a variable ($i, for instance).

I tried this, but it didn't work:

jtitle=$(grep "<span class="pageHeaderName">" $i | head -n 1 | cut -d'>' -f4- | cut -d'<' -f1

Answer 1

Your question is not clear about how or where you intend to get the title string from. The commands below extract the title from a given HTML file (file.html is a placeholder; the sed flags and IGNORECASE need GNU sed and GNU awk).

jtitle=$(sed -n 's/.*<title>\(.*\)<\/title>.*/\1/ip;T;q' file.html)

jtitle=$(awk 'BEGIN{IGNORECASE=1;FS="<title>|</title>";RS="^$"} {print $2}' file.html)

EDIT: Updated for the new pattern in the question

jtitle=$(sed -n 's/.*<span class="pageHeaderName">\(.*\)<\/span>.*/\1/ip;T;q' file.html)
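
If you would rather stay with grep, as in the question, here is a minimal sketch. It is not tested against your real pages: the sample file is a hypothetical one built from the two lines you quoted, and it assumes the title contains no nested tags. The main difference from your attempt is that the pattern is in single quotes, so the double quotes around pageHeaderName reach grep literally.

```shell
# Hypothetical sample file, built from the two lines quoted in the question.
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
<h3><span class="pageHeaderName">CA-A CANCER JOURNAL FOR CLINICIANS</span></h3>
<td valign="top">CA-A CANCER JOURNAL FOR CLINICIANS&nbsp;
EOF

# Single quotes keep the inner double quotes as part of the pattern.
# grep -o prints only the matched text; -m 1 stops after the first matching line.
# sed then strips everything up to and including the first '>' (the opening tag).
jtitle=$(grep -o -m 1 '<span class="pageHeaderName">[^<]*' "$tmpfile" | sed 's/^[^>]*>//')

printf '%s\n' "$jtitle"
rm -f "$tmpfile"
```

In a loop over many pages, replace "$tmpfile" with the loop variable that holds the file name (quoted, in case a name contains spaces).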

solution1
0 ACCEPTED 2015-04-14 08:10:04
