
Grep a title from inside HTML source code

Below is an example taken from one of my webpages; I have many pages in the same format. My task is to extract the title. For the example source below, the title I need is CA-A CANCER JOURNAL FOR CLINICIANS. It appears in two places:

<span class="pageHeaderName">CA-A CANCER JOURNAL FOR CLINICIANS</span></h3>

<td valign="top">CA-A CANCER JOURNAL FOR CLINICIANS&nbsp;

I want to use grep to locate this title and store it in a variable (in the command below, $i is the file being searched).

I tried this, but it didn't work:

jtitle=$(grep "<span class="pageHeaderName">" $i | head -n 1 | cut -d'>' -f4- | cut -d'<' -f1

Your question is not clear about where you intend to get the title string from. The commands below extract the contents of the <title> element from a given HTML file; they rely on GNU sed and gawk extensions.

jtitle=$(sed -n 's/.*<title>\(.*\)<\/title>.*/\1/ip;T;q' file.html)

jtitle=$(awk 'BEGIN{IGNORECASE=1;FS="<title>|</title>";RS="^$"} {print $2}' file.html)

EDIT: Updated for the new pattern in the question:

jtitle=$(sed -n 's/.*<span class="pageHeaderName">\(.*\)<\/span>.*/\1/ip;T;q' file.html)
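
The grep/cut attempt from the question can probably be made to work too. It most likely failed for three reasons: the inner double quotes end the outer string, so grep never sees the quotes around pageHeaderName; the closing parenthesis of $( ... ) is missing; and on the sample line shown, -f4- selects an empty field. Below is a minimal sketch that avoids these problems. It assumes a grep that supports -o (GNU and BSD grep do), and extract_title is a hypothetical helper name, not something from the question. The loop over *.html is also an assumption about how the pages are stored.

```shell
#!/bin/sh
# Hypothetical helper: print the first pageHeaderName title found in file $1.
extract_title() {
    # Single quotes keep the double quotes inside the pattern literal.
    # -o prints only the matched text, i.e. the opening tag plus everything
    # up to (but not including) the next '<', so whatever precedes the span
    # on the line does not shift the cut field numbers.
    grep -o '<span class="pageHeaderName">[^<]*' "$1" |
        head -n 1 |
        cut -d'>' -f2-
}

# Assumed layout: the pages are *.html files in the current directory.
for i in *.html; do
    [ -e "$i" ] || continue          # skip the literal pattern if nothing matches
    jtitle=$(extract_title "$i")
    printf '%s\t%s\n' "$i" "$jtitle"
done
```

If a page has no pageHeaderName span, jtitle ends up empty, so a caller could test for that and fall back to the second location (the <td valign="top"> cell).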


 