Below is an example of one of my web pages; I have many pages in the same format. My task is to extract the title. For the example source below, I need to extract the title, which is CA-A CANCER JOURNAL FOR CLINICIANS. The title appears in two places:
<span class="pageHeaderName">CA-A CANCER JOURNAL FOR CLINICIANS</span></h3>
<td valign="top">CA-A CANCER JOURNAL FOR CLINICIANS
I am going to use grep to locate this title and store it in a variable. I tried the following, where $i is the page being processed, but it didn't work:
jtitle=$(grep "<span class="pageHeaderName">" $i | head -n 1 | cut -d'>' -f4- | cut -d'<' -f1
Your question is not clear about how or where you intend to get the title string from. The commands below extract the contents of the <title> element from a given HTML file.
jtitle=$(sed -n 's/.*<title>\(.*\)<\/title>.*/\1/ip;T;q' file.html)
jtitle=$(awk 'BEGIN{IGNORECASE=1;FS="<title>|</title>";RS="^$"} {print $2}' file.html)
EDIT: Updated for the new pattern in the question.
jtitle=$(sed -n 's/.*<span class="pageHeaderName">\(.*\)<\/span>.*/\1/ip;T;q' file.html)
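For comparison, here is a minimal sketch that stays close to the grep-based attempt in the question. It assumes the page contains the span line exactly as quoted; the temporary file and its contents are a hypothetical stand-in for a real page. The attempt in the question nests double quotes inside a double-quoted pattern and never closes the $( ... ), so the sketch uses single quotes around the pattern and lets sed pick out the text between the tags.

```shell
# Hypothetical sample page, built from the two lines quoted in the question.
i=$(mktemp)
cat > "$i" <<'EOF'
<span class="pageHeaderName">CA-A CANCER JOURNAL FOR CLINICIANS</span></h3>
<td valign="top">CA-A CANCER JOURNAL FOR CLINICIANS
EOF

# Single quotes keep the inner double quotes literal.
# grep selects lines containing the span, head keeps the first one,
# and sed captures everything after the opening tag up to the next "<".
jtitle=$(grep 'class="pageHeaderName"' "$i" | head -n 1 |
         sed 's/.*<span class="pageHeaderName">\([^<]*\)<.*/\1/')

echo "$jtitle"
rm -f "$i"
```

Using [^<]* instead of .* keeps the capture from running past the closing tag if more markup follows on the same line. To process many pages, the same pipeline can go inside a loop such as for i in *.html; do ...; done.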