
Grep a title that is inside HTML source code

Below is an example taken from one of my web pages; I have many pages in the same format. My task is to extract the title. For the sample source below, the title I need is CA-A CANCER JOURNAL FOR CLINICIANS. It appears in two places:

<span class="pageHeaderName">CA-A CANCER JOURNAL FOR CLINICIANS</span></h3>

<td valign="top">CA-A CANCER JOURNAL FOR CLINICIANS&nbsp;

I want to use grep to locate this title and store it in a variable ($i, for instance).

I tried the following, but it did not work:

jtitle=$(grep "<span class="pageHeaderName">" $i | head -n 1 | cut -d'>' -f4- | cut -d'<' -f1
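A few things in the attempt above are likely to get in the way: the inner double quotes end the outer string early, so grep receives the pattern without its quote characters; the closing parenthesis of $( is missing; and the field number passed to cut depends on how many tags precede the span on the line. Below is a minimal sketch of one way around this, assuming $i holds the path of the page and that grep supports -o (GNU and BSD grep do). The sample file, including its opening <h3> tag, is a hypothetical stand-in for a real page.

```shell
#!/bin/sh
# Hypothetical sample file standing in for one of the pages ($i in the question).
i=$(mktemp)
cat > "$i" <<'EOF'
<h3><span class="pageHeaderName">CA-A CANCER JOURNAL FOR CLINICIANS</span></h3>
<td valign="top">CA-A CANCER JOURNAL FOR CLINICIANS&nbsp;
EOF

# Single quotes keep the inner double quotes as part of the pattern.
# grep -o prints only the matched text, so the title is always field 2
# no matter how many tags come before the span on the line.
jtitle=$(grep -o '<span class="pageHeaderName">[^<]*' "$i" | head -n 1 | cut -d'>' -f2)

echo "$jtitle"
rm -f "$i"
```

Quoting "$i" also keeps the command working when a file name contains spaces.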

It is not clear from your question how or where you intend to get the title string from. The commands below extract the title from a given HTML file (they rely on GNU sed and gawk extensions):

jtitle=$(sed -n 's/.*<title>\(.*\)<\/title>.*/\1/ip;T;q' file.html)

jtitle=$(awk 'BEGIN{IGNORECASE=1;FS="<title>|</title>";RS=EOF} {print $2}' file.html)

EDIT: Updated as per the new pattern in the question.

jtitle=$(sed -n 's/.*<span class="pageHeaderName">\(.*\)<\/span>.*/\1/ip;T;q' file.html)
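Since the question mentions many pages in the same format, the same substitution can be wrapped in a loop. The following is only a sketch under stated assumptions: the directory, the file names, the second journal name and the helper extract_title are all hypothetical, and the sed expression is a portable variant (no GNU-only T command or i flag) that uses head to keep the first match.

```shell
#!/bin/sh
# Hypothetical working directory with two sample pages in the question's format.
dir=$(mktemp -d)
printf '%s\n' '<h3><span class="pageHeaderName">CA-A CANCER JOURNAL FOR CLINICIANS</span></h3>' > "$dir/page1.html"
printf '%s\n' '<h3><span class="pageHeaderName">EXAMPLE JOURNAL</span></h3>' > "$dir/page2.html"

extract_title() {
    # Print the text of the first pageHeaderName span found in file $1.
    # [^<]* stops at the first '<', so the capture cannot run past the span.
    sed -n 's/.*<span class="pageHeaderName">\([^<]*\)<\/span>.*/\1/p' "$1" | head -n 1
}

titles=""
for f in "$dir"/*.html; do
    jtitle=$(extract_title "$f")
    echo "$(basename "$f"): $jtitle"
    # Collect the titles in one string, separated by ';', for later use.
    titles="$titles$jtitle;"
done

rm -rf "$dir"
```

Matching [^<]* instead of .* is a deliberate choice here: it avoids a greedy match swallowing later tags if a line happens to contain more than one closing </span>.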
