Grep the title from HTML source code

Question

Below is an example from one of my webpages; I have many webpages with the same format. My task is to extract the title information. For the example source code below, I need to extract the title, which is CA-A CANCER JOURNAL FOR CLINICIANS. There are two locations where I can find this title:

<span class="pageHeaderName">CA-A CANCER JOURNAL FOR CLINICIANS</span></h3>

<td valign="top">CA-A CANCER JOURNAL FOR CLINICIANS&nbsp;

I am going to use grep to locate this title and store it in a variable ($i, for instance).

I tried the following, but it didn't work:

jtitle=$(grep "<span class="pageHeaderName">" $i | head -n 1 | cut -d'>' -f4- | cut -d'<' -f1

Answer 1

Your question is not clear about how or where you intend to get the title string from. The commands below extract the title from a given HTML file (<file.html> is a placeholder for the path to your file).

jtitle=$(sed -n 's/.*<title>\(.*\)<\/title>.*/\1/ip;T;q' <file.html>)

jtitle=$(awk 'BEGIN{IGNORECASE=1;FS="<title>|</title>";RS=EOF} {print $2}' <file.html>)

EDIT: Updated as per the new pattern in the question.

jtitle=$(sed -n 's/.*<span class="pageHeaderName">\(.*\)<\/span>.*/\1/ip;T;q' <file.html>)
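Since the question mentions many pages in the same format, here is a minimal sketch of how the span-based pattern could be applied per file and stored in a variable. The helper name extract_title and the temporary sample page are assumptions made for illustration, not part of the original answer. This variant sticks to basic sed syntax, so unlike the command above it matches case-sensitively and relies on head rather than T;q to keep only the first hit.

```shell
#!/bin/sh
# Hypothetical helper: print the text inside the first
# <span class="pageHeaderName"> element of the file given as $1.
extract_title() {
    sed -n 's/.*<span class="pageHeaderName">\([^<]*\)<\/span>.*/\1/p' "$1" | head -n 1
}

# Build a small sample page so the sketch runs on its own.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<html><body>
<h3><span class="pageHeaderName">CA-A CANCER JOURNAL FOR CLINICIANS</span></h3>
<table><tr><td valign="top">CA-A CANCER JOURNAL FOR CLINICIANS&nbsp;</td></tr></table>
</body></html>
EOF

# In real use, loop over your own pages instead, e.g. for i in *.html
for i in "$sample"; do
    jtitle=$(extract_title "$i")
    printf '%s\n' "$jtitle"
done

rm -f "$sample"
```

Quoting "$i" keeps file names with spaces intact, and [^<]* stops the capture at the closing tag even if more markup follows on the same line.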

Answer 1 was accepted (2015-04-14 08:10:04).
