
Using awk sed or grep to parse URLs from webpage source

I am trying to parse the source of a downloaded web page in order to obtain the link listing. A one-liner would work fine. Here's what I've tried thus far:

This seems to leave out parts of the URL from some of the page names.

$ grep -Po '\b(([\w-]+://?|domain[.]org)[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))' file.html | sort -ut/ -k3

This gets all of the URLs, but I do not want to include links that contain or consist of anchors. I also want to be able to restrict matches to domain.org/folder/:

$ awk 'BEGIN{
  RS="</a>"      # gawk-specific: each record ends at a closing anchor tag
  IGNORECASE=1   # gawk-specific: match href case-insensitively
}
{
  for(o=1;o<=NF;o++){
    if ($o ~ /href/){
      gsub(/.*href=\042/,"",$o)   # strip up to and including the opening quote (\042 is the " character)
      gsub(/\042.*/,"",$o)        # strip the closing quote and everything after it
      print $o
    }
  }
}' file.html
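Both requirements can be layered on as post-filters. A hedged sketch, where '...' stands for the awk program above and domain.org/folder/ is the placeholder folder from the question: drop any URL containing a # fragment, keep only links under the target folder, then de-duplicate:

$ awk '...' file.html | grep -v '#' | grep -F 'domain.org/folder/' | sort -u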

If you are only parsing something like <a> tags, you could just match the href attribute like this:

$ grep -oE 'href="([^"#]+)"' file.html | cut -d'"' -f2 | sort -u

That will ignore anchors and also guarantee that the results are unique. This does assume that the page contains well-formed (X)HTML, but you could pass it through Tidy first.
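Note that [^"#]+ rejects any href containing a #, so a link such as page.html#top is dropped entirely rather than trimmed. If you would rather keep such links and only strip the fragment, a small variation along the same lines (a sketch, not the answerer's exact command) is:

$ grep -oE 'href="[^"]+"' file.html | cut -d'"' -f2 | sed -e 's/#.*//' -e '/^$/d' | sort -u

The second sed expression removes the empty lines left behind by pure in-page anchors like href="#top".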

lynx -dump http://www.ibm.com

And look for the string 'References' in the output. Post-process with sed if you need to.
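For example, assuming a lynx build that supports the -listonly switch, the References section can be reduced to bare URLs without any hand-written sed (the awk pattern matches lynx's numbered "N. URL" output lines):

$ lynx -dump -listonly http://www.ibm.com | awk '/^ *[0-9]+\./ {print $2}' | sort -u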

Using a different tool sometimes makes the job simpler. Once in a while, a different tool makes the job dead simple. This is one of those times.
