
Using awk sed or grep to parse URLs from webpage source

I am trying to parse the source of a downloaded web page in order to obtain the link listing. A one-liner would work fine. Here's what I've tried thus far:

This seems to leave out parts of the URL from some of the page names.

$ grep -Po '\b(([\w-]+://?|domain[.]org)[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))' file.html | sort -ut/ -k3

This gets all of the URLs, but I do not want to include links that contain or consist of anchors. I also want to be able to restrict matches to domain.org/folder/:

$ awk 'BEGIN{
  RS="</a>"      # gawk-specific: each record ends at a closing anchor tag
  IGNORECASE=1   # gawk-specific: match href case-insensitively
}
{
  for(o=1;o<=NF;o++){
    if ($o ~ /href/){
      gsub(/.*href=\042/,"",$o)   # strip up to and including the opening quote (\042 is the " character)
      gsub(/\042.*/,"",$o)        # strip the closing quote and everything after it
      print $o
    }
  }
}' file.html
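Both requirements can be layered on as post-filters. A hedged sketch, where '...' stands for the awk program above and domain.org/folder/ is the placeholder folder from the question: drop any URL containing a # fragment, keep only links under the target folder, then de-duplicate:

$ awk '...' file.html | grep -v '#' | grep -F 'domain.org/folder/' | sort -u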

If you are only parsing something like <a> tags, you could just match the href attribute like this:

$ grep -oE 'href="([^"#]+)"' file.html | cut -d'"' -f2 | sort -u

That will ignore anchors and also guarantee that the results are unique. This does assume that the page contains well-formed (X)HTML, but you could pass it through Tidy first.
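Note that [^"#]+ rejects any href containing a #, so a link such as page.html#top is dropped entirely rather than trimmed. If you would rather keep such links and only strip the fragment, a small variation along the same lines (a sketch, not the answerer's exact command) is:

$ grep -oE 'href="[^"]+"' file.html | cut -d'"' -f2 | sed -e 's/#.*//' -e '/^$/d' | sort -u

The second sed expression removes the empty lines left behind by pure in-page anchors like href="#top".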

lynx -dump http://www.ibm.com

And look for the string 'References' in the output. Post-process with sed if you need to.
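For example, assuming a lynx build that supports the -listonly switch, the References section can be reduced to bare URLs without any hand-written sed (the awk pattern matches lynx's numbered "N. URL" output lines):

$ lynx -dump -listonly http://www.ibm.com | awk '/^ *[0-9]+\./ {print $2}' | sort -u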

Using a different tool sometimes makes the job simpler. Once in a while, a different tool makes the job dead simple. This is one of those times.
