Regexp for finding href in <a> open-uri ruby
I need to find the distance between two websites using Ruby's open-uri. I am using:
def check(url)
  site = open(url.base_url)
  link = %r{^<([a])([^"]+)*([^>]+)*(?:>(.*)<\/\1>|\s+\/>)$}
  site.each_line { |line| puts $&, $1, $2, $3, $4 if line =~ link }
  p url.links
end
Finding the links does not work properly. Any ideas?
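If you want to find the href parameter of <a> tags, use the right tool, which usually is not a regular expression. More likely you should use an HTML/XML parser.
Nokogiri is the parser of choice for Ruby: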
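require 'nokogiri'
require 'open-uri'
# On Ruby 3.0 and later, Kernel#open no longer opens URLs, so call URI.open.
doc = Nokogiri::HTML(URI.open('http://www.example.org/index.html'))
# Collect the href attribute of every <a> element and pretty-print the list.
pp doc.search('a').map { |a| a['href'] }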
# => [
# => "/",
# => "/domains/",
# => "/numbers/",
# => "/protocols/",
# => "/about/",
# => "/go/rfc2606",
# => "/about/",
# => "/about/presentations/",
# => "/about/performance/",
# => "/reports/",
# => "/domains/",
# => "/domains/root/",
# => "/domains/int/",
# => "/domains/arpa/",
# => "/domains/idn-tables/",
# => "/protocols/",
# => "/numbers/",
# => "/abuse/",
# => "http://www.icann.org/",
# => "mailto:iana@iana.org?subject=General%20website%20feedback"
# => ]
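Most of the hrefs in that output are relative, so they have to be resolved against the page's URL before they can be fetched in turn. A minimal sketch using only the standard library's URI.join, assuming the base URL from the example above and three of the hrefs it returned:

```ruby
require 'uri'

# Base page the links were scraped from (the URL used in the example above).
base = 'http://www.example.org/index.html'

# A few of the href values from the output above.
hrefs = ['/', '/domains/root/', 'http://www.icann.org/']

# URI.join resolves each reference against the base;
# hrefs that are already absolute pass through unchanged.
absolute = hrefs.map { |href| URI.join(base, href).to_s }
puts absolute
```

The variable names here are illustrative only; in the Nokogiri snippet the same call would go inside the map block.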
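I see several problems with this regular expression:
In an empty tag, there does not have to be a space before the trailing slash, but your regex requires one
Your regex is very verbose and redundant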
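To make these problems concrete, here is a small check of the pattern from the question against a few sample lines. The sample lines are made up for this illustration; the pattern itself is copied unchanged.

```ruby
# The pattern from the question, unchanged.
original = %r{^<([a])([^"]+)*([^>]+)*(?:>(.*)<\/\1>|\s+\/>)$}

# Hypothetical sample lines, invented for this check.
samples = {
  'anchor alone on its line'      => '<a href="/about/">About</a>',
  'anchor nested in a list item'  => '  <li><a href="/about/">About</a></li>',
  'empty tag, space before slash' => '<a href="/about/" />',
  'empty tag, no space'           => '<a href="/about/"/>'
}

# Record whether the pattern matches each sample line.
results = samples.transform_values { |line| original.match?(line) }
results.each { |label, matched| puts "#{label}: #{matched ? 'match' : 'no match'}" }
```

Only the first and third samples match. Because of the ^ and $ anchors, the pattern needs the tag to fill the whole line, which is rarely the case in real markup, and the \s+ before the slash rejects the empty tag written without a space.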
Try the following instead; it extracts the URL from an <a> tag:
link = /<a \s     # Start of tag
        [^>]*     # Some whitespace, other attributes, ...
        href="    # Start of URL
        ([^"]*)   # The URL, everything up to the closing quote
        "         # The closing quotes
       /x         # We stop here, as regular expressions wouldn't be able to
                  # correctly match nested tags anyway
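As a rough illustration of how this pattern could be applied, the sketch below runs it over a whole document with String#scan instead of going line by line. The HTML snippet is a made-up sample, and the last entry shows one limitation: an href in single quotes is not picked up.

```ruby
# The regex suggested above; with the x flag, unescaped whitespace
# and trailing comments inside the pattern are ignored.
link = /<a \s     # Start of tag
        [^>]*     # Some whitespace, other attributes, ...
        href="    # Start of URL
        ([^"]*)   # The URL, everything up to the closing quote
        "         # The closing quote
       /x

# Hypothetical sample markup, invented for this example.
html = <<~HTML
  <ul>
    <li><a href="/domains/">Domains</a></li>
    <li><a class="nav" href="/numbers/">Numbers</a> <a href="/about/">About</a></li>
    <li><a href='/single-quoted/'>Missed by this pattern</a></li>
  </ul>
HTML

# scan returns one array per match (one element per capture group),
# so flatten turns the result into a plain list of URLs.
urls = html.scan(link).flatten
puts urls
```

Scanning the full string means indented tags and several links on one line are all found, but the pattern still only understands double-quoted attributes, which is one reason a parser such as Nokogiri is the safer choice.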