Regex for finding href in <a> with Ruby open-uri

Question

I need to find the distance between two websites using Ruby's open-uri. I am using the following code:

def check(url)
    site = open(url.base_url)
    link = %r{^<([a])([^"]+)*([^>]+)*(?:>(.*)<\/\1>|\s+\/>)$}
    site.each_line {|line| puts $&,$1,$2,$3,$4 if (line=~link)}
    p url.links
end

It does not find the links correctly. Any ideas?

Answer 1

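If you want to find the href attribute of a tags, use the right tool, which is usually not a regular expression. More likely you should use an HTML/XML parser.

Nokogiri is the parser of choice for Ruby: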

require 'nokogiri'
require 'open-uri'

# URI.open is needed on Ruby 3.0+, where plain open() no longer accepts URLs
doc = Nokogiri.HTML(URI.open('http://www.example.org/index.html'))

# Collect the href attribute of every <a> element
pp doc.search('a').map{ |a| a['href'] }
# => [
# =>  "/",
# =>  "/domains/",
# =>  "/numbers/",
# =>  "/protocols/",
# =>  "/about/",
# =>  "/go/rfc2606",
# =>  "/about/",
# =>  "/about/presentations/",
# =>  "/about/performance/",
# =>  "/reports/",
# =>  "/domains/",
# =>  "/domains/root/",
# =>  "/domains/int/",
# =>  "/domains/arpa/",
# =>  "/domains/idn-tables/",
# =>  "/protocols/",
# =>  "/numbers/",
# =>  "/abuse/",
# =>  "http://www.icann.org/",
# =>  "mailto:iana@iana.org?subject=General%20website%20feedback"
# => ]

Answer 2

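I see several problems with this regular expression:

In an empty tag, there does not have to be a space before the trailing slash, but your regex requires one
Your regex is very verbose and redundant

Try the following instead, which extracts the URL from <a> tags: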
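
The first point can be checked directly. This is a minimal sketch that runs the pattern from the question against three made-up sample lines (the variable names and the sample markup are assumptions for illustration, not taken from any real page). It also shows that the ^ and $ anchors make the pattern fail whenever the tag does not fill the whole line:

```ruby
# The pattern from the question, unchanged
link = %r{^<([a])([^"]+)*([^>]+)*(?:>(.*)<\/\1>|\s+\/>)$}

# Hypothetical sample lines, invented for this check
with_space    = '<a href="/x" />'            # space before the closing slash
without_space = '<a href="/x"/>'             # no space before the closing slash
mid_line      = '<p><a href="/x">X</a></p>'  # tag is not at the start of the line

# true where the pattern matches, false where it does not
results = [with_space, without_space, mid_line].map { |line| !!(line =~ link) }
p results
```

Only the first line matches; the other two are valid markup that the pattern silently skips.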

link = /<a \s   # Start of tag
    [^>]*       # Some whitespace, other attributes, ...
    href="      # Start of URL
    ([^"]*)     # The URL, everything up to the closing quote
    "           # The closing quotes
    /x          # We stop here, as regular expressions wouldn't be able to
                # correctly match nested tags anyway
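
As a rough usage sketch, the same pattern (written on one line here) can be applied with String#scan to collect every captured URL. The constant name LINK and the sample markup are assumptions made for illustration; in the original setting the string would come from the fetched page instead.

```ruby
# Same pattern as the extended-mode version, written on one line
# (LINK is an illustrative name)
LINK = /<a\s[^>]*href="([^"]*)"/

# Hypothetical sample markup, used in place of a fetched page
html = <<~HTML
  <p><a href="/domains/">Domains</a> and <a class="ext" href="http://www.icann.org/">ICANN</a></p>
  <a name="top">no href here</a>
HTML

# scan returns one array of capture groups per match; flatten gives a flat list of URLs
hrefs = html.scan(LINK).flatten
puts hrefs
```

Note that this pattern only handles double-quoted href values; single-quoted or unquoted attributes are missed, which is one more reason a parser is usually the safer choice.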

2 answers

Answer 1
3 2012-11-13 00:18:49

Answer 2
1 2012-11-12 23:24:30
