Get link and href text from an HTML doc with Nokogiri & Ruby?

I'm trying to use the nokogiri gem to extract all the URLs on a page, along with their link text, and store each link text and URL pair in a hash.

<html>
    <body>
        <a href=#foo>Foo</a>
        <a href=#bar>Bar </a>
    </body>
</html>

I would like to return:

{"Foo" => "#foo", "Bar" => "#bar"}

Here's a one-liner:

Hash[doc.xpath('//a[@href]').map {|link| [link.text.strip, link["href"]]}]

#=> {"Foo"=>"#foo", "Bar"=>"#bar"}

Split up a bit to be arguably more readable:

h = {}
doc.xpath('//a[@href]').each do |link|
  h[link.text.strip] = link['href']
end
puts h

#=> {"Foo"=>"#foo", "Bar"=>"#bar"}

Another way:

h = doc.css('a[href]').each_with_object({}) { |n, acc| acc[n.text.strip] = n['href'] }
# yields {"Foo"=>"#foo", "Bar"=>"#bar"}

And if you're worried that the same link text might point to different URLs, you can collect the hrefs in arrays:

h = doc.css('a[href]').each_with_object(Hash.new { |hash, key| hash[key] = [] }) { |n, acc| acc[n.text.strip] << n['href'] }
# yields {"Foo"=>["#foo"], "Bar"=>["#bar"]}
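
To see what the array version buys you, here is a plain-Ruby sketch with no Nokogiri involved. The `pairs` array is a hypothetical stand-in for the `[text, href]` pairs the traversal would produce on a page where "Foo" appears twice:

```ruby
# Hypothetical (text, href) pairs; "Foo" links to two different targets.
pairs = [['Foo', '#foo'], ['Bar', '#bar'], ['Foo', '#other']]

# Plain hash: the later "Foo" silently overwrites the earlier one.
flat = pairs.to_h

# Hash with a default block: a missing key is initialised to an empty
# array on first access, so every href for the same text accumulates.
grouped = pairs.each_with_object(Hash.new { |hash, key| hash[key] = [] }) do |(text, href), acc|
  acc[text] << href
end

p flat['Foo']     # the last href seen for "Foo"
p grouped['Foo']  # every href seen for "Foo", in document order
```

The trade-off is that every value becomes an array, even for link text that occurs only once.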
