
Nokogiri Ruby HTML Parser

I'm running into problems scraping across multiple pages with Nokogiri. I need to narrow down the results based on the qualifying hrefs first, and the script below gets all of the hrefs I'm interested in. However, I'm having trouble extracting the article titles so that I can link to them. Ideally, whenever I find a link I want, I could also grab the text describing the article, as in

<a href.......>Text Linked to</a>

so that I end up with a hash like {:source => ".....", :url => ".....", :title => "....."}. Here is the script I have so far. It narrows the links down to the ones I want in the hash.

require 'nokogiri'
require 'open-uri'

page = "http://www.huffingtonpost.com/politics/"

# URI.open is the open-uri entry point; bare Kernel#open no longer accepts URLs on Ruby 3.0+.
doc = Nokogiri::HTML(URI.open(page))
links = doc.css('a')
# Collect each distinct, non-empty href.
hrefs = links.map { |link| link['href'].to_s }.reject(&:empty?).uniq.sort

hrefs.each do |h|
  # Skip comment anchors; keep long URLs that end in "olitics".
  next if h.end_with?("#comments")
  puts h if h.end_with?("olitics") && h.length > 65
end

Could someone explain how to find the hrefs I want first and then extract the link text for only those filtered URLs, as I have started to do here? Also, is it recommended to put Nokogiri scripts like this in Rails controllers and save the results to the database from there? I appreciate it.

Thanks

I'm not sure I understand your question completely, but I'm going to interpret it as "How do I extract links and access their attributes?"

Simply amend your selector:

links = doc.css('a[href]')

This will give you all a elements that have an href. You can then iterate over these and access their attributes.
