Unable to scrape data using Nokogiri in Ruby

I am currently trying to scrape data from a webpage using Nokogiri. I want to scrape data for the list of service centers from the link http://www.cardekho.com/Maruti/Noida/car-service-center.htm

The code I have written for this is:

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open("http://www.cardekho.com/Maruti/Noida/car-service-center.htm"))

doc.css('.delrname').each do |node|
    puts node.text
end

I have tried a bunch of combinations of CSS selectors, but none of them gives the desired result. Can anybody suggest the selector that will correctly scrape the list of service centers from this link?

Thanks in advance

PS: The same code (with the appropriate CSS selector) works as expected when I test it on other websites, but it does not work on this one.

Your code seems to work. I removed the whitespace in the URL:

doc = Nokogiri::HTML(open("http://www.cardekho.com/Maruti/Noida/car-service-center.htm"))

Then I ran it, and this is the output:

$ ruby file.rb
Fast Track Auto Care India
Jkm Motors
Mangalam Motors
Motorcraft India
Motorcraft India
Rohan Motors
Rohan Motors
Rohan Motors
Vipul Motors
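
The whitespace fix mentioned above can be sketched as a small cleanup step that runs before the URL is fetched. This is only an illustration: the helper names `clean_url` and `valid_http_url?` are hypothetical, and the raw string with stray spaces is an assumed example of what the pasted URL might have looked like.

```ruby
require 'uri'

# Hypothetical helper: remove every whitespace character from a URL string.
def clean_url(raw)
  raw.gsub(/\s+/, '')
end

# Hypothetical helper: true when the string parses as an HTTP or HTTPS URL.
def valid_http_url?(str)
  URI.parse(str).is_a?(URI::HTTP)
rescue URI::InvalidURIError
  false
end

# Assumed input: the question's URL with stray spaces pasted into it.
raw = " http://www.cardekho.com/Maruti/Noida/ car-service-center.htm "
url = clean_url(raw)

puts url
puts valid_http_url?(url)
```

The cleaned `url` can then be passed to the fetching call. On Ruby 3.0 and later, open-uri no longer patches `Kernel#open`, so the fetch would be written as `URI.open(url)`.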

Optionally, you can use regular expressions to get more detailed results. For example:

/(<div class="delrname">([^<]*)<\/div><p>([^<]*)<\/p><div><div class="delermobcol "><div class="clearfix"><span class="mobico sprite"><\/span><div class="mobno">([^<]*)<\/div><\/div><div class="clear"><\/div><div class="viewsercntr"><a href="([^"]*)" title="View Car Dealers for Maruti in Noida">View Car Dealers for Maruti in Noida<\/a><\/div><\/div><div class="delermoilcol"><!----><div class="clearfix"><span class="mailico sprite"><\/span><div class="mobno"><a href="mailto:([^"]*)" target="_top">[^<]*<\/a><\/div>)/

You can break out the results like this:

# scan is a String method, so run it on the serialized HTML rather than on the Nokogiri document
arrMatches = doc.to_html.scan(/(<div class="delrname">([^<]*)<\/div><p>([^<]*)<\/p><div><div class="delermobcol "><div class="clearfix"><span class="mobico sprite"><\/span><div class="mobno">([^<]*)<\/div><\/div><div class="clear"><\/div><div class="viewsercntr"><a href="([^"]*)" title="View Car Dealers for Maruti in Noida">View Car Dealers for Maruti in Noida<\/a><\/div><\/div><div class="delermoilcol"><!----><div class="clearfix"><span class="mailico sprite"><\/span><div class="mobno"><a href="mailto:([^"]*)" target="_top">[^<]*<\/a><\/div>)/)

arrMatches.each do |dealerInfo|
  thisEntireMatch = dealerInfo[0]
  thisName = dealerInfo[1]
  thisAddress = dealerInfo[2]
  thisMobile = dealerInfo[3]
  thisLink = dealerInfo[4]
  thisEmail = dealerInfo[5]
end
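
A long positional pattern like the one above is hard to maintain, so here is a minimal sketch of the same idea using named capture groups and the extended (`x`) regex flag. The sample markup is hypothetical placeholder data that only borrows the `delrname` and `mobno` class names from the pattern above; the real page contains more nested elements, so the pattern would need to be extended before it could match live HTML.

```ruby
# Hypothetical sample markup (placeholder values, not real dealer data).
html = <<~HTML
  <div class="delrname">Example Motors A</div><p>Example address 1</p><div class="mobno">0000000000</div>
  <div class="delrname">Example Motors B</div><p>Example address 2</p><div class="mobno">1111111111</div>
HTML

# Extended mode ignores unescaped whitespace, so literal spaces are written as "\ ".
DEALER_PATTERN = %r{
  <div\ class="delrname">(?<name>[^<]*)</div>\s*
  <p>(?<address>[^<]*)</p>\s*
  <div\ class="mobno">(?<mobile>[^<]*)</div>
}x

# String#scan returns one array of captures per match;
# zipping it with the group names gives one hash per dealer.
dealers = html.scan(DEALER_PATTERN).map do |captures|
  DEALER_PATTERN.names.zip(captures).to_h
end

dealers.each do |dealer|
  puts "#{dealer['name']} | #{dealer['address']} | #{dealer['mobile']}"
end
```

Each hash is keyed by group name, so fields can be looked up as `dealer['name']` instead of by position as in `dealerInfo[1]`.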
