Unable to scrape data from a particular webpage using Watir

I've used Watir to scrape web pages successfully before, but I'm having trouble scraping this particular page:

https://kroger.softcoin.com/programs/kroger/digital_coupons/?banner=Smiths&origin=DigitalCoupons

When I visit the page in a regular browser, I can see it reload itself two or three times on every visit, and I suspect that's where the problem comes from. I've tried using

Watir::Wait.until { @browser.div(id: "offer-105653").visible? }

but that doesn't work. The same code works on other web pages I tested, but not on the Kroger website. I'm not sure how to fix it.

def save
    require 'watir'
    require 'phantomjs'

    @browser = Watir::Browser.new :phantomjs
    @browser.goto "https://kroger.softcoin.com/programs/kroger/digital_coupons/?banner=Smiths&origin=DigitalCoupons"
    @browser.li(id: "1768173").wait_until(&:present?).text
    @products = @browser.divs
    @products.each do |x|
        Smith.create(title: x.text)
    end
end

#visible? assumes that the element already exists. If the element is not yet in the DOM, it raises an exception immediately instead of continuing to wait, so it is usually not what you want when polling for an element.

Try:

@browser.div(id: "offer-105653").wait_until(&:present?).text
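
The difference between the two predicates can be illustrated without a browser. The following is a simplified model, not Watir's actual implementation: FakeElement, MissingElementError and poll_until are hypothetical names invented for this sketch, and the element is assumed to "appear" after a fixed number of polls.

```ruby
# Hypothetical stand-ins for illustration only; these are NOT Watir classes.
class MissingElementError < StandardError; end

class FakeElement
  def initialize(appears_after)
    @appears_after = appears_after # polls that must pass before it "exists"
    @polls = 0
  end

  def exists?
    @polls += 1
    @polls > @appears_after
  end

  # Models the behaviour described above: raises while the element is missing.
  def visible?
    raise MissingElementError, "element not in DOM" unless exists?
    true
  end

  # Models present?: returns false instead of raising while it is missing.
  def present?
    visible?
  rescue MissingElementError
    false
  end
end

# Minimal polling loop: retry the block until it is truthy or attempts run out.
def poll_until(attempts: 5)
  attempts.times { return true if yield }
  false
end

late = FakeElement.new(2)
visible_outcome = begin
  poll_until { late.visible? }
rescue MissingElementError
  :raised # the very first poll aborts the wait
end

late = FakeElement.new(2)
present_outcome = poll_until { late.present? } # keeps polling until it appears

puts "visible?: #{visible_outcome.inspect}, present?: #{present_outcome.inspect}"
```

In this model the wait built on visible? fails on the first poll, while the one built on present? simply retries until the element shows up.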

What may be happening is that, behind the scenes, WebDriver or Watir is using a CSS selector to locate that element.

The thing is, ID values starting with a digit were disallowed under HTML4 but are allowed in HTML5. Even so, a CSS selector cannot match an ID that starts with a digit unless you get tricky: for it to work, you have to escape the first character.

You can see this in the developer console: navigate to that page and run a command such as $$("#\\31 755189"), and it will find the element just fine. But if you try $$("#1768173"), you will see an invalid selector error. (Note that these examples are likely to be valid only for a short period, as that is a dynamic page subject to change.)

I would recommend trying the following in your code to see whether it gets things working.

@browser.li(id: "\\31 768173").wait_until(&:present?).text

If that does work, then for it to work without manually escaping the first digit, the Watir developers may have to add special-case logic to ID selection that escapes the first character when it is a digit.
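
As a rough sketch of what such special-case logic could look like, the helper below rewrites a leading digit as its hexadecimal code point followed by a space, which is the escape form used in the examples above. The name css_escape_leading_digit is hypothetical and not part of Watir or Selenium, and the helper deliberately ignores every other character that might need escaping in a CSS identifier.

```ruby
# Hypothetical helper, not part of Watir: escape a leading ASCII digit in an ID
# so that the ID can be used inside a CSS selector.
def css_escape_leading_digit(id)
  # "1" is U+0031, so it becomes a backslash, "31", and a terminating space.
  id.sub(/\A[0-9]/) { |digit| format('\\%x ', digit.ord) }
end

puts css_escape_leading_digit("1768173")      # prints \31 768173
puts css_escape_leading_digit("offer-105653") # prints offer-105653 (unchanged)
```

The escaped string could then be prefixed with "#" to build a selector; whether Watir actually locates IDs through CSS internally is the open assumption of this answer.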

