
Why is XPath returning value of '0' using Ruby, Nokogiri and Watir?

I'm working on a white-hat web-crawler that will periodically log into my account and check some information for me using Ruby with Watir and Nokogiri.

Here's the simplified HTML I'm trying to pull information from:

 <div class="navbar navbar-default navbar-fixed-top hidden-lg hidden-md" style="z-index: 1002">
   <div class="banner-g">
     <div class="container">
       <div id="user-info">
         <div id="acct-value">
           <a href="https://www.testsite.org/Profile/MyShares" title="Change in value of your shares">GAIN/LOSS <span class="SPShares">-$12.85</span></a>
         </div>
         <div id="committed">
           <a href="https://www.testsite.org/Profile/MyShares" title="Amount paid for your shares">INVESTED <span class="SPPortfolio">$152.11</span></a>
         </div>
         <div id="avail">
           <a href="https://www.testsite.org/Profile/MyShares">AVAILABLE <span class="SPBalance">$26.98</span></a>
         </div>

I'm trying to pull the $26.98 at the bottom of the excerpt.

Here are three snippets of code I'm using. They're all pretty much identical except for the XPath. The first two return their values perfectly, but the third always returns a value of "0" even though it 'should' return "$26.98" or "26.98".

 val_one = page_html.xpath(".//*[@id='openone']/div/div[2]/div[1]/div/div[2]/table/tbody/tr[2]/td[1]").text.gsub(/\D/,'').to_i

 val_two = page_html.xpath(".//*[@id='opentwo']/div/div[2]/div[2]/div/div[2]/table/tbody/tr[2]/td[1]").text.gsub(/\D/,'').to_i

 val_three = page_html.xpath(".//*[@id='avail']/a/span").text.gsub(/\D/,'').to_i
 puts val_three

I assume it's a problem with the XPath, but I've gone through dozens of XPath troubleshooting questions here and none have worked. I checked the XPath with both FirePath and "XPath Checker". I also tried having the XPath search for the "SPBalance" class but that gave the same result.

When I remove .to_i from the end, it returns a blank line instead of a zero.

Elsewhere in the site when using Watir, I was able to fix problems recording a value by calling .focus , but for this piece of the code, which is more Nokogiri, using .focus causes the error message:

undefined method `focus' for []:Nokogiri::XML::NodeSet (NoMethodError)

I assume .focus doesn't work for Nokogiri.

Update: Replaced HTML with a cleaner/more complete version.

I've continued to play around with different ways of reaching that data cell, including xpath, css and a search method. Someone told me xpath wouldn't work for this page so I spent even more time trying to get css to work. Someone else told me the page had Javascript, which would prevent Watir from working. So I tried rewriting the app for Selenium instead. Selenium did not solve the problem, and created a whole host of other problems.

Update: After following advice from the Tin Man, I've found that the node is not actually visible in the HTML when it is downloaded using curl.

I'm now trying to access the node using Watir instead of Nokogiri (as he suggested). Here's some of what I've tried so far:

avail_funds = browser.span :class => 'SPBalance'
avail_funds.exists?
avail_funds.text

avail_funds = browser.span(:css, 'span[customattribute]').text
avail_funds = browser.div(:id => "avail").a(:href => "/Profile/MyShares").span(:class => "SPBalance").text
avail_funds = browser.span(:xpath, ".//*[@id='avail']/a/span").text
avail_funds = browser.span(:css, 'span[class="SPBalance"]').text
avail_funds = browser.span.text
avail_funds = browser.div.text

browser.span(:class, "SPBalance").focus
avail_funds = browser.span(:class, "SPBalance").text 

avail_funds = @browser.span(:class => 'SPBalance').inner_html
puts @browser.spans(:class => "SPBalance")
puts @browser.span(:class => "SPBalance")

texts = @browser.spans(:class => "SPBalance").map do |span|
  span.text
end

So far all of the above return either blank lines or an error message.

The div with the ID "user-info" is visible within the HTML as downloaded via curl. Everything beneath it, however, is not visible.

When I try:

avail_funds = browser.div(:id => "user-info").text

I get only blank lines.

When I try:

avail_funds = browser.div(:class => "navbar navbar-default navbar-fixed-top hidden-xs hidden-sm").text

I get actual text back! But unfortunately the string does not contain the value I want.

I also tried:

puts browser.html

I thought that if the value were visible in that version of the HTML, as it is through my Firefox plug-in, I could parse down to the value I want. Unfortunately, the value is not visible in that version of the HTML either.

With the first two commands you fetch data directly from a table cell, starting from the root of the document; in the last one you start from the middle of the document.

Try giving the span an id and fetching the data again, then increase the complexity step by step until you find the error in your XPath.

The first problem is that you're using an overly long selector that references tags that don't exist:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<head>
<body class="cbp-spmenu-push">
<div id="FreshWidget" class="freshwidget-container responsive" data-html2canvas-ignore="true" style="display: none;">
<div id="freshwidget-button" class="freshwidget-button fd-btn-right" data-html2canvas-ignore="true" style="display: none; top: 235px;">
<link rel="stylesheet" href="/Content/css/NavPushComponent.css"/>
<script src="/Scripts/classie.js"/>
<script src="/Scripts/modernizr.custom.js"/>
<div class="navbar navbar-default navbar-fixed-top hidden-lg hidden-md" style="z-index: 1002">
<div class="banner-g">
<div class="container">
<div id="user-info">
<div id="acct-value">
<div id="committed">
<div id="avail">
<a href="/Profile/MyBalance">
AVAILABLE 
<span class="SPBalance">$31.59</span>
EOT

doc.at('tbody') # => nil

# The selectors from the question, both of which rely on tbody:
# ".//*[@id='openone']/div/div[2]/div[1]/div/div[2]/table/tbody/tr[2]/td[1]"
# ".//*[@id='opentwo']/div/div[2]/div[2]/div/div[2]/table/tbody/tr[2]/td[1]"

There is no <tbody> tag in your sample, and there rarely is in HTML created in the wild, especially if people created it manually. We usually see <tbody> in HTML someone copied from a browser's element inspector (Firebug or the developer tools), which shows the document after the engine has rewritten the markup while building the DOM. Don't use that output. Instead, ALWAYS go straight to the source: download the page with wget or curl and inspect it in an editor, or run nokogiri some_url on the command line and look at it there.

A second problem is your HTML snippet is invalid because it's full of unterminated tags. Nokogiri will do fixups on bad HTML, which can actually move nodes around, making it difficult to find nodes, especially when debugging. In this particular case Nokogiri is able to terminate them, but it's important to honor tag closures.

Here's what I'd use:

value = doc.at('span.SPBalance').text # => "$31.59"

This is using CSS which is usually much more readable than XPath. at means "find the first occurrence" and is equivalent to search('span.SPBalance').first .

The XPath equivalent would be:

doc.at('//span[@class="SPBalance"]')
doc.at('//span[@class="SPBalance"]').text # => "$31.59"

Once I have the value then it's easy to manipulate it.

value[/[\d.]+/].to_f # => 31.59

Moving on...

the third always returns a value of "0" even though it should return "$31.59" or "31.59"

'$31.58'.to_i # => 0
'$'.to_i # => 0
'31.58'.to_i # => 31
'$31.58'.to_f # => 0.0
'31.58'.to_f # => 31.58

The documentation for to_f and to_i says, respectively:

Returns the result of interpreting leading characters in str as a floating point number.

and

Returns the result of interpreting leading characters in str as an integer base base (between 2 and 36).

In both cases "leading characters" is significant.
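To make the "leading characters" rule concrete, here is a sketch that compares the question's gsub-then-to_i approach with extracting the number first. parse_amount is a hypothetical helper written for this example, not part of Ruby or Nokogiri.

```ruby
# Hypothetical helper: pull the first decimal number out of a currency string.
# Returns nil when the text holds no digits, so "missing" is not mistaken for 0.
def parse_amount(text)
  match = text.delete(',')[/\d+(?:\.\d+)?/]
  match && Float(match)
end

scraped = '$26.98'

# The question's approach: strip every non-digit, then convert.
# The decimal point is stripped along with the dollar sign.
digits_only = scraped.gsub(/\D/, '').to_i

# An empty string (for example, from a selector that matched nothing)
# has no leading digits at all, so to_i falls back to 0.
from_empty = ''.gsub(/\D/, '').to_i

amount  = parse_amount(scraped)
nothing = parse_amount('')

puts digits_only
puts from_empty
puts amount
puts nothing.inspect
```

Returning nil for text without digits keeps "the selector found nothing" distinguishable from a genuine balance of zero.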


using .focus causes the error message:

  undefined method `focus' for []:Nokogiri::XML::NodeSet (NoMethodError) 

I assume .focus doesn't work for Nokogiri.

You could always check the NodeSet documentation, which confirms that focus is not one of its methods.

