I'm using XPath selectors to select each item on a page (roughly 24) and then I'm using XPath selectors on each item to return values from each one.
Even though I'm running the XPath selectors on the subnode, it seems to be searching across all subnodes, whereas I want the search restricted to each subnode individually.
Here's the code that searches for each item on the doc and then iterates over each html_listing. It then passes it to get_field_data_from:
def get_listing(doc, field_data = {})
  doc.xpath(get_listing_tag[:path]).each do |html_listing|
    fd = get_field_data_from(html_listing, field_data)
    if !field_data && fd.detect { |_, data| !data }
      set_uri doc.xpath(get_sub_page_tag[:path])
      get
      fd = get_listing(Nokogiri::HTML(body), fd)
    end
    yield fd
  end
end
get_field_data_from iterates over all the Fields I'm looking for. Each field name is used to retrieve the matching XPath selector with selector = send("get_%s_tag" % field). If the selector exists and the data has not already been found, it runs the XPath selector on the HTML item, stores the text using res[field] = item.xpath(selector[:path]).inner_text, and then returns the resulting hash to be used in the next iteration.
def get_field_data_from(item, data)
  Fields.inject(data) do |res, field|
    selector = send("get_%s_tag" % field)
    unless !selector || res[field]
      begin
        res[field] = item.xpath(selector[:path]).inner_text
      rescue Exception => e
        puts "Error for field: %s" % field
        raise e
      end
    end
    res
  end
end
Somehow, res[field] = item.xpath(selector[:path]).inner_text seems to search over all the items rather than just the given item listing. I know it's doing that because puts item.xpath(selector[:path]).inner_text returns more than one result.
I'm not actually looping over all the html_listings. Where get_listing yields the field data (yield fd), I do a break, so it only does it once.
I can't seem to figure out what's going on. Does someone else see it?
You need to anchor the XPath queries on the elements:
node.xpath("//example") does a global search.
node.xpath(".//example") does a local search starting at the current node.
Notice the leading dot (.), which anchors the query at the current node. Otherwise the query is run against the root node, even if you call it from the current node.
If you are searching by tag name consider using CSS selectors instead. They have fewer pitfalls than XPath. CSS always searches from the current node.
There's another, equally serious, problem:
item.xpath(selector[:path]).inner_text
xpath returns a NodeSet. inner_text will concatenate the text of all nodes in the NodeSet, resulting in a string that usually won't be what you want.
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
</body>
</html>
EOT
doc.search('p').class # => Nokogiri::XML::NodeSet
doc.search('p').inner_text # => "foobar"
Instead you need to use map to walk the list of nodes, then get the text:
doc.search('p').map(&:inner_text) # => ["foo", "bar"]
or, for simplicity:
doc.search('p').map(&:text) # => ["foo", "bar"]
See "How to avoid joining all text from Nodes when scraping" also.