How to search a single node, not all nodes

I'm using XPath selectors to select each item on a page (roughly 24) and then I'm using XPath selectors on each item to return values from each one.

Even though I'm running the XPath selectors on each subnode, the search seems to run across all subnodes, when I only want it to run over each subnode individually.

Here's the code that finds each item in the doc and then iterates over each html_listing. It passes each one to get_field_data_from:

def get_listing(doc,field_data = {})
  doc.xpath(get_listing_tag[:path]).each do |html_listing|
    fd = get_field_data_from(html_listing,field_data)
    if !field_data &&  fd.detect {|_,data| !data }
      set_uri doc.xpath(get_sub_page_tag[:path])
      get
      fd = get_listing(Nokogiri::HTML(body),fd)
    end
    yield fd
  end
end

get_field_data_from iterates over all the Fields I'm looking for. Each field name is used to retrieve the matching XPath selector, which holds the path string, using

selector = send("get_%s_tag" % field)

If the selector exists and the data has not already been found, it applies the XPath selector to the HTML item and stores the text using

res[field] = item.xpath(selector[:path]).inner_text

and then return the resulting hash to be used in the next iteration.

def get_field_data_from(item,data)
  Fields.inject(data) do |res,field|
    selector = send("get_%s_tag" % field)
    unless !selector || res[field]
      begin
        res[field] = item.xpath(selector[:path]).inner_text
      rescue Exception => e
        puts "Error for field: %s" % field
        raise e
      end
    end
    res
  end
end

Somehow, doing

res[field] = item.xpath(selector[:path]).inner_text

seems to search over all the items rather than just the given item listing. I know it's doing that because:

  1. Doing:

     puts item.xpath(selector[:path]).inner_text

     returns more than one result.

  2. I'm not actually looping over all the html_listings. Where get_listing yields the field data (yield fd), I break, so it only runs once.

I can't seem to figure out what's going on. Does someone else see it?

You need to anchor the XPath queries on the elements:

  • node.xpath("//example") does a global search
  • node.xpath(".//example") does a local search starting at the current node

Notice the leading dot (.), which anchors the query at the current node. Without it, the query runs from the document root, even though you call it on the current node.

If you are searching by tag name, consider using CSS selectors instead. They have fewer pitfalls than XPath: a CSS selector called on a node always searches from that node.

There's another, equally serious, problem.

item.xpath(selector[:path]).inner_text

xpath returns a NodeSet. inner_text will concatenate the result of all nodes in the NodeSet, resulting in a string that usually won't be what you want.

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <p>foo</p>
    <p>bar</p>
  </body>
</html>
EOT

doc.search('p').class # => Nokogiri::XML::NodeSet
doc.search('p').inner_text # => "foobar"

Instead you need to use map to walk the list of nodes, then get the text:

doc.search('p').map(&:inner_text) # => ["foo", "bar"]

or, for simplicity:

doc.search('p').map(&:text) # => ["foo", "bar"]

See also "How to avoid joining all text from Nodes when scraping".
