Extremely slow xpath search (ruby/nokogiri)

Question

I'm using Nokogiri/Ruby to parse a very large XML document (~300k lines). It's been taking around five minutes to process each record, and I determined that the last line in the code below is taking up 99% of that time. Any suggestions on how to speed up the search? Could it be an issue with system memory (or lack thereof) by any chance?

doc = Nokogiri::XML(File.read(ARGV[0]))
orders = doc.xpath("//order")

order = orders.xpath("//order[account_number=#{sap_account}]")

Answer 1

A quick fix

Try a single XPath using the full path from the root instead of //.

Example:

order = doc.at("/full/path/to/order[account_number=#{sap_account}]")

The // scans the entire document, so it is the first thing to get rid of when trying to improve performance.

If you really want to speed it up, use the SAX or Reader interfaces.

Real speed: the Reader interface

The Reader interface (as well as SAX) will be faster because it doesn't have to parse the entire document into a DOM; it simply passes through the document linearly, one node at a time. This gives you speed at the expense of convenience (no querying and no backtracking). Instead, you have to test each node for the conditions you want.

Here's an example using the Reader interface (which is a bit simpler than SAX). Say you have the following file:

<orders>
  <order account_number="1">
    <item>Foo</item>
  </order>
  <order account_number="2">
    <item>Bar</item>
  </order>
  <order account_number="3">
    <item>Baz</item>
  </order>
</orders>

Let's say you want to pull out the <item> in the order with the account_number of 2. Here's the code:

require 'nokogiri'
filename = ARGV[0]
sap_account = "2"

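File.open(filename) do |file|
  # Stream the file one node at a time instead of building a DOM.
  Nokogiri::XML::Reader.from_io(file).each do |node|
    # Reader#attribute returns the attribute's value as a String, or nil.
    if node.name == 'order' && node.attribute('account_number') == sap_account
      puts node.inner_xml
    end
  end
end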

Output:

<item>Bar</item>

Answer 2

While it's often useful to break searching for a node, or nodes, into steps, it really looks like you can do this in one:

doc = Nokogiri::XML(File.read(ARGV[0]))
order = doc.xpath("//order[account_number=#{sap_account}]")

If there can only be one occurrence of that node, use:

order = doc.at("//order[account_number=#{sap_account}]")

The difference is that xpath returns a NodeSet, which is a collection of Nodes. NodeSets support many of the same methods, but they can result in subtle differences because they're being applied to an Array-like structure instead of a single node. at returns the first matching node, so any further processing you do against the returned Node will only apply to that node and no others.

xpath is the XPath-specific version of search, with a matching css method for CSS selectors. search accepts both CSS and XPath selectors and determines which to use on the fly. Similarly, at has CSS and XPath counterparts, at_css and at_xpath respectively. I tend to use search and at, and only use the CSS and XPath variants when the XPath would be mistaken for CSS, causing Nokogiri to freak out.

Nokogiri should be pretty fast at finding //order[account_number=#{sap_account}], even in 300K lines, if it has enough memory to work with.

If it doesn't, then give serious thought to importing the XML into a database and doing your searches there. XML isn't really meant to be used as a datastore, so processing against the XML file can go against the flow and make your life harder. Creating the schema and importing the data into a database, with indexed fields, can greatly speed up your processing.

Answer 1: score 3, accepted, 2013-11-08 01:52:43
Answer 2: score 1, 2013-11-08 02:41:42
