I'm using Nokogiri/Ruby to parse a very large XML document (~300k lines). It's been taking around five minutes to process each record, and I determined that the last line in the code below is taking up 99% of that time. Any suggestions on how to speed up the search? Could it be an issue with system memory (or lack thereof) by any chance?
doc = Nokogiri::XML(File.read(ARGV[0]))
orders = doc.xpath("//order")
order = orders.xpath("//order[account_number=#{sap_account}]")
Try a single XPath using the full path from the root instead of //.
Example:
order = doc.at("/full/path/to/order[account_number=#{sap_account}]")
The // scans the entire document, so it is the first thing to get rid of when trying to improve performance.
If you really want to speed it up, use the SAX or Reader interfaces.
The Reader interface (as well as SAX) will be faster because it doesn't have to parse the entire document into a DOM; it simply passes through the document linearly, one node at a time. This gives you speed at the cost of convenience (no querying and no backtracking). Instead, you have to test each node for the conditions you want.
Here's an example using the Reader interface (which is a bit simpler than SAX). Say you have the following file:
<orders>
<order account_number="1">
<item>Foo</item>
</order>
<order account_number="2">
<item>Bar</item>
</order>
<order account_number="3">
<item>Baz</item>
</order>
</orders>
Let's say you want to pull out the <item> in the order with the account_number of 2. Here's the code:
require 'nokogiri'
filename = ARGV[0]
sap_account = "2"
File.open(filename) do |file|
  Nokogiri::XML::Reader.from_io(file).each do |node|
    # The reader yields every node in document order; test each one as it goes by.
    if node.name == 'order' && node.attribute('account_number') == sap_account
      puts node.inner_xml
    end
  end
end
Output:
<item>Bar</item>
While it's often useful to break searching for a node, or nodes, into steps, it really looks like you can do this in one:
doc = Nokogiri::XML(File.read(ARGV[0]))
order = doc.xpath("//order[account_number=#{sap_account}]")
If there can only be one occurrence of that node, use:
order = doc.at("//order[account_number=#{sap_account}]")
The difference is that xpath returns a NodeSet, which is a collection of Nodes. NodeSets support many of the same methods as Nodes, but the results can differ subtly because the methods are being applied to an Array-like structure instead of a single node. at returns the first matching node, so any further processing you do against the returned Node will only apply to that node and no others.
xpath is the XPath-specific version of search, with a matching css method for CSS selectors. search accepts both CSS and XPath selectors and determines which to use on the fly. Similarly, at has CSS and XPath counterparts, at_css and at_xpath respectively. I tend to use search and at, and only use the CSS and XPath variants when the XPath would be mistaken for CSS, causing Nokogiri to freak out.
Nokogiri should be pretty fast searching and finding //order[account_number=#{sap_account}], even in 300K lines, IF it has enough memory to play with.
If it doesn't, then give serious thought to importing the XML into a database and doing your searches there. XML isn't really meant to be used as a datastore, so processing against the XML file can go against the flow and make your life harder. Creating the schema and importing the data into a database, with indexed fields, can greatly speed up your processing.