
Ruby LibXML skip large nodes

I have an XML file that contains a very large text node (>10 MB). While reading the file, is it possible to skip (ignore) this node?

I tried the following:

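 require 'libxml'
 include LibXML

 # path is a filename, so use Reader.file (Reader.io expects an IO object)
 reader = XML::Reader.file(path)
 while reader.read
  next if reader.name == 'huge-node'
 end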

But this still fails with the error "parser error : xmlSAX2Characters: huge text node".

The only other solution I can think of is to first read the file as a string and remove the huge node through a gsub, and then parse the file. However, this method seems very inefficient.

That's probably because, by the time you try to skip the node, the reader has already read it. According to the documentation for the #read method:

reader.read -> nil|true|false
Causes the reader to move to the next node in the stream, exposing its properties.

Returns true if a node was successfully read or false if there are no more nodes to read. On errors, an exception is raised.

You would need to skip the node before calling #read on it. There are probably several ways to do that, but it doesn't look like the streaming XML::Reader supports XPath expressions; otherwise I would suggest something along those lines.

EDIT: The question was clarified so that the SAX parser is a required part of the solution. I have removed links that would not be helpful given this constraint.

You don't have to skip the node. The error occurs because, since version 2.7.3, libxml2 limits the size of a single text node to 10 MB. This limit can be lifted with the XML_PARSE_HUGE parser option.

Below is an example in PHP, where the option is exposed as LIBXML_PARSEHUGE:

# Reads entire file into a string
$result = file_get_contents("https://www.ncbi.nlm.nih.gov/gene/68943?report=xml&format=text");
# Parses the XML string into an object
$xml = simplexml_load_string($result, 'SimpleXMLElement', LIBXML_COMPACT | LIBXML_PARSEHUGE);
