
Parsing Large XML with Nokogiri

So I'm attempting to parse a 400k+ line XML file using Nokogiri.

The XML file has this basic format:

<?xml version="1.0" encoding="windows-1252"?>
<JDBOR date="2013-09-01 04:12:31" version="1.0.20 [2012-12-14]" copyright="Orphanet (c) 2013">
 <DisorderList count="6760">

  *** Repeated Many Times ***
  <Disorder id="17601">
  <OrphaNumber>166024</OrphaNumber>
  <Name lang="en">Multiple epiphyseal dysplasia, Al-Gazali type</Name>
  <DisorderSignList count="18">
    <DisorderSign>
      <ClinicalSign id="2040">
        <Name lang="en">Macrocephaly/macrocrania/megalocephaly/megacephaly</Name>
      </ClinicalSign>
      <SignFreq id="640">
        <Name lang="en">Very frequent</Name>
      </SignFreq>
    </DisorderSign>
  </DisorderSignList>
  </Disorder>
  *** Repeated Many Times ***

 </DisorderList>
</JDBOR>

Here is the code I've created to parse and return each DisorderSign id and name into a database:

require 'nokogiri'

sympFile = File.open("Temp.xml")
@doc = Nokogiri::XML(sympFile)
sympFile.close()
symptomsList = []

@doc.xpath("////DisorderSign").each do |x|
    signId = x.at('ClinicalSign').attribute('id').text()      
    name = x.at('ClinicalSign').element_children().text()
    symptomsList.push([signId, name])
end

symptomsList.each do |x|
    Symptom.where(:name => x[1], :signid => Integer(x[0])).first_or_create
end

This works perfectly on the test files I've used, although they were much smaller, around 10,000 lines.

When I attempt to run this on the large XML file, it simply does not finish. I left it on overnight and it seemed to just lock up. Is there any fundamental reason the code I've written would make this very memory intensive or inefficient? I realize I store every possible pair in a list, but that shouldn't be large enough to fill up memory.

Thank you for any help.

I see a few possible problems. First of all, this:

@doc = Nokogiri::XML(sympFile)

will slurp the whole XML file into memory as some sort of libxml2 data structure and that will probably be larger than the raw XML file.

Then you do things like this:

@doc.xpath(...).each

That may not be smart enough to produce an enumerator that just maintains a pointer to the internal form of the XML; it might be producing a copy of everything when it builds the NodeSet that xpath returns. That would give you another copy of most of the expanded-in-memory version of the XML. I'm not sure how much copying and array construction happens here, but there is room for a fair bit of memory and CPU overhead even if it doesn't duplicate everything.

Then you make your copy of what you're interested in:

symptomsList.push([signId, name])

and finally iterate over that array:

symptomsList.each do |x|
    Symptom.where(:name => x[1], :signid => Integer(x[0])).first_or_create
end

I find that SAX parsers work better with large data sets, but they are more cumbersome to work with. You could try creating your own SAX parser, something like this:

class D < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [ ])
    if(name == 'DisorderSign')
      @data = { }
    elsif(name == 'ClinicalSign')
      @key        = :sign
      @data[@key] = ''
    elsif(name == 'SignFreq')
      @key        = :freq
      @data[@key] = ''
    elsif(name == 'Name')
      @in_name = true
    end
  end

  def characters(str)
    @data[@key] += str if(@key && @in_name)
  end

  def end_element(name)
    if(name == 'DisorderSign')
      # Dump @data into the database here.
      @data = nil
    elsif(name == 'ClinicalSign')
      @key = nil
    elsif(name == 'SignFreq')
      @key = nil
    elsif(name == 'Name')
      @in_name = false
    end
  end
end

The structure should be pretty clear: you watch for the opening of the elements that you're interested in and do a bit of bookkeeping setup when they open, then cache the strings if you're inside an element you care about, and finally clean up and process the data as the elements close. Your database work would replace the

# Dump @data into the database here.

comment.

This structure makes it pretty easy to watch for the <Disorder id="17601"> elements so that you can keep track of how far you've gone. That way you can stop and restart the import with some small modifications to your script.

A SAX parser is definitely what you want to be using. If you're anything like me and can't jive with the Nokogiri documentation, there is an awesome gem called Saxerator that makes this process really easy.

An example of what you are trying to do:

require 'saxerator'

# Saxerator takes a File/IO; parsed elements are accessed with string
# keys, and XML attributes live under #attributes.
parser = Saxerator.parser(File.new("Temp.xml"))

parser.for_tag(:DisorderSign).each do |sign|
  sign_id = sign['ClinicalSign'].attributes['id']
  name    = sign['ClinicalSign']['Name']
  Symptom.where(:name => name, :signid => sign_id.to_i).first_or_create
end

You're likely running out of memory because symptomsList is getting too large. Why not perform the SQL within the xpath loop?

require 'nokogiri'

sympFile = File.open("Temp.xml")
@doc = Nokogiri::XML(sympFile)
sympFile.close()

@doc.xpath("////DisorderSign").each do |x|
  signId = x.at('ClinicalSign').attribute('id').text()      
  name = x.at('ClinicalSign').element_children().text()
  Symptom.where(:name => name, :signid => signId.to_i).first_or_create
end

It's possible too that the file is just too large for the buffer to handle. In that case you could chop it up into smaller temp files and process them individually.

You can also use Nokogiri::XML::Reader. It's more memory intensive than the Nokogiri::XML::SAX parser, but you can keep the XML structure, e.g.:

class NodeHandler < Struct.new(:node)
  def process
    # Node processing logic, e.g.:
    sign_id = node.at('ClinicalSign').attribute('id').text
    name    = node.at('ClinicalSign').element_children.text
  end
end


Nokogiri::XML::Reader(File.open('./test/fixtures/example.xml')).each do |node|
  if node.name == 'DisorderSign' && node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
    NodeHandler.new(
        Nokogiri::XML(node.outer_xml).at('./DisorderSign')
    ).process
  end
end

Based on this blog post.
