
How to read a large (~20 GB) XML file in R?

I want to read data from a large XML file (20 GB) and manipulate it. I tried to use xmlParse(), but it ran into memory problems before the file was loaded. Is there an efficient way to do this?

My data dump looks like this:

<tags>
    <row Id="106929" TagName="moto-360" Count="1"/>
    <row Id="106930" TagName="n1ql" Count="1"/>
    <row Id="106931" TagName="fable" Count="1" ExcerptPostId="25824355" WikiPostId="25824354"/>
    <row Id="106932" TagName="deeplearning4j" Count="1"/>
    <row Id="106933" TagName="pystache" Count="1"/>
    <row Id="106934" TagName="jitter" Count="1"/>
    <row Id="106935" TagName="klein-mvc" Count="1"/>
</tags>

In the XML package, the xmlEventParse function implements SAX: it reads the XML as a stream and calls your handler functions, so unlike xmlParse() it does not build the whole document tree in memory. If your XML is simple enough (repeating elements inside one root element), you can use the branches parameter to define a function for each element name you want to process.

Example:

library(XML)

# "Branch" function: called once for each complete <MedlineCitation> element.
# x is an XML node holding everything inside that element.
MedlineCitation <- function(x, ...) {
  # Find the <ArticleTitle> element(s) below this node and print the first one.
  ns <- getNodeSet(x, path = ".//ArticleTitle")
  if (length(ns) > 0) print(xmlValue(ns[[1]]))
}

Run the parser:

xmlEventParse(
  file = "http://www.nlm.nih.gov/databases/dtd/medsamp2015.xml", 
  handlers = NULL, 
  branches = list(MedlineCitation = MedlineCitation)
)

Solution with a closure:

As in Martin Morgan's answer (Storing-specific-xml-node-values-with-rs-xmleventparse):

# Closure: func and getStore share the environment "store".
branchFunction <- function() {
  store <- new.env()
  func <- function(x, ...) {
    ns <- getNodeSet(x, path = ".//ArticleTitle")
    if (length(ns) > 0) {
      value <- xmlValue(ns[[1]])
      print(value)
      # if storing something ...
      # store[[some_key]] <- some_value
    }
  }
  # Return everything stored so far as a plain list.
  getStore <- function() as.list(store)
  list(MedlineCitation = func, getStore = getStore)
}

myfunctions <- branchFunction()

xmlEventParse(
  file = "medsamp2015.xml", 
  handlers = NULL, 
  branches = myfunctions["MedlineCitation"]  # pass only the branch handler, not getStore
)

# Inspect what was stored
myfunctions$getStore()
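
For the flat <row .../> dump in the question, a plain SAX startElement handler may be enough, since every value lives in an attribute and no node needs to be built. The following is only a minimal sketch under a few assumptions: the attribute names are the ones shown in the question (Id, TagName, Count), makeRowCollector is a hypothetical helper rather than part of the XML package, and the small temporary file stands in for the real dump.

```r
library(XML)

# Hypothetical helper (not part of the XML package): returns a SAX
# startElement handler plus an accessor for the collected rows.
makeRowCollector <- function() {
  ids <- character()
  tagNames <- character()
  counts <- character()

  # Called by xmlEventParse for every opening tag; attrs is a named
  # character vector of the element's attributes (or NULL if it has none).
  startElement <- function(name, attrs, ...) {
    if (name == "row" && !is.null(attrs)) {
      n <- length(ids) + 1L
      # Indexing by name yields NA for attributes that are absent.
      ids[n] <<- unname(attrs["Id"])
      tagNames[n] <<- unname(attrs["TagName"])
      counts[n] <<- unname(attrs["Count"])
    }
  }

  # Assemble the collected attribute values into a data frame.
  getRows <- function() {
    data.frame(Id = as.integer(ids), TagName = tagNames,
               Count = as.integer(counts), stringsAsFactors = FALSE)
  }

  list(startElement = startElement, getRows = getRows)
}

# Small stand-in for the real dump, taken from the sample in the question.
sampleFile <- tempfile(fileext = ".xml")
writeLines(c(
  '<tags>',
  '  <row Id="106929" TagName="moto-360" Count="1"/>',
  '  <row Id="106930" TagName="n1ql" Count="1"/>',
  '  <row Id="106931" TagName="fable" Count="1" ExcerptPostId="25824355" WikiPostId="25824354"/>',
  '</tags>'
), sampleFile)

collector <- makeRowCollector()
invisible(xmlEventParse(
  file = sampleFile,
  handlers = list(startElement = collector$startElement),
  addContext = FALSE
))

tags <- collector$getRows()
```

To use it on the real data, pass the path of the dump as file instead of sampleFile. With a 20 GB input the collected rows may themselves not fit in memory, so the handler may need to aggregate values or write rows out in batches rather than keep them all.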
