
R XML - cannot remove internal C nodes from memory

I have to parse ~2000 XML documents, extract certain nodes from each one, add them to a single document, and save it. I am using internal C nodes so that I can use XPath. The problem is that as I loop over the documents I cannot remove the internal C objects from memory, and I end up with more than 4 GB of used memory. I know the problem is not the loaded tree (I ran the loop with just loading and deleting the hash tree for each document); it is the filtered nodes or the root node.

Here is the code I am using. What am I missing so I can clear the memory at the end of each iteration?

library(XML)

xmlDoc <- xmlHashTree()
rootNode <- newXMLNode("root")

for (i in seq_along(all.docs)){

  # Read in the doc, filter out nodes, remove temp doc
  temp.xml <- xmlParse(all.docs[i])
  filteredNodes <- newXMLNode(all.docs[i],
                   xpathApply(temp.xml, "//my.node[@my.attr='my.value']"))
  free(temp.xml)
  rm(temp.xml)

  # Add filtered nodes to root node and get rid of them.
  addChildren(rootNode, filteredNodes)
  removeNodes(filteredNodes, free = TRUE)
  rm(filteredNodes)

}
# Add root node to doc and save that new log.
xmlDoc <- newXMLDoc(node = rootNode)
saveXML(xmlDoc, file = "MergedDocs.xml")

Thank you for your help

I have found no way to do this with 'XML' without memory leaks and long processing times. Luckily, 'xml2' can now create documents and nodes. For completeness, here is the solution using 'xml2'. If anyone knows of a way to do it with 'XML', do chime in.

library(xml2)
library(magrittr)  # provides %>%

xmlDoc <- xml_new_document() %>% xml_add_child("root")

for (i in seq_along(all.docs)){
 # Read in the log.
 rawXML <- read_xml(all.docs[i])

 # Filter relevant nodes and cast them to a list of children.
 tempNodes   <- xml_find_all(rawXML, "//my.node[@my.attr='my.value']")
 theChildren <- xml_children(tempNodes)

 # Get rid of the temp doc.
 rm(rawXML)

 # Add the filtered nodes to the log under a node named after the file name
 xmlDoc %>%
  xml_add_child(all.docs[i]) %>%
  xml_add_child(theChildren[[1]]) %>%
  invisible()

 # Remove the temp objects
 rm(tempNodes, theChildren)
}

# Save the merged document.
write_xml(xmlDoc, "MergedDocs.xml")
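
Building on the xml2 loop above, here is a minimal sketch of the same idea wrapped in a function. The helper name `merge_matching_nodes`, the fixed `doc` wrapper element with a `source` attribute (used instead of the file name as the element name), and the temporary demo files are all assumptions for illustration, not part of the original answer. Unlike the loop above, it copies every matching node rather than only the first child, and it writes the result with `write_xml()`.

```r
library(xml2)

# Hypothetical helper (not from the original post): collect every node that
# matches `xpath` from each file in `paths` under a single <root> element.
merge_matching_nodes <- function(paths, xpath, out_file) {
  merged <- xml_new_root("root")

  for (path in paths) {
    doc   <- read_xml(path)
    nodes <- xml_find_all(doc, xpath)

    # Assumption: a fixed <doc> wrapper with the file name in a `source`
    # attribute, because a file path is not always a valid element name.
    xml_add_child(merged, "doc", source = basename(path))

    # Look the new wrapper up as the last child of <root>.
    wrappers <- xml_children(merged)
    wrapper  <- wrappers[[length(wrappers)]]

    # xml_add_child() copies an existing node by default (.copy = TRUE),
    # so the merged tree holds its own copy of each matching node.
    for (j in seq_along(nodes)) {
      xml_add_child(wrapper, nodes[[j]])
    }

    # Drop the references so R can garbage-collect the parsed document.
    rm(doc, nodes)
  }

  write_xml(merged, out_file)
  invisible(merged)
}

# Small demo with two throwaway input files.
inputs <- c(tempfile(fileext = ".xml"), tempfile(fileext = ".xml"))
writeLines('<log><my.node my.attr="my.value">a</my.node><my.node my.attr="other">b</my.node></log>', inputs[1])
writeLines('<log><my.node my.attr="my.value">c</my.node></log>', inputs[2])

out_file <- tempfile(fileext = ".xml")
merged <- merge_matching_nodes(inputs, "//my.node[@my.attr='my.value']", out_file)
```

For the real job, `inputs` would be replaced by `all.docs` and `out_file` by "MergedDocs.xml". Whether this keeps memory flat across ~2000 files depends on when R runs garbage collection, so it is worth checking on a subset first.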
