Multiple XML files to CSV in R

Question

I found this question and hrbrmstr's answer: "In R, how to extracting two values from XML file, looping over 5603 files and write to table". It works with, for example, the Crude dataset, but with my own dataset I get an error: Error in ans[[1]] : subscript out of bounds

setwd("LOCATION_OF_XML_FILES")

xmlfiles <- list.files(pattern = "*.xml")

dat <- ldply(seq(xmlfiles), function(i){
  doc <- xmlTreeParse(xmlfiles[i], useInternal = TRUE)
  teksti <- xmlValue(doc[["//body"]])
  file <- unlist(strsplit(xmlfiles[i],split=".",fixed=T))[1]
  return(data.frame(file,teksti)) 
})

head(dat)

write.csv(dat, "tekstit_xml.csv", row.names=FALSE)

My dataset is confidential, so I'm afraid I can't share it, but the structure is like this:

<?xml version="1.0" encoding="UTF-8"?>
<article>
  <body>
    flajslkfjlkjaslkjflkajlskjfasjdfjflkdsjalfjdsj
    "a lot of text, like a chapter of a book"
  </body>
</article>

If I remove the line teksti <- xmlValue(doc[["//body"]]), the code works, but when it is included I get an error:

Error in ans[[1]] : subscript out of bounds

Can you please help me?

EDIT: I tried it with 11 files and everything went well. But with all 530 XML files it still gives the error. The largest files have about 5000 words in them. Does a data.frame have a limit on its size?

Traceback:

 Error in ans[[1]] : subscript out of bounds
 8 `[[.XMLInternalDocument`(doc, "//body")
 7 doc[["//body"]]
 6 xmlValue(doc[["//body"]])
 5 FUN(X[[12L]], ...)
 4 lapply(pieces, .fun, ...)
 3 structure(lapply(pieces, .fun, ...), dim = dim(pieces))
 2 llply(.data = .data, .fun = .fun, ..., .progress = .progress,
   .inform = .inform, .parallel = .parallel, .paropts = .paropts)
 1 ldply(seq(xmlfiles), function(i) {
   doc <- xmlTreeParse(xmlfiles[i], useInternal = TRUE)
   teksti <- xmlValue(doc[["//body"]])
   file <- unlist(strsplit(xmlfiles[i], split = ".", fixed = T))[1] ...

Answer 1

One of your files is missing the "body" tag. Indexing a document with an XPath that matches nothing reproduces the error:

xmlValue(doc[["//bodyy"]])
Error in ans[[1]] : subscript out of bounds

You can use xpathSApply instead, which returns an empty list when the tag is missing

xpathSApply(doc, "//bodyy", xmlValue)
list()

and then add a check to your code that reports the file and stores NA instead of failing:

library(XML)
library(plyr)

dat <- ldply(seq_along(xmlfiles), function(i) {
  doc <- xmlParse(xmlfiles[i])
  # xpathSApply returns an empty list when no node matches the XPath
  teksti <- xpathSApply(doc, "//body", xmlValue)
  if (length(teksti) == 0) {
    print(paste("Warning: no body tag in", xmlfiles[i], i))
    teksti <- NA
  }
  # file name up to the first "."
  file <- unlist(strsplit(xmlfiles[i], split = ".", fixed = TRUE))[1]
  data.frame(file, teksti)
})
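
If you would rather find the offending files up front than discover them during the loop, a pre-check along these lines may help. This is only a minimal sketch: has_body is a hypothetical helper name, not something from the XML package, and the two temporary files exist purely so the example is self-contained. With real data you would point list.files at your own directory.

```r
library(XML)

# Hypothetical helper (not part of the XML package): TRUE if the file
# contains at least one <body> element anywhere in the document.
has_body <- function(path) {
  doc <- xmlParse(path)
  length(getNodeSet(doc, "//body")) > 0
}

# Two throwaway files for illustration: one with <body>, one without.
demo_dir <- tempfile("xmldemo")
dir.create(demo_dir)
writeLines("<article><body>some text</body></article>",
           file.path(demo_dir, "good.xml"))
writeLines("<article><p>no body here</p></article>",
           file.path(demo_dir, "bad.xml"))

xmlfiles <- list.files(demo_dir, pattern = "\\.xml$", full.names = TRUE)
ok <- vapply(xmlfiles, has_body, logical(1))

# Files that would trigger "subscript out of bounds" with doc[["//body"]]
missing_body <- basename(xmlfiles[!ok])
print(missing_body)
```

The traceback in the question shows FUN(X[[12L]], ...), which suggests the failure happened on the 12th element, so xmlfiles[12] is probably the first file worth inspecting.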
