How to extract file properties of multiple XML files and combine them with the XML extracted data (using R)

I am fairly new to R and need some help to extract file names and file properties and combine them with data extracted from multiple XML files (about 200), which should then be converted into a data frame.

I am using the following script to select the XML files, extract the data and convert it into a data frame (it runs without errors):

library(XML)
library(plyr)

# Select multiple xml files within directory
FileName <- list.files(pattern = "xml$",
                       ignore.case=TRUE,
                       full.names = FALSE)

# Create function to extract data
RI_ID <-function(FileName) {
  doc1 <- xmlParse(FileName) 
  doc <- xmlToDataFrame(doc1["//ObjectList[@ObjectType='pkg']/o"])
} 

# Convert to dataframe
T1 <- ldply(FileName,RI_ID)

# Rename columns
names(T1)[names(T1) == "a"] <- "UniqueInstallationPackageID"
names(T1)[names(T1) == "b"] <- "PackageVersion_Latest"

# Convert to numeric
FieldToNumeric <- c("UniqueInstallationPackageID", "PackageVersion_Latest")
T1[,FieldToNumeric] <- lapply(T1[,FieldToNumeric], as.numeric)

I would like to (and need some help to):

  • extract the modified date of each XML file as it appears in Windows Explorer;
  • include the file name as well as the modified date as part of the final data frame.

I have reviewed the following two sources, but did not have any success in implementing them:

Due to a confidentiality agreement, I cannot share an example of the XML file, but, if need be, I can rename the nodes etc. and submit it. Thank you for your help.

Simply adjust the RI_ID function to retrieve those two pieces of information (the modified date/time via file.info() and the file name from the FileName argument) and bind those values as new columns of the XML data frame. The transform() call below adds columns to a data frame using comma-separated assignments:

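# Create function to extract data plus the file name and modified time
RI_ID <- function(FileName) {
  doc <- xmlParse(FileName)
  # transform() appends the two file-level columns to every extracted row;
  # as the last expression, its result is the function's return value
  transform(xmlToDataFrame(doc["//ObjectList[@ObjectType='pkg']/o"]),
            file_name = FileName,
            file_modified = file.info(FileName)$mtime)
}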
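
For anyone who wants to try the idea end to end, here is a minimal sketch under stated assumptions. The sample XML content, the temporary directory and the helper name read_one are made up for illustration; only the XPath and the a/b node names come from the question. It also swaps plyr::ldply for base R's do.call(rbind, lapply(...)), and lists files with full.names = TRUE so that file.info() works regardless of the working directory.

```r
library(XML)

# Hypothetical sample files in a temporary directory. The node names follow
# the question's XPath; the values are invented.
tmp_dir <- tempfile("xml_demo_")
dir.create(tmp_dir)
xml_text <- paste0(
  "<root><ObjectList ObjectType='pkg'>",
  "<o><a>1</a><b>10</b></o>",
  "<o><a>2</a><b>20</b></o>",
  "</ObjectList></root>"
)
writeLines(xml_text, file.path(tmp_dir, "first.xml"))
writeLines(xml_text, file.path(tmp_dir, "second.xml"))

# Full paths, so file.info() can locate each file from any working directory
files <- list.files(tmp_dir, pattern = "\\.xml$",
                    ignore.case = TRUE, full.names = TRUE)

# Hypothetical helper: one data frame per file, tagged with file metadata
read_one <- function(path) {
  doc <- xmlParse(path)
  df <- xmlToDataFrame(doc["//ObjectList[@ObjectType='pkg']/o"])
  df$file_name <- basename(path)              # name without the directory
  df$file_modified <- file.info(path)$mtime   # POSIXct, local time zone
  df
}

# Stack the per-file data frames (base R stand-in for plyr::ldply)
T1 <- do.call(rbind, lapply(files, read_one))

# Going through as.character() first keeps the conversion safe whether the
# columns were read as character or as factor
T1$a <- as.numeric(as.character(T1$a))
T1$b <- as.numeric(as.character(T1$b))

# Optional: a display string for the modified time (format chosen arbitrarily)
T1$file_modified_label <- format(T1$file_modified, "%Y-%m-%d %H:%M")
```

The mtime value is a POSIXct date-time, so it can be sorted or compared directly; format() is only needed when a specific display layout is wanted.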
