Identifying all possible parents and children in a list

Question

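I have 80,000 XML files that supposedly share the same format. That is clearly not the case, so I am trying to identify every node and child node that occurs in the files.

I have imported the XML files as lists using the XML package. My input and desired output are described below.

Input (a list of lists):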

XML1 <- list(name = "Company Number 1", 
             adress = list(street = "JP Street", number = "12"), 
             product = "chicken")

XML2 <- list(name = "Company Number 2", 
             company_adress = list(street = "House Street", number = "93"), 
             invoice = list(quantity = "2", product = "phone"))

XML3 <- list(company_name = "Company Number 3", 
             adress = list(street = "Lake Street", number = "1"), 
             invoice = list(quantity = "2", product = "phone", list(note = "Phones are refurbished")))

Output (tree structure across files, with the number of occurrences at each leaf):

List of 6
 $ name          : num 2
 $ company_name  : num 1
 $ adress        :List of 2
  ..$ street: num 2
  ..$ number: num 2
 $ company_adress:List of 2
  ..$ street: num 1
  ..$ number: num 1
 $ invoice       :List of 3
  ..$ quantity: num 2
  ..$ product : num 2
  ..$         :List of 1
  .. ..$ note: num 1
 $ product       : num 1

Is there a package that can do this, or do I need to write a function for it myself?

Answer 1

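I wrote a recursive function to solve the problem. It is not very elegant, but it does the job.

The function takes a nested list (tree) and an empty vector (position) that records the path to the current node. Results accumulate in the global summary_tree.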

# Summary tree for storing results
summary_tree <- list()

# Function
tree_merger <- function(tree, position) {
  # Testing if at the leaf of a tree
  if (is.character(tree) || is.null(tree)) {
    # Leaf reached: nothing further to count
    invisible(NULL)
  } else {
    # Position in tree
    if (length(position) == 0) {
      # Names of nodes
      tree_names <- names(tree)

      # Adding one to each name
      for (i in seq_along(tree_names)) {
        if (is.null(summary_tree[[tree_names[i]]])) {
          summary_tree[[tree_names[i]]] <<- list(1)
        } else {
          # Increment the count in place so that existing children are kept
          summary_tree[[tree_names[i]]][[1]] <<- summary_tree[[tree_names[i]]][[1]] + 1
        }

        # Running function on new tree
        tree_merger(tree[[tree_names[i]]], c(position, tree_names[i]))
      }
    } else {
      # Names of nodes
      tree_names <- names(tree)

      # Finding position in tree to save information
      position_string <- NULL
      for (p in position) {
        position_string <- paste(position_string, "[[\"", p, "\"]]", sep = "")
      }
      position_string <- paste("summary_tree", position_string, sep = "")

      # Adding one to each position
      for (i in seq_along(tree_names)) {
        position_string_full <- paste(position_string, "[[\"", tree_names[i], "\"]]", sep = "")

        # Adding to position
        if (is.null(eval(parse(text = position_string_full)))) {
          eval(parse(text = paste(position_string_full, "<<- list(1)")))
        } else {
          # Increment only the count so that existing children are preserved
          eval(parse(text = paste0(position_string_full, "[[1]] <<- ", position_string_full, "[[1]] + 1")))
        }

        # Running function on new tree
        tree_merger(tree[[tree_names[i]]], c(position, tree_names[i]))
      }
    }
  }
}

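If anyone runs into the same problem, note that the condition for exiting the recursion may need to be changed. In my XML files, every leaf ends in either a character string or NULL. In other nested lists, leaves may hold other types of values.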
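
For lists whose leaves are not always strings or NULL, one option is to treat anything that is not a list as a leaf. The following is a minimal sketch of that idea, not the code above: count_nodes and simplify_counts are hypothetical helper names, the "<unnamed>" key for children without a name is an assumption, and the accumulator is passed around explicitly instead of being stored in a global variable.

```r
# Hypothetical helper: merge one nested list into an accumulator.
# Each accumulator node is list(count = <occurrences>, children = <named list>).
count_nodes <- function(tree, acc = list()) {
  # Assumption: anything that is not a list is a leaf, whatever its type
  if (!is.list(tree)) return(acc)

  node_names <- names(tree)
  if (is.null(node_names)) node_names <- rep("", length(tree))
  # Unnamed children get a placeholder key so they can still be counted
  node_names[is.na(node_names) | node_names == ""] <- "<unnamed>"

  for (i in seq_along(tree)) {
    key <- node_names[i]
    node <- acc[[key]]
    if (is.null(node)) node <- list(count = 0, children = list())
    node$count <- node$count + 1
    node$children <- count_nodes(tree[[i]], node$children)
    acc[[key]] <- node
  }
  acc
}

# Hypothetical helper: collapse the accumulator into the shape asked for in
# the question (a count at each leaf, a nested list at each branch).
simplify_counts <- function(acc) {
  lapply(acc, function(node) {
    if (length(node$children) == 0) node$count else simplify_counts(node$children)
  })
}

# Usage with the three example lists from the question
XML1 <- list(name = "Company Number 1",
             adress = list(street = "JP Street", number = "12"),
             product = "chicken")
XML2 <- list(name = "Company Number 2",
             company_adress = list(street = "House Street", number = "93"),
             invoice = list(quantity = "2", product = "phone"))
XML3 <- list(company_name = "Company Number 3",
             adress = list(street = "Lake Street", number = "1"),
             invoice = list(quantity = "2", product = "phone",
                            list(note = "Phones are refurbished")))

files <- list(XML1, XML2, XML3)
tree <- Reduce(function(acc, x) count_nodes(x, acc), files, list())
str(simplify_counts(tree))
```

Note that simplify_counts shows only the children of a node that is a leaf in some files and a branch in others; the full counts remain available in tree.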

Solution 1
0 Accepted 2017-04-25 11:17:22