From XML attributes to a data.frame in R

Question

I have an XML file containing data like this:

<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" 
       AcceptedAnswerId="15" CreationDate="2010-07-19T19:12:12.510" Score="27" 
       ViewCount="1647" Body="some text;" OwnerUserId="8" 
       LastActivityDate="2010-09-15T21:08:26.077" 
       Title="title" AnswerCount="5" CommentCount="1" FavoriteCount="17" />
[...]

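(The dataset is a dump from stats.stackexchange.com.)

How can I get a data.frame with the "Id" and "PostTypeId" attributes?

I have been trying the XML library, but I got to a point where I don't know how to unwrap the values: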

library(XML)

xml <- xmlTreeParse("Posts.xml", useInternalNodes = TRUE)
types <- getNodeSet(xml, '//row/@PostTypeId')

> types[1]
[[1]]
PostTypeId 
       "1" 
attr(,"class")
[1] "XMLAttributeValue"

What is the proper R way to project these two columns from the XML into a data.frame?

Answer 1

Using rvest (which is a wrapper around xml2), you can do it like this:

require(rvest)     # html_elements()
require(xml2)      # read_xml(), xml_attr()
require(magrittr)  # %>%
doc <- read_xml('<posts>
  <row Id="1" PostTypeId="1" 
AcceptedAnswerId="15" CreationDate="2010-07-19T19:12:12.510" Score="27" 
ViewCount="1647" Body="some text;" OwnerUserId="8" 
LastActivityDate="2010-09-15T21:08:26.077" 
Title="title" AnswerCount="5" CommentCount="1" FavoriteCount="17" />
</posts>')

# select every <row> element, then pull one attribute per column;
# attribute names are case-sensitive when the input is parsed as XML
rows <- doc %>% html_elements("row")
data.frame(
  Id = rows %>% xml_attr("Id"),
  PostTypeId = rows %>% xml_attr("PostTypeId")
)

This results in:

  Id PostTypeId
1  1          1

If you take Comments.xml with

dat <- data.frame(
  Id = rows %>% xml_attr("Id"),
  PostTypeId = rows %>% xml_attr("PostId"),
  score = rows %>% xml_attr("Score")
)

you get:

> head(dat)
  Id PostTypeId score
1  1          3     5
2  2          5     0
3  3          9     0
4  4          5    11
5  5          3     1
6  6         14     9
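
Not every row in a dump like this necessarily carries every attribute (AcceptedAnswerId, for example, only makes sense on some posts). The sketch below, which uses two made-up rows and only xml2 functions, illustrates that xml_attr() returns NA for a missing attribute, so the columns stay aligned row by row:

```r
library(xml2)

# Two hypothetical rows; the second one has no AcceptedAnswerId attribute.
doc <- read_xml('<posts>
  <row Id="1" PostTypeId="1" AcceptedAnswerId="15" />
  <row Id="2" PostTypeId="2" />
</posts>')

rows <- xml_find_all(doc, "//row")

dat <- data.frame(
  Id = xml_attr(rows, "Id"),
  PostTypeId = xml_attr(rows, "PostTypeId"),
  # xml_attr() yields NA for rows that lack the attribute
  AcceptedAnswerId = xml_attr(rows, "AcceptedAnswerId"),
  stringsAsFactors = FALSE
)
```

Because each column is built from the same node set, a missing attribute shows up as NA in that row instead of shifting later values up.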

Answer 2

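This is actually a good use case for the xmlEventParse function in the XML package. This is a 200+ MB file, and the last thing you want to do is waste memory needlessly (XML parsing is notoriously memory intensive) and waste time going through the nodes multiple times.

Using xmlEventParse also lets you filter what you do or don't need, and you can get a progress bar in there so you can see what's going on.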

library(XML)
library(data.table)

# get the # of <rows> quickly; you can approximate if you don't know the
# number or can't run this and then chop down the size of the data.frame
# afterwards
system("grep -c '<row' ~/Desktop/p1.xml")
## 128010

n <- 128010

# pre-populate a data.frame
# you could also just write this data out to a file and read it back in
# which would negate the need to use global variables or pre-allocate
# a data.frame
dat <- data.frame(id=rep(NA_character_, n),
                  post_type_id=rep(NA_character_, n),
                  stringsAsFactors=FALSE)

# set up a progress bar since there are a lot of nodes
pb <- txtProgressBar(min=0, max=n, style=3)

# this function will be called for each <row>
# again, you could write to a file/database/whatever vs do this
# data.frame population
idx <- 1
process_row <- function(node, tribs) {
  # update the progress bar
  setTxtProgressBar(pb, idx)
  # get our data (you can filter here)
  dat[idx, "id"] <<- tribs["Id"]
  dat[idx, "post_type_id"] <<- tribs["PostTypeId"]
  # update the index
  idx <<- idx + 1
}

# start the parser
info <- xmlEventParse("Posts.xml", list(row=process_row))

# close up the progress bar
close(pb)

head(dat)
##   id post_type_id
## 1  1            1
## 2  2            1
## 3  3            1
## 4  4            1
## 5  5            2
## 6  6            1
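
The row count above comes from a shell call to grep, which may not be available on every system. As a rough sketch of a base R alternative, the hypothetical helper count_rows() below reads the file in chunks and counts the lines containing "<row", which is the same thing grep -c reports. The small temporary file only stands in for Posts.xml, and the chunk size is an arbitrary assumption:

```r
# Hypothetical helper: count lines containing "<row" without loading
# the whole file at once (mirrors `grep -c '<row'`).
count_rows <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0L
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    total <- total + sum(grepl("<row", lines, fixed = TRUE))
  }
  total
}

# Small made-up file standing in for Posts.xml
path <- tempfile(fileext = ".xml")
writeLines(c(
  '<?xml version="1.0" encoding="utf-8"?>',
  '<posts>',
  '  <row Id="1" PostTypeId="1" />',
  '  <row Id="2" PostTypeId="2" />',
  '</posts>'
), path)

n <- count_rows(path)
```

Like grep -c, this counts matching lines, not elements, so it assumes at most one <row start tag per line.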

Answer 3

A bit easier than the other answers:

require(xml2)
doc <- read_xml('Posts.xml')
rows <- xml_children(doc)
df <- data.frame(
  Id = as.numeric(xml_attr(rows, "Id")),
  PostTypeId = as.numeric(xml_attr(rows, "PostTypeId"))
)

No rvest / magrittr packages, only xml2
Converts the strings containing numbers to numeric
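
When more than two attributes are needed, the same idea can be wrapped in a small function. The following is only a sketch: attrs_to_df() is a hypothetical helper name, and the two inline rows are made-up stand-ins for the real file.

```r
library(xml2)

# Hypothetical helper: build one column per requested attribute name.
attrs_to_df <- function(rows, attrs) {
  cols <- lapply(attrs, function(a) xml_attr(rows, a))
  names(cols) <- attrs
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Made-up sample standing in for Posts.xml
doc <- read_xml('<posts>
  <row Id="1" PostTypeId="1" Score="27" />
  <row Id="2" PostTypeId="2" Score="3" />
</posts>')

df <- attrs_to_df(xml_children(doc), c("Id", "PostTypeId", "Score"))

# All three columns hold numbers here, so convert them in one pass.
df[] <- lapply(df, as.numeric)
```

The blanket conversion at the end only suits attributes that are all numeric; text attributes such as Title or Body would need to be left as character.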

Answer 1
4 accepted 2015-10-01 21:18:27

Answer 2
3 2015-10-01 21:55:08

Answer 3
0 2018-12-15 22:32:53
