
From XML attributes to data.frame in R

I have an XML file that contains data like this:

<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" 
       AcceptedAnswerId="15" CreationDate="2010-07-19T19:12:12.510" Score="27" 
       ViewCount="1647" Body="some text;" OwnerUserId="8" 
       LastActivityDate="2010-09-15T21:08:26.077" 
       Title="title" AnswerCount="5" CommentCount="1" FavoriteCount="17" />
[...]

(The dataset is a dump from stats.stackexchange.com.)

How can I get a data.frame with the attributes "Id" and "PostTypeId"?

I have been trying with the XML library but I get to a point where I don't know how to unwrap the values:

library(XML)

xml <- xmlTreeParse("Posts.xml", useInternalNodes=TRUE)
types <- getNodeSet(xml, '//row/@PostTypeId')

> types[1]
[[1]]
PostTypeId 
       "1" 
attr(,"class")
[1] "XMLAttributeValue"

Which would be the proper R way of getting a projection of those two columns from the XML into a data.frame?

Using rvest (which is a wrapper around xml2), you can do it as follows:

require(rvest)
require(xml2)
require(magrittr)
# xml2 provides read_xml(), xml_find_all() and xml_attr(); they replace
# the older rvest helpers xml() and xml_nodes()
doc <- read_xml('<posts>
  <row Id="1" PostTypeId="1" 
AcceptedAnswerId="15" CreationDate="2010-07-19T19:12:12.510" Score="27" 
ViewCount="1647" Body="some text;" OwnerUserId="8" 
LastActivityDate="2010-09-15T21:08:26.077" 
Title="title" AnswerCount="5" CommentCount="1" FavoriteCount="17" />
</posts>')

rows <- doc %>% xml_find_all("//row")
# attribute names are case-sensitive in XML, so use "Id", not "id"
data.frame(
  Id = rows %>% xml_attr("Id"),
  PostTypeId = rows %>% xml_attr("PostTypeId")
)

Resulting in:

  Id PostTypeId
1  1          1

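If you read Comments.xml the same way and build the data frame with

# Comments.xml has a PostId attribute rather than PostTypeId;
# the column name is kept from the example above
dat <- data.frame(
  Id = rows %>% xml_attr("Id"),
  PostTypeId = rows %>% xml_attr("PostId"),
  score = rows %>% xml_attr("Score")
)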

You receive:

> head(dat)
  Id PostTypeId score
1  1          3     5
2  2          5     0
3  3          9     0
4  4          5    11
5  5          3     1
6  6         14     9

This is actually a great use-case for the xmlEventParse function in the XML package. This is a 200+ MB file and the last thing you want to do is waste memory needlessly (XML parsing is notoriously memory intensive) and waste time going through nodes multiple times.

By using xmlEventParse you can also filter what you do or do not need and you can also get a progress bar snuck in there so you can see what's going on.

library(XML)
library(data.table)

# count the <row> elements quickly; if you don't know the number or
# can't run this, over-estimate it and trim the unused rows of the
# data.frame afterwards
system("grep -c '<row' Posts.xml")
## 128010

n <- 128010

# pre-populate a data.frame
# you could also just write this data out to a file and read it back in
# which would negate the need to use global variables or pre-allocate
# a data.frame
dat <- data.frame(id=rep(NA_character_, n),
                  post_type_id=rep(NA_character_, n),
                  stringsAsFactors=FALSE)

# set up a progress bar since there are a lot of nodes
pb <- txtProgressBar(min=0, max=n, style=3)

# this function will be called for each <row>
# again, you could write to a file/database/whatever vs do this
# data.frame population
idx <- 1
process_row <- function(node, tribs) {
  # update the progress bar
  setTxtProgressBar(pb, idx)
  # get our data (you can filter here)
  dat[idx, "id"] <<- tribs["Id"]
  dat[idx, "post_type_id"] <<- tribs["PostTypeId"]
  # update the index
  idx <<- idx + 1
}

# start the parser
info <- xmlEventParse("Posts.xml", list(row=process_row))

# close up the progress bar
close(pb)

head(dat)
##   id post_type_id
## 1  1            1
## 2  2            1
## 3  3            1
## 4  4            1
## 5  5            2
## 6  6            1
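
For files that fit comfortably in memory, the same XML package can also answer the original question without event handlers. The following is only a minimal sketch, not part of the approach above: it assumes a small inline sample (the second row is made up for illustration) and uses xpathSApply() with xmlGetAttr() to pull one attribute per <row> into a plain character vector.

```r
library(XML)

# Inline sample; the second <row> is invented for illustration only.
sample_xml <- '<posts>
  <row Id="1" PostTypeId="1" Score="27" />
  <row Id="2" PostTypeId="2" Score="5" />
</posts>'

doc <- xmlParse(sample_xml, asText = TRUE)

# xpathSApply() calls xmlGetAttr() on every <row> node; the default
# keeps the result a character vector if an attribute is missing.
ids   <- xpathSApply(doc, "//row", xmlGetAttr, "Id",
                     default = NA_character_)
types <- xpathSApply(doc, "//row", xmlGetAttr, "PostTypeId",
                     default = NA_character_)

posts <- data.frame(Id = ids, PostTypeId = types,
                    stringsAsFactors = FALSE)
print(posts)
```

This parses the whole tree at once, so for a 200+ MB dump the event-based version above is likely the safer choice.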

A little easier than the other answer:

require(xml2)
read_xml('Posts.xml') -> doc
xml_children(doc) -> rows
data.frame(
   Id = as.numeric(xml_attr(rows,"Id"))
  ,PostTypeId = as.numeric(xml_attr(rows,"PostTypeId"))
) -> df

Compared with the rvest answer, this:
  1. needs no rvest / magrittr packages, only xml2
  2. converts the numeric strings to numeric
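
If more than two attributes are needed, the same xml2 calls can be wrapped in a small helper. This is a rough sketch under stated assumptions: attrs_to_df() is a hypothetical name, not an xml2 function, and the inline sample (including its second row) is made up for illustration. It relies on xml_attr() returning NA for attributes a row does not have.

```r
library(xml2)

# Hypothetical helper: one data.frame column per requested attribute.
attrs_to_df <- function(rows, cols) {
  values <- lapply(cols, function(col) xml_attr(rows, col))
  names(values) <- cols
  as.data.frame(values, stringsAsFactors = FALSE)
}

# Inline sample; the second <row> deliberately lacks a Score attribute.
doc <- read_xml('<posts>
  <row Id="1" PostTypeId="1" Score="27" />
  <row Id="2" PostTypeId="2" />
</posts>')

posts <- attrs_to_df(xml_children(doc), c("Id", "PostTypeId", "Score"))
print(posts)
```

The columns come back as character vectors; wrap them in as.numeric() as in the answer above where numbers are wanted.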
