
Import XML to R data frame

I am trying to import an XML file into R. It has the format shown below: one event per row, followed by a number of attributes, and which attributes appear depends on the event type. The file is 0.7 GB and future versions may be much bigger. I would like to create a data frame with each event on its own row and every possible attribute in its own column (so some cells will be empty, depending on the event type). I have looked elsewhere for answers, but they all seem to deal with XML files that have a nested tree structure, and I can't work out how to apply them to this format.

I am new to R and have no experience with XML files so please give me the "for dummies" answer with plenty of explanation. Thanks!

<?xml version="1.0" encoding="utf-8"?>
<events version="1.0">
    <event time="21510.0" type="actend" person="3" link="1" actType="h"  />
    <event time="21510.0" type="departure" person="3" link="1" legMode="car"  />
    <event time="21510.0" type="PersonEntersVehicle" person="3" vehicle="3"  />
    <event time="21510.0" type="vehicle enters traffic" person="3" link="1" vehicle="3" networkMode="car" relativePosition="1.0"  />

...

</events>

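You can try something like this with the xml2 package. The example parses a string; for a file on disk, pass its path to read_xml() instead: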

original_xml <- '<?xml version="1.0" encoding="utf-8"?>
<events version="1.0">
    <event time="21510.0" type="actend" person="3" link="1" actType="h"  />
    <event time="21510.0" type="departure" person="3" link="1" legMode="car"  />
    <event time="21510.0" type="PersonEntersVehicle" person="3" vehicle="3"  />
    <event time="21510.0" type="vehicle enters traffic" person="3" link="1" vehicle="3" networkMode="car" relativePosition="1.0"  />
</events>'
library(xml2)

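# Parse the document and take the children of <events>, i.e. one node per <event>
data2 <- xml_children(read_xml(original_xml))
# Collect every attribute name that occurs on any event
attr_names <- unique(names(unlist(xml_attrs(data2))))

# One column per attribute; events that lack an attribute get NA in that column.
# simplify = FALSE keeps a named list, so this also works when there is only one event.
xmlDataFrame <- as.data.frame(sapply(attr_names, function (attr) {
    xml_attr(data2, attr = attr)
}, simplify = FALSE), stringsAsFactors = FALSE)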

#-- all columns are character strings, so convert the numeric ones to numeric

num_cols <- c("time", "person", "link", "vehicle")
xmlDataFrame[num_cols] <- lapply(xmlDataFrame[num_cols], as.numeric)

If your data has other numeric attributes (for example relativePosition), add their names to the list of columns being converted so that they also get the proper class.
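
Since the question mentions a 0.7 GB file, here is a rough base-R sketch of an alternative that avoids building the whole XML tree in memory. It is not a real XML parser and relies on assumptions that hold for the sample but may not hold for your file: every event sits on a single line, attribute values are double-quoted, and values contain no XML entities such as &amp; (they would not be decoded). The names parse_event_line, read_events and chunk_size are made up for this illustration.

```r
# Split one '<event ... />' line into a named character vector of attributes.
# Assumption: values are double-quoted and contain no XML entities.
parse_event_line <- function(line) {
  m <- regmatches(line, gregexpr('[A-Za-z_][A-Za-z0-9_.:-]*="[^"]*"', line))[[1]]
  keys <- sub('=.*$', '', m)          # text before the first '='
  vals <- sub('^[^=]*="', '', m)      # drop  name="
  vals <- sub('"$', '', vals)         # drop the closing quote
  setNames(vals, keys)
}

# Read the file chunk_size lines at a time and build one data frame.
read_events <- function(path, chunk_size = 100000L) {
  con <- file(path, open = "r", encoding = "UTF-8")
  on.exit(close(con))
  chunks <- list()
  repeat {
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0L) break
    # Keep only <event ...> lines; skips <?xml ...?>, <events ...> and </events>
    lines <- lines[grepl('^[[:space:]]*<event[[:space:]]', lines)]
    if (length(lines) == 0L) next
    rows <- lapply(lines, parse_event_line)
    cols <- unique(unlist(lapply(rows, names)))
    # One character column per attribute; a missing attribute becomes NA
    chunks[[length(chunks) + 1L]] <- as.data.frame(
      sapply(cols, function(k) {
        vapply(rows, function(r) unname(r[k]), character(1))
      }, simplify = FALSE),
      stringsAsFactors = FALSE
    )
  }
  if (length(chunks) == 0L) return(data.frame())
  # Chunks may have seen different attributes, so align the columns first
  all_cols <- unique(unlist(lapply(chunks, names)))
  chunks <- lapply(chunks, function(d) {
    for (k in setdiff(all_cols, names(d))) d[[k]] <- NA_character_
    d[all_cols]
  })
  events <- do.call(rbind, chunks)
  # Let R guess the column types instead of listing numeric columns by hand
  events[] <- lapply(events, type.convert, as.is = TRUE)
  events
}

# Small demonstration using the sample from the question
sample_file <- tempfile(fileext = ".xml")
writeLines(c(
  '<?xml version="1.0" encoding="utf-8"?>',
  '<events version="1.0">',
  '    <event time="21510.0" type="actend" person="3" link="1" actType="h"  />',
  '    <event time="21510.0" type="departure" person="3" link="1" legMode="car"  />',
  '    <event time="21510.0" type="PersonEntersVehicle" person="3" vehicle="3"  />',
  '    <event time="21510.0" type="vehicle enters traffic" person="3" link="1" vehicle="3" networkMode="car" relativePosition="1.0"  />',
  '</events>'
), sample_file)

events <- read_events(sample_file, chunk_size = 2L)
str(events)
```

One thing to watch with type.convert: it turns a column into logical if every value looks like TRUE/FALSE/T/F, and it turns ID columns into numbers whenever all IDs happen to be numeric. If person or link IDs should stay text, convert only the columns you choose, as in the as.numeric step above.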
