I am pretty new to R. Yesterday I scraped a website that requires a login; the page is in XML format, like this:
<result status="success">
<code>1</code>
<note>success</note>
<teacherList>
<teacher id="D95">
<name>Mary</name>
<department id="420">
<name>Math</name>
</department>
<department id="421">
<name>Statistics</name>
</department>
</teacher>
<teacher id="D73">
<name>Adam</name>
<department id="412">
<name>English</name>
</department>
</teacher>
</teacherList>
</result>
I then converted the XML to a list:
library(XML)
library(rvest)
library(plyr)
library(dplyr)
library(httr)
library(pipeR)
library(xml2)
url.address <- "http://xxxxxxxxxxxxxxxxx"
session <-html_session(url.address)
form <-html_form(read_html(url.address))[[1]]
filled_form <- set_values(form,
"userid" = "id",
"Password" = "password")
s <- submit_form(session,filled_form)
z = read_xml(s$response)
z1 = as_list(z)
z2 <- z1$teacherList
Now I need to extract the data from the list and turn it into a data frame. Note that some people belong to two departments, while others belong to only one. Part of the list z2 looks like this:
z2[[1]]
$name
$name[[1]]
[1] "Mary"
$department
$department$name
$department$name[[1]]
[1] "Math"
attr(,"id")
[1] "420"
$department
$department$name
$department$name[[1]]
[1] "statistics"
attr(,"id")
[1] "421"
attr(,"id")
[1] "D95"
When I extracted them one by one, it took too long:
attr(z2[[1]],"id")
"D95"
z2[[1]][[1]][[1]]
"Mary"
z2[[1]][[2]][[1]][[1]]
"Math"
attr(z2[[1]][[2]], "id")
"420"
z2[[1]][[3]][[1]][[1]]
"statistics"
attr(z2[[1]][[3]], "id")
"421"
attr(z2[[2]],"id")
"D73"
z2[[2]][[1]][[1]]
"Adam"
z2[[2]][[2]][[1]][[1]]
"English"
attr(z2[[2]][[2]],"id")
"412"
So I tried to write a loop:
for (x in 1:2){
for (y in 2:3){
a <- attr(z2[[x]],"id")
b <- z2[[x]][[1]][[1]]
d <- z2[[x]][[y]][[1]][[1]]
e <- attr(z2[[x]][[y]],"id")
g <- cbind(print(a),print(b),print(d),print(e))
}}
But it doesn't work at all, because some people belong to only one department. The result I expected:
Any advice would be appreciated!
dput(head(z2, 10))
structure(list(teacher = structure(list(name = list("Mary"),
department = structure(list(name = list("Math")), .Names = "name", id = "420"),
department = structure(list(name = list("statistics")), .Names = "name", id = "421")), .Names = c("name",
"department", "department"), id = "D95"), teacher = structure(list(
name = list("Adam"), department = structure(list(name = list(
"English")), .Names = "name", id = "412")), .Names = c("name",
"department"), id = "D73"), teacher = structure(list(name = list(
"Kevin"), department = structure(list(name = list("Chinese")), .Names = "name", id = "201")), .Names = c("name",
"department"), id = "D101"), teacher = structure(list(name = list(
"Nana"), department = structure(list(name = list("Science")), .Names = "name", id = "205")), .Names = c("name",
"department"), id = "D58"), teacher = structure(list(name = list(
"Nelson"), department = structure(list(name = list("Music")), .Names = "name", id = "370")), .Names = c("name",
"department"), id = "D14"), teacher = structure(list(name = list(
"Esther"), department = structure(list(name = list("Medicine")), .Names = "name", id = "361")), .Names = c("name",
"department"), id = "D28"), teacher = structure(list(name = list(
"Mia"), department = structure(list(name = list("Chemistry")), .Names = "name", id = "326")), .Names = c("name",
"department"), id = "D17"), teacher = structure(list(name = list(
"Jack"), department = structure(list(name = list("German")), .Names = "name", id = "306")), .Names = c("name",
"department"), id = "D80"), teacher = structure(list(name = list(
"Tom"), department = structure(list(name = list("French")), .Names = "name", id = "360")), .Names = c("name",
"department"), id = "D53"), teacher = structure(list(name = list(
"Allen"), department = structure(list(name = list("Spanish")), .Names = "name", id = "322")), .Names = c("name",
"department"), id = "D18")), .Names = c("teacher", "teacher",
"teacher", "teacher", "teacher", "teacher", "teacher", "teacher", "teacher",
"teacher"))
This was a bit crazy to construct, but I think it more or less conforms with the desired output posted in a previous version of the post. I had to use sapply within the lapply function to pull out the second ID variable.
do.call(rbind, # rbind list of data.frames output by lapply
lapply(unname(z2), # loop through list, first drop outer names
function(x) { # begin lapply function
temp <- unlist(x) # unlist inner elements to a vector
data.frame(name=temp[names(temp) == "name"], # subset on names
dept=temp[names(temp) == "department.name"], # subset on dept
id=attr(x, "id"), # extract one id
id2=unlist(sapply(x, attr, "id")), # extract other id
row.names=NULL) # end data.frame function, drop row.names
})) # end lapply function, lapply, and do.call
This returns:
     name       dept   id id2
1    Mary       Math  D95 420
2    Mary statistics  D95 421
3    Adam    English  D73 412
4   Kevin    Chinese D101 201
5    Nana    Science  D58 205
6  Nelson      Music  D14 370
7  Esther   Medicine  D28 361
8     Mia  Chemistry  D17 326
9    Jack     German  D80 306
10    Tom     French  D53 360
11  Allen    Spanish  D18 322
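To see why the code above works, it can help to run its pieces on a single teacher. The following is a minimal sketch in base R; the object names mary, flat and dept_ids are made up for illustration, and the list is built by hand to mirror the first element of the dput() output in the question.

```r
# One teacher, built by hand to mirror the dput() output in the question
# (the object name `mary` is hypothetical)
mary <- structure(
  list(
    name = list("Mary"),
    department = structure(list(name = list("Math")), id = "420"),
    department = structure(list(name = list("statistics")), id = "421")
  ),
  id = "D95"
)

# unlist() flattens the nesting and pastes the nested names together,
# so the department names end up under "department.name"
flat <- unlist(mary)
print(names(flat))
# "name" "department.name" "department.name"

# sapply() asks every element for its "id" attribute; the name element
# has none and returns NULL, which unlist() then drops
dept_ids <- unlist(sapply(mary, attr, "id"))

# data.frame() recycles the single name and teacher id across both departments
print(data.frame(name = flat[names(flat) == "name"],
                 dept = flat[names(flat) == "department.name"],
                 id   = attr(mary, "id"),
                 id2  = dept_ids,
                 row.names = NULL))
```

The recycling in the last step is what lets a teacher with two departments produce two rows, while a teacher with one department produces a single row.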
The structure of the second list differs from the initial example in two ways. First, one level of nesting is removed: the depth of the new list is one less than that of the initial example, as if you had provided z2[[1]] as the initial list. Second, the second example is missing what I initially called id (values such as D95 and D101).
With a bit of manipulation of the original code, I got this to work with
lapply(list(z3), # wrap z3 in a list to restore the expected depth
function(x) { # begin lapply function
temp <- unlist(x) # unlist inner elements to a vector
data.frame(name=temp[names(temp) == "name"], # subset on names
dept=temp[names(temp) == "department.name"], # subset on dept
# id=attr(x, "id"), # extract one id
id2=unlist(sapply(x, attr, "id")), # extract other id
row.names=NULL) # end data.frame function, drop row.names
})
These changes address the two differences mentioned above. First, z2 is replaced by list(z3) as the first argument to lapply, which restores the needed list depth. Second, the line id=attr(x, "id"), in the inner function has been commented out, because the teacher-level id does not exist in this list.
XML is generally really easy to deal with in R. Use library(XML) and library(plyr) to avoid having to write loops.
Step one is to read in the XML. I saved your sample XML as a .xml file called Demo.xml. You can also pass xmlParse a URL.
rawXML <- xmlParse("Demo.xml")
Then convert XML to list:
xmlList <- xmlToList(rawXML)
Then convert the list to a data frame with plyr:
df1 <- ldply(xmlList, data.frame)
This is the general process. If you provide sample data, we can refine it to match your specific use case.
Here's the resulting summary output. Is this what you're looking for?
str(df1)
'data.frame': 4 obs. of 12 variables:
$ .id : chr "code" "note" "teacherList" ".attrs"
$ X..i.. : Factor w/ 2 levels "1","success": 1 2 NA 2
$ teacher.name : Factor w/ 1 level "Mary": NA NA 1 NA
$ teacher.department.name : Factor w/ 1 level "Math": NA NA 1 NA
$ teacher.department..attrs : Factor w/ 1 level "420": NA NA 1 NA
$ teacher.department.name.1 : Factor w/ 1 level "Statistics": NA NA 1 NA
$ teacher.department..attrs.1: Factor w/ 1 level "421": NA NA 1 NA
$ teacher..attrs : Factor w/ 1 level "D95": NA NA 1 NA
$ teacher.name.1 : Factor w/ 1 level "Adam": NA NA 1 NA
$ teacher.department.name.2 : Factor w/ 1 level "English": NA NA 1 NA
$ teacher.department..attrs.2: Factor w/ 1 level "412": NA NA 1 NA
$ teacher..attrs.1 : Factor w/ 1 level "D73": NA NA 1 NA
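The ldply result above is wide, with one column per nested value. If one row per teacher and department is wanted instead, a possible alternative is to query the parsed document with XPath. This is only a sketch under a few assumptions: the XML is pasted in as a string rather than read from Demo.xml, the names xml_string, teachers, rows and df2 are made up for illustration, and every teacher has at least one department.

```r
library(XML)

# Sample XML from the question, pasted as a string (assumption: the real
# document has the same layout)
xml_string <- '<result status="success">
  <code>1</code>
  <note>success</note>
  <teacherList>
    <teacher id="D95">
      <name>Mary</name>
      <department id="420"><name>Math</name></department>
      <department id="421"><name>Statistics</name></department>
    </teacher>
    <teacher id="D73">
      <name>Adam</name>
      <department id="412"><name>English</name></department>
    </teacher>
  </teacherList>
</result>'

doc <- xmlParse(xml_string, asText = TRUE)

# One node per teacher
teachers <- getNodeSet(doc, "//teacher")

# Build a small data frame per teacher; the single name and teacher id are
# recycled across however many departments that teacher has
rows <- lapply(teachers, function(t) {
  data.frame(
    name = xpathSApply(t, "./name", xmlValue),
    dept = xpathSApply(t, "./department/name", xmlValue),
    id   = xmlGetAttr(t, "id"),
    id2  = xpathSApply(t, "./department", xmlGetAttr, "id"),
    stringsAsFactors = FALSE
  )
})

df2 <- do.call(rbind, rows)
print(df2)
```

A teacher with no department element would make data.frame() fail here because the column lengths would differ, so that case would need separate handling.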