
Using loop to extract data from list in R

I am pretty new to R. Yesterday I scraped a website that requires a login; the page is in XML format, like the sample below.

<result status="success">
  <code>1</code>
  <note>success</note>
  <teacherList>
    <teacher id="D95">
      <name>Mary</name>
      <department id="420">
        <name>Math</name>
      </department>
      <department id="421">
        <name>Statistics</name>
      </department>
    </teacher>
    <teacher id="D73">
      <name>Adam</name>
      <department id="412">
        <name>English</name>
      </department>
    </teacher>
  </teacherList>
</result> 

I then converted the XML to a list:

library(XML)
library(rvest)
library(plyr)
library(dplyr)
library(httr)
library(pipeR)
library(xml2)

url.address <- "http://xxxxxxxxxxxxxxxxx"
session <- html_session(url.address)
form <- html_form(read_html(url.address))[[1]]
filled_form <- set_values(form,
                          "userid" = "id",
                          "Password" = "password")
s <- submit_form(session, filled_form)  # log in with the filled form
z <- read_xml(s$response)               # parse the XML response
z1 <- as_list(z)                        # convert the XML to a nested list
z2 <- z1$teacherList

Now I need to extract the data from the list and turn it into a data frame. Note that some people belong to two departments, while others belong to only one. Part of the list z2 looks like this:

z2[[1]]

$name
$name[[1]]
[1] "Mary"


$department
$department$name
$department$name[[1]]
[1] "Math"


attr(,"id")
[1] "420"

$department
$department$name
$department$name[[1]]
[1] "statistics"


attr(,"id")
[1] "421"

attr(,"id")
[1] "D95236"

When I extracted them one by one, it took too long:

attr(z2[[1]],"id")

"D95"

z2[[1]][[1]][[1]] 

"Mary"

z2[[1]][[2]][[1]][[1]] 

"Math"

attr(z2[[1]][[2]], "id") 

"420"

z2[[1]][[3]][[1]][[1]] 

"statistics"

attr(z2[[1]][[3]], "id")

"421"

attr(z2[[2]],"id")

"D73"

z2[[2]][[1]][[1]] 

"Adam"

z2[[2]][[2]][[1]][[1]]

"English"

attr(z2[[2]][[2]],"id")

"412"

So I tried to write a loop:

for (x in 1:2){
  for (y in 2:3){
  a <- attr(z2[[x]],"id")
  b <- z2[[x]][[1]][[1]]
  d <- z2[[x]][[y]][[1]][[1]]
  e <- attr(z2[[x]][[y]],"id")
  g <- cbind(print(a),print(b),print(d),print(e))
  }}

but it doesn't work at all, since some people belong to only one department. The result I expected:

[image of the expected output, not preserved]

Any advice would be appreciated!

dput(head(z2, 10))

structure(list(teacher = structure(list(name = list("Mary"), 
    department = structure(list(name = list("Math")), .Names = "name", id = "420"), 
    department = structure(list(name = list("statistics")), .Names = "name", id = "421")), .Names = c("name", 
"department", "department"), id = "D95"), teacher = structure(list(
    name = list("Adam"), department = structure(list(name = list(
        "English")), .Names = "name", id = "412")), .Names = c("name", 
"department"), id = "D73"), teacher = structure(list(name = list(
    "Kevin"), department = structure(list(name = list("Chinese")), .Names = "name", id = "201")), .Names = c("name", 
"department"), id = "D101"), teacher = structure(list(name = list(
    "Nana"), department = structure(list(name = list("Science")), .Names = "name", id = "205")), .Names = c("name", 
"department"), id = "D58"), teacher = structure(list(name = list(
    "Nelson"), department = structure(list(name = list("Music")), .Names = "name", id = "370")), .Names = c("name", 
"department"), id = "D14"), teacher = structure(list(name = list(
    "Esther"), department = structure(list(name = list("Medicine")), .Names = "name", id = "361")), .Names = c("name", 
"department"), id = "D28"), teacher = structure(list(name = list(
    "Mia"), department = structure(list(name = list("Chemistry")), .Names = "name", id = "326")), .Names = c("name", 
"department"), id = "D17"), teacher = structure(list(name = list(
    "Jack"), department = structure(list(name = list("German")), .Names = "name", id = "306")), .Names = c("name", 
"department"), id = "D80"), teacher = structure(list(name = list(
    "Tom"), department = structure(list(name = list("French")), .Names = "name", id = "360")), .Names = c("name", 
"department"), id = "D53"), teacher = structure(list(name = list(
    "Allen"), department = structure(list(name = list("Spanish")), .Names = "name", id = "322")), .Names = c("name", 
"department"), id = "D18")), .Names = c("teacher", "teacher", 
"teacher", "teacher", "teacher", "teacher", "teacher", "teacher", "teacher", 
"teacher"))

This was a bit crazy to construct, but I think it more or less conforms with the desired output posted in a previous version of the post. I had to use sapply within the lapply function to pull out the second ID variable.

do.call(rbind,             # rbind list of data.frames output by lapply
        lapply(unname(z2), # loop through list, first drop outer names
               function(x) { # begin lapply function
                 temp <- unlist(x) # unlist inner elements to a vector
                 data.frame(name=temp[names(temp) == "name"], # subset on names
                            dept=temp[names(temp) == "department.name"], # subset on dept
                            id=attr(x, "id"), # extract one id
                            id2=unlist(sapply(x, attr, "id")), # extract other id
                            row.names=NULL) # end data.frame function, drop row.names
                            })) # end lapply function, lapply, and do.call

This returns

     name       dept   id id2
1    Mary       Math  D95 420
2    Mary statistics  D95 421
3    Adam    English  D73 412
4   Kevin    Chinese D101 201
5    Nana    Science  D58 205
6  Nelson      Music  D14 370
7  Esther   Medicine  D28 361
8     Mia  Chemistry  D17 326
9    Jack     German  D80 306
10    Tom     French  D53 360
11  Allen    Spanish  D18 322
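
For comparison, here is a more explicit base R sketch of the same idea. It is not the code above: it selects the department elements by name, so each teacher contributes one row per department. The helper name teacher_rows is hypothetical, the sample z2 contains only the first two teachers from the dput() output, and the sketch assumes every teacher has at least one department.

```r
# First two teachers copied from the question's dput() output.
z2 <- list(
  teacher = structure(
    list(name = list("Mary"),
         department = structure(list(name = list("Math")), id = "420"),
         department = structure(list(name = list("statistics")), id = "421")),
    id = "D95"),
  teacher = structure(
    list(name = list("Adam"),
         department = structure(list(name = list("English")), id = "412")),
    id = "D73")
)

# Hypothetical helper: one data frame per teacher, one row per department.
# Assumes each teacher has at least one department element.
teacher_rows <- function(teacher) {
  depts <- teacher[names(teacher) == "department"]  # one or more departments
  data.frame(
    name = teacher$name[[1]],                       # recycled across rows
    dept = unname(vapply(depts, function(d) d$name[[1]], character(1))),
    id   = attr(teacher, "id"),                     # teacher id, recycled
    id2  = unname(vapply(depts, attr, character(1), which = "id")),
    stringsAsFactors = FALSE
  )
}

# Bind the per-teacher data frames into one.
result <- do.call(rbind, lapply(unname(z2), teacher_rows))
print(result)
```

Because the number of rows follows the number of department elements, no inner loop over fixed positions such as 2:3 is needed.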

The structure of the second list differs from the initial example in two ways. First, one level of nesting is removed: the depth of the new list is one less than that of the initial example, as if you had provided z2[[1]] instead of the initial list. Second, the second example is missing what I initially called id (values such as D95 and D101).

With a bit of manipulation of the original code, I got this to work with

lapply(list(z3), # wrap z3 in a list to restore the expected depth
       function(x) { # begin lapply function
           temp <- unlist(x) # unlist inner elements to a vector
           data.frame(name=temp[names(temp) == "name"], # subset on names
                      dept=temp[names(temp) == "department.name"], # subset on dept
                      # id=attr(x, "id"), # teacher-level id is absent here
                      id2=unlist(sapply(x, attr, "id")), # extract department id
                      row.names=NULL) # end data.frame function, drop row.names
       })

The changes to the code address the two differences mentioned above. First, z2 is replaced by list(z3) as the first argument to lapply, which constructs the needed list depth. Second, the line id=attr(x, "id"), in the inner function has been commented out, because id does not exist in the second example.
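
The depth point can be checked with a small sketch. The real z3 is not shown in the post, so the z3 below is a hypothetical stand-in: a single teacher element without a teacher-level id. The helper list_depth is also hypothetical.

```r
# Hypothetical stand-in for z3: one teacher element, no teacher-level id.
z3 <- list(
  name = list("Mary"),
  department = structure(list(name = list("Math")), id = "420"),
  department = structure(list(name = list("statistics")), id = "421")
)

# Hypothetical helper: nesting depth of a list (0 for a non-list).
list_depth <- function(x) {
  if (!is.list(x) || length(x) == 0) return(0L)
  1L + max(vapply(x, list_depth, integer(1)))
}

depth_single  <- list_depth(z3)        # depth of the single element
depth_wrapped <- list_depth(list(z3))  # one level deeper, like z2

# lapply over z3 visits its pieces (name, department, department);
# lapply over list(z3) visits the whole teacher element once.
n_direct  <- length(lapply(z3, identity))
n_wrapped <- length(lapply(list(z3), identity))
```

Wrapping with list() makes the inner function receive a whole teacher element as x, which is what the unlist() and sapply() calls expect.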

XML is generally really easy to deal with in R.

Use library(XML) and library(plyr) to avoid having to write loops.

Step one is to read in the XML:

I saved your sample XML as a .xml file called Demo.xml. You can also pass xmlParse a URL.

rawXML <- xmlParse("Demo.xml")

Then convert XML to list:

xmlList <- xmlToList(rawXML)

Then convert the list to a data frame with plyr:

df1 <- ldply(xmlList, data.frame)

This is the general process; if you provide sample data, we can refine it to match your specific use case.

Here's the resulting summary output. Is this what you're looking for?

 str(df1)
'data.frame':   4 obs. of  12 variables:
 $ .id                        : chr  "code" "note" "teacherList" ".attrs"
 $ X..i..                     : Factor w/ 2 levels "1","success": 1 2 NA 2
 $ teacher.name               : Factor w/ 1 level "Mary": NA NA 1 NA
 $ teacher.department.name    : Factor w/ 1 level "Math": NA NA 1 NA
 $ teacher.department..attrs  : Factor w/ 1 level "420": NA NA 1 NA
 $ teacher.department.name.1  : Factor w/ 1 level "Statistics": NA NA 1 NA
 $ teacher.department..attrs.1: Factor w/ 1 level "421": NA NA 1 NA
 $ teacher..attrs             : Factor w/ 1 level "D95": NA NA 1 NA
 $ teacher.name.1             : Factor w/ 1 level "Adam": NA NA 1 NA
 $ teacher.department.name.2  : Factor w/ 1 level "English": NA NA 1 NA
 $ teacher.department..attrs.2: Factor w/ 1 level "412": NA NA 1 NA
 $ teacher..attrs.1           : Factor w/ 1 level "D73": NA NA 1 NA
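
The str() output above is wide, with one column per nested value, rather than one row per teacher and department. If the long layout is what is needed, a minimal sketch with xml2 (which the question already loads) could query the nodes directly with XPath instead of converting to a list first. This swaps in a different technique from the XML and plyr steps above. The variable names are illustrative, and the XML string is the sample from the question.

```r
library(xml2)

# Sample XML from the question, passed to read_xml() as a literal string.
doc <- read_xml('<result status="success">
  <code>1</code>
  <note>success</note>
  <teacherList>
    <teacher id="D95">
      <name>Mary</name>
      <department id="420"><name>Math</name></department>
      <department id="421"><name>Statistics</name></department>
    </teacher>
    <teacher id="D73">
      <name>Adam</name>
      <department id="412"><name>English</name></department>
    </teacher>
  </teacherList>
</result>')

teachers <- xml_find_all(doc, ".//teacher")

# One data frame per teacher, one row per department element.
rows <- lapply(teachers, function(teacher) {
  depts <- xml_find_all(teacher, "./department")
  data.frame(
    name = xml_text(xml_find_first(teacher, "./name")),  # recycled
    dept = xml_text(xml_find_first(depts, "./name")),
    id   = xml_attr(teacher, "id"),                      # recycled
    id2  = xml_attr(depts, "id"),
    stringsAsFactors = FALSE
  )
})

df_long <- do.call(rbind, rows)
print(df_long)
```

This sketch assumes every teacher has at least one department; a teacher with none would need separate handling before the call to data.frame.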
