Using loop to extract data from list in R

I am pretty new to R. Yesterday I scraped a website that requires a login; the page is in XML format, like below.

<result status="success">
  <code>1</code>
  <note>success</note>
  <teacherList>
    <teacher id="D95">
      <name>Mary</name>
      <department id="420">
        <name>Math</name>
      </department>
      <department id="421">
        <name>Statistics</name>
      </department>
    </teacher>
    <teacher id="D73">
      <name>Adam</name>
      <department id="412">
        <name>English</name>
      </department>
    </teacher>
  </teacherList>
</result> 

I then converted the XML to a list.

library(XML)
library(rvest)
library(plyr)
library(dplyr)
library(httr)
library(pipeR)
library(xml2)

url.address <- "http://xxxxxxxxxxxxxxxxx"
session <- html_session(url.address)
form <- html_form(read_html(url.address))[[1]]
filled_form <- set_values(form,
                          "userid" = "id",
                          "Password" = "password")
s <- submit_form(session, filled_form)
z <- read_xml(s$response)  # parse the response body as XML
z1 <- as_list(z)           # convert the XML document to a nested list
z2 <- z1$teacherList

Now I need to extract data from the list and turn it into a data frame. Note that some people belong to two departments, while others belong to only one. Part of the list z2 looks like this:

z2[[1]]

$name
$name[[1]]
[1] "Mary"


$department
$department$name
$department$name[[1]]
[1] "Math"


attr(,"id")
[1] "420"

$department
$department$name
$department$name[[1]]
[1] "statistics"


attr(,"id")
[1] "421"

attr(,"id")
[1] "D95"

When I extracted them one by one, it took too long:

attr(z2[[1]], "id")

"D95"

z2[[1]][[1]][[1]]

"Mary"

z2[[1]][[2]][[1]][[1]]

"Math"

attr(z2[[1]][[2]], "id")

"420"

z2[[1]][[3]][[1]][[1]]

"statistics"

attr(z2[[1]][[3]], "id")

"421"

attr(z2[[2]], "id")

"D73"

z2[[2]][[1]][[1]]

"Adam"

z2[[2]][[2]][[1]][[1]]

"English"

attr(z2[[2]][[2]], "id")

"412"

So I tried to write a loop:

for (x in 1:2){
  for (y in 2:3){
  a <- attr(z2[[x]],"id")
  b <- z2[[x]][[1]][[1]]
  d <- z2[[x]][[y]][[1]][[1]]
  e <- attr(z2[[x]][[y]],"id")
  g <- cbind(print(a),print(b),print(d),print(e))
  }}

but it doesn't work at all, since some people belong to only one department. The result I expected:

[image of the expected result; not preserved]

Any advice would be appreciated!

dput(head(z2, 10))

structure(list(teacher = structure(list(name = list("Mary"), 
    department = structure(list(name = list("Math")), .Names = "name", id = "420"), 
    department = structure(list(name = list("statistics")), .Names = "name", id = "421")), .Names = c("name", 
"department", "department"), id = "D95"), teacher = structure(list(
    name = list("Adam"), department = structure(list(name = list(
        "English")), .Names = "name", id = "412")), .Names = c("name", 
"department"), id = "D73"), teacher = structure(list(name = list(
    "Kevin"), department = structure(list(name = list("Chinese")), .Names = "name", id = "201")), .Names = c("name", 
"department"), id = "D101"), teacher = structure(list(name = list(
    "Nana"), department = structure(list(name = list("Science")), .Names = "name", id = "205")), .Names = c("name", 
"department"), id = "D58"), teacher = structure(list(name = list(
    "Nelson"), department = structure(list(name = list("Music")), .Names = "name", id = "370")), .Names = c("name", 
"department"), id = "D14"), teacher = structure(list(name = list(
    "Esther"), department = structure(list(name = list("Medicine")), .Names = "name", id = "361")), .Names = c("name", 
"department"), id = "D28"), teacher = structure(list(name = list(
    "Mia"), department = structure(list(name = list("Chemistry")), .Names = "name", id = "326")), .Names = c("name", 
"department"), id = "D17"), teacher = structure(list(name = list(
    "Jack"), department = structure(list(name = list("German")), .Names = "name", id = "306")), .Names = c("name", 
"department"), id = "D80"), teacher = structure(list(name = list(
    "Tom"), department = structure(list(name = list("French")), .Names = "name", id = "360")), .Names = c("name", 
"department"), id = "D53"), teacher = structure(list(name = list(
    "Allen"), department = structure(list(name = list("Spanish")), .Names = "name", id = "322")), .Names = c("name", 
"department"), id = "D18")), .Names = c("teacher", "teacher", 
"teacher", "teacher", "teacher", "teacher", "teacher", "teacher", "teacher", 
"teacher"))

This was a bit crazy to construct, but I think it more or less conforms to the desired output posted in a previous version of the post. I had to use sapply within the lapply function to pull out the second ID variable.

do.call(rbind,             # rbind list of data.frames output by lapply
        lapply(unname(z2), # loop through list, first drop outer names
               function(x) { # begin lapply function
                 temp <- unlist(x) # unlist inner elements to a vector
                 data.frame(name=temp[names(temp) == "name"], # subset on names
                            dept=temp[names(temp) == "department.name"], # subset on dept
                            id=attr(x, "id"), # extract one id
                            id2=unlist(sapply(x, attr, "id")), # extract other id
                            row.names=NULL) # end data.frame function, drop row.names
                            })) # end lapply function, lapply, and do.call

This returns

     name       dept   id id2
1    Mary       Math  D95 420
2    Mary statistics  D95 421
3    Adam    English  D73 412
4   Kevin    Chinese D101 201
5    Nana    Science  D58 205
6  Nelson      Music  D14 370
7  Esther   Medicine  D28 361
8     Mia  Chemistry  D17 326
9    Jack     German  D80 306
10    Tom     French  D53 360
11  Allen    Spanish  D18 322
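
Since the question asked for a loop, the same extraction can also be sketched with an explicit for loop. The idea is to select child elements by name rather than by position, so teachers with one department and teachers with two are handled alike. This is only a sketch: the two-teacher z2 below is rebuilt by hand from the dput() output above, and the names rows, teacher, depts, and result are illustrative.

```r
# Hand-built two-teacher subset of z2, copied from the dput() output above
z2 <- list(
  teacher = structure(
    list(name = list("Mary"),
         department = structure(list(name = list("Math")), id = "420"),
         department = structure(list(name = list("statistics")), id = "421")),
    id = "D95"),
  teacher = structure(
    list(name = list("Adam"),
         department = structure(list(name = list("English")), id = "412")),
    id = "D73")
)

rows <- list()
for (teacher in z2) {
  # Keep only the children named "department", however many there are
  depts <- teacher[names(teacher) == "department"]
  for (d in depts) {
    # One row per teacher/department pair
    rows[[length(rows) + 1]] <- data.frame(
      name = teacher$name[[1]],
      dept = d$name[[1]],
      id   = attr(teacher, "id"),
      id2  = attr(d, "id"),
      stringsAsFactors = FALSE
    )
  }
}

# Stack the one-row data frames into a single data frame
result <- do.call(rbind, rows)
print(result)
```

Growing a list inside a loop is fine for a few hundred teachers; for much larger inputs the lapply version above is likely the better fit.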

The structure of the second list differs in a number of ways from the initial example. First, one level of nesting is removed. That is, the depth of the new list is one less than that of the initial example, as if you had provided z2[[1]] as the initial list. Second, the second example is missing what I initially called id (values such as D95 and D101).

With a bit of manipulation of the original code, I got this to work with

lapply(list(z3), # wrap z3 in a list to restore the expected depth
       function(x) { # begin lapply function
           temp <- unlist(x) # unlist inner elements to a vector
           data.frame(name=temp[names(temp) == "name"], # subset on names
                      dept=temp[names(temp) == "department.name"], # subset on dept
                      # id=attr(x, "id"), # extract one id
                      id2=unlist(sapply(x, attr, "id")), # extract other id
                      row.names=NULL) # end data.frame function, drop row.names
       })

The changes to the code address what I mentioned before: z2 is replaced by list(z3) as the first argument to lapply, which constructs the needed list depth. Also, the line id=attr(x, "id"), in the inner function has been commented out, as id does not exist in the second example.

XML is generally really easy to deal with in R.

Use library(XML) and library(plyr) to avoid having to write loops:

Step one is to read in the XML:

I saved your sample XML as a .xml file called Demo.xml. You can also pass xmlParse a URL.

rawXML <- xmlParse("Demo.xml")

Then convert the XML to a list:

xmlList <- xmlToList(rawXML)

Then convert the list to a data frame with plyr:

df1 <- ldply(xmlList, data.frame)

This is the general process; if you provide sample data, we can refine it to match your specific use case.

Here's the resulting summary output. Is this what you're looking for?

 str(df1)
'data.frame':   4 obs. of  12 variables:
 $ .id                        : chr  "code" "note" "teacherList" ".attrs"
 $ X..i..                     : Factor w/ 2 levels "1","success": 1 2 NA 2
 $ teacher.name               : Factor w/ 1 level "Mary": NA NA 1 NA
 $ teacher.department.name    : Factor w/ 1 level "Math": NA NA 1 NA
 $ teacher.department..attrs  : Factor w/ 1 level "420": NA NA 1 NA
 $ teacher.department.name.1  : Factor w/ 1 level "Statistics": NA NA 1 NA
 $ teacher.department..attrs.1: Factor w/ 1 level "421": NA NA 1 NA
 $ teacher..attrs             : Factor w/ 1 level "D95": NA NA 1 NA
 $ teacher.name.1             : Factor w/ 1 level "Adam": NA NA 1 NA
 $ teacher.department.name.2  : Factor w/ 1 level "English": NA NA 1 NA
 $ teacher.department..attrs.2: Factor w/ 1 level "412": NA NA 1 NA
 $ teacher..attrs.1           : Factor w/ 1 level "D73": NA NA 1 NA
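
The output above is wide, with one row per top-level element. If one row per teacher/department pair is wanted instead, a possible sketch is to query the XML directly with XPath using the xml2 package, which the question already loads. The inline XML string is the sample from the question; the names doc, teachers, rows, and df are illustrative, and the sketch assumes every teacher has at least one department.

```r
library(xml2)

# Sample XML from the question, inlined so the example is self-contained
doc <- read_xml('
<result status="success">
  <code>1</code>
  <note>success</note>
  <teacherList>
    <teacher id="D95">
      <name>Mary</name>
      <department id="420"><name>Math</name></department>
      <department id="421"><name>Statistics</name></department>
    </teacher>
    <teacher id="D73">
      <name>Adam</name>
      <department id="412"><name>English</name></department>
    </teacher>
  </teacherList>
</result>')

teachers <- xml_find_all(doc, "//teacher")

rows <- lapply(teachers, function(t) {
  # All departments of this teacher (assumed to be one or more)
  depts <- xml_find_all(t, "./department")
  data.frame(
    name = xml_text(xml_find_first(t, "./name")),     # recycled per department
    dept = xml_text(xml_find_first(depts, "./name")),
    id   = xml_attr(t, "id"),
    id2  = xml_attr(depts, "id"),
    stringsAsFactors = FALSE
  )
})

df <- do.call(rbind, rows)
print(df)
```

Working on the parsed document rather than on the converted list avoids the positional indexing that made the original loop break for teachers with a single department.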
