Turn Multiple Uneven Nested Lists Into A DataFrame in R

I am trying to get to grips with R, and as an experiment I thought I would play around with some cricket data. In its rawest format it is a yaml file, which I turned into an R object using the yaml R package.

However, I now have a number of nested lists of uneven length that I want to turn into a data frame in R. I have tried a few methods, such as writing loops to parse the data and using some of the functions in the tidyr package, but I can't seem to get it to work nicely.

Does anyone know the best way to tackle this? Replicating the data structure would be difficult here, because the complexity comes from the multiple nested lists and the unevenness of their length (which would make for a very long code block). However, you can find the raw yaml data here: http://cricsheet.org/downloads/ (I was using the ODI internationals).

Thanks in advance!

Update: I have tried the following.

1) Using tidyr (separate)

d <- unnest(balls)
Name <- c("Batsman","Bowler","NonStriker","RunsBatsman","RunsExtras","RunsTotal","WicketFielder","WicketKind","PlayerOut")
a <- separate(d, x, Name, sep = ",",extra = "drop")

This uses the tidyr package to return a single-column data frame, which I then try to separate. However, the problem is that extras variables sometimes appear in the middle of some rows and not others, which throws off the separation.

2) Creating vectors

ballsVector <- unlist(balls[[2]],use.names = FALSE)
names_vector <- c("Batsman","Bowler","NonStriker","RunsBatsman","RunsExtras","RunsTotal")
names(ballsVector) <- c(names_vector)
ballsMatrix <- matrix(ballsVector, nrow = 1, byrow = TRUE)
colnames(ballsMatrix) <- names_vector

The problem here is that the resulting vectors are uneven in length and therefore can't be combined into a data frame. This approach also suffers from the sporadic variables in the middle of the dataset (as above).

Caveat: not a complete answer; this is an attempt to arrange the innings data.

plyr::rbind.fill may offer a solution to binding rows with a different number of columns.

I don't use tidyr, but below is some rough code to get the innings data into a data.frame. You could then loop this over all the yaml files in the directory.

# Download and unzip data
temp <- tempfile()
download.file("http://cricsheet.org/downloads/odis.zip", temp, mode = "wb") # "wb" keeps the zip intact on Windows
tmp <- unzip(temp)

# Create lists - use first game
library(yaml)
raw_dat <- yaml.load_file(tmp[[2]])
#names(raw_dat)

# Function to process one innings list into a data frame
p_fun <- function(X) {
  team = X[[1]][["team"]]

  # Function to process the list sub-element that represents a single delivery
  fn <- function(...) {
    tmp = unlist(...)
    # Keep only the digits of the delivery label as the ball id
    tmp = data.frame(ball=gsub("[^0-9]", "", names(tmp))[1], t(tmp))
    # Drop the digits from the column names so they match across deliveries
    colnames(tmp) = gsub("[0-9]", "", colnames(tmp))
    tmp
  }

  # Loop over all deliveries in the innings
  lst = lapply(X[[1]][["deliveries"]], fn)

  cbind(team, plyr::rbind.fill(lst))
}

# Loop over each innings
dat <- plyr::rbind.fill(lapply(raw_dat$innings, p_fun))



Some explanation

The list structure and subsetting it. To get an idea of the structure of the list, use

str(raw_dat) # but this gives a really long list of data

You can truncate this to make it a bit more useful

str(raw_dat, max.level = 3)
length(raw_dat)

So there are three main list elements: meta, info, and innings. You can also see this with

names(raw_dat)

To access the meta data, you can use

raw_dat$meta
#or using `[[1]]` to access the first element of the list (see ?'[[')
raw_dat[[1]]
#and get sub-elements by either
raw_dat$meta$data_version
raw_dat[[1]][[1]] # you can also use the names of the list elements, e.g. [["data_version"]]

The main data is in the innings element.

str(raw_dat$innings, max.level = 3)

Look at the names in the list element

lapply(raw_dat$innings, names)
lapply(raw_dat$innings[[1]], names)

There are two list elements, each with sub-elements. You can access these as

raw_dat$innings[[1]][[1]][["team"]] # raw_dat$innings[[1]][["1st innings"]][["team"]]
raw_dat$innings[[2]][[1]][["team"]] # raw_dat$innings[[2]][["2nd innings"]][["team"]]
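
For readers without the cricsheet files to hand, the nesting described above can be mimicked with a small hand-built list. This is only a sketch: the team names, player names and values are made up, and the field layout (team, deliveries, runs, wicket) is assumed from the output shown in this answer rather than copied from a real file. It shows where the unevenness comes from, since some deliveries carry extra sub-elements such as a wicket.

```r
# Toy stand-in for raw_dat$innings (all values are hypothetical)
innings <- list(
  list(`1st innings` = list(
    team = "Team A",
    deliveries = list(
      list(`0.1` = list(batsman = "P1", bowler = "P9",
                        runs = list(batsman = 0, extras = 0, total = 0))),
      list(`0.2` = list(batsman = "P1", bowler = "P9",
                        runs = list(batsman = 0, extras = 0, total = 0),
                        wicket = list(kind = "bowled", player_out = "P1")))
    )
  )),
  list(`2nd innings` = list(
    team = "Team B",
    deliveries = list(
      list(`0.1` = list(batsman = "P5", bowler = "P2",
                        runs = list(batsman = 4, extras = 0, total = 4)))
    )
  ))
)

# Outer loop over innings, inner loop over deliveries:
# count how many fields each delivery flattens to
fields_per_delivery <- lapply(innings, function(inn) {
  vapply(inn[[1]][["deliveries"]], function(d) length(unlist(d)), integer(1))
})
fields_per_delivery
```

The second delivery of the first innings flattens to two more fields than the others, which is exactly the kind of mismatch that rbind.fill has to absorb.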

The above function parsed the deliveries data in raw_dat$innings. To see what it does, work through it from the inside.

Use one record to see how it works. Note that the lapply with p_fun loops over raw_dat$innings[[1]] and raw_dat$innings[[2]], so this is the outer loop; the lapply with fn loops through the deliveries within an innings, so it is the inner loop.

X <- raw_dat$innings[[1]] 
tmp <- X[[1]][["deliveries"]][[1]]
tmp

#create a named vector
tmp <- unlist(tmp)
tmp
#      0.1.batsman       0.1.bowler  0.1.non_striker 0.1.runs.batsman  0.1.runs.extras   0.1.runs.total 
#        "IR Bell"       "DW Steyn"       "MJ Prior"              "0"              "0"              "0" 

To use rbind.fill, the elements to bind together need to be data.frames. We also want to remove the leading numbers (the delivery labels) from the names, as otherwise we will have lots of uniquely named columns.

# this regex removes all non-numeric characters from the string
# you could then split this number into over and delivery
gsub("[^0-9]", "", names(tmp)) 

# this regex removes all numeric characters from the string -
# allowing consistent names across all the balls / deliveries
# (if i was better at regex I would have also removed the leading dots)
gsub("[0-9]", "", names(tmp))
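
If you do want to strip the whole delivery prefix, dots included, one option is to anchor the pattern at the start of the name. The sketch below is a guess at what such a regex could look like, not part of the original answer. The vector nm is typed in by hand to mirror the names printed above, and the optional X allows for the prefix that data.frame adds to names starting with a digit.

```r
# Hand-typed copy of the names shown above (assumed, not read from a file)
nm <- c("0.1.batsman", "0.1.bowler", "0.1.non_striker",
        "0.1.runs.batsman", "0.1.runs.extras", "0.1.runs.total")

# Remove an optional "X", then "<over>.<delivery>." from the start of each name
clean_names <- sub("^X?[0-9]+\\.[0-9]+\\.", "", nm)
clean_names

# Pull out the delivery label and split it into over and delivery
ball <- sub("^X?([0-9]+\\.[0-9]+)\\..*$", "\\1", nm[1])
parts <- as.integer(strsplit(ball, ".", fixed = TRUE)[[1]])
over <- parts[1]
delivery <- parts[2]
```

The walkthrough that follows keeps the original gsub calls, so its printed column names still show the X.. prefix.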

So for the first delivery in the first innings we have

tmp = data.frame(ball=gsub("[^0-9]", "", names(tmp))[1], t(tmp))
colnames(tmp) = gsub("[0-9]", "", colnames(tmp))
tmp
#   ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1   01    IR Bell  DW Steyn       MJ Prior               0              0             0

To see how the lapply works, use the first three deliveries (you will need to define the function fn in your workspace first)

lst = lapply(X[[1]][["deliveries"]][1:3], fn )
lst
# [[1]]
#   ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1   01    IR Bell  DW Steyn       MJ Prior               0              0             0
# 
# [[2]]
#   ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1   02    IR Bell  DW Steyn       MJ Prior               0              0             0
# 
# [[3]]
#   ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1   03    IR Bell  DW Steyn       MJ Prior               3              0             3

So we end up with a list element for every delivery within an innings. We then use rbind.fill to create one data.frame.
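
One thing to watch: because unlist coerces everything in a delivery to character, the run columns come out as text (or as factors in older versions of R) rather than numbers. A possible follow-up step, which is not part of the original answer, is to let type.convert guess the column types. The data frame below is a hand-typed stand-in for a slice of the combined data, and the data frame method of type.convert is assumed to be available, as it is in reasonably recent versions of R.

```r
# Hand-typed stand-in for part of the combined data (values copied from the output above)
dat_small <- data.frame(
  X..batsman = c("IR Bell", "IR Bell", "IR Bell"),
  X..runs.total = c("0", "0", "3"),
  stringsAsFactors = FALSE
)

# Convert columns that look numeric; as.is = TRUE keeps text as character
converted <- type.convert(dat_small, as.is = TRUE)
str(converted)
```

After this step the run columns can be summed or averaged directly.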


If I were going to parse every yaml file, I would use a loop.

Use the first three records as an example, and also add the match date.

tmp <- unzip(temp)[2:4]

all_raw_dat <- vector("list", length=length(tmp))

for(i in seq_along(tmp)) {
      d = yaml.load_file(tmp[i])
      all_raw_dat[[i]] <- cbind(date=d$info$date, plyr::rbind.fill(lapply(d$innings, p_fun)))
}

Then combine the data frames in all_raw_dat with rbind.fill, for example plyr::rbind.fill(all_raw_dat).


Q1 from comments

A small example with rbind.fill

a <- data.frame(x=1, y=2)
b <- data.frame(x=2, z=1)

rbind(a, b) # error, as the names don't match
plyr::rbind.fill(a, b)

rbind.fill doesn't go back and add or update the input data frames with the extra columns (a still doesn't have a column z). Think of it as creating an empty data frame whose columns are the unique column names found across the list of data frames, unique(c(names(a), names(b))). The values are then filled into each row where possible, and left missing (NA) otherwise.
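
To make that description concrete, here is a rough base R sketch of the same idea. It is not how plyr implements rbind.fill; the helper name fill_bind is made up for illustration, and it ignores details such as factors and matrix columns.

```r
# Hypothetical helper: bind data frames by row, padding missing columns with NA
fill_bind <- function(dfs) {
  # Union of all column names, in order of first appearance
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(df) {
    # Add each column this data frame lacks, filled with NA
    for (col in setdiff(all_cols, names(df))) {
      df[[col]] <- NA
    }
    # Put the columns in a common order so rbind can line them up
    df[all_cols]
  })
  do.call(rbind, padded)
}

a <- data.frame(x = 1, y = 2)
b <- data.frame(x = 2, z = 1)
res <- fill_bind(list(a, b))
res
```

The inputs a and b are left untouched; only the combined result gains the extra columns.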
