Applying lapply to a list of data frames in R

> df1 <- data.frame(A = 1:10, B= 11:20)
> df2 <- data.frame(A = 21:30, B = 31:40)
> ddata <- list(df1,df2)

My objective is to compute the correlation between column A and column B for each data frame in the list, i.e.

cor (ddata[[1]]$A,ddata[[1]]$B)
cor (ddata[[2]]$A,ddata[[2]]$B)

I am using lapply for this, but I am doing something wrong. Please help.

lapply(ddata, cor)

The issue with your code is that when you call cor on a whole data.frame (with all numeric columns), it returns a correlation matrix containing the pairwise correlations of all columns, with each value on the diagonal being a column's correlation with itself (which is always 1). This isn't immediately apparent with your sample data, since cor(A,B) == cor(B,A) == cor(A,A) == cor(B,B) == 1 for both of your data.frames: in each one, B is an exact linear function of A. It is clearer in the following example:

df5 <- data.frame(A=rnorm(10),B=rnorm(10),C=rnorm(10))
R> cor(df5)
           A           B          C
A 1.00000000  0.05131293  0.6173047
B 0.05131293  1.00000000 -0.1312331
C 0.61730466 -0.13123314  1.0000000
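The all-ones result for the question's data can be checked directly. The sketch below rebuilds df1 from the question and confirms that every entry of its correlation matrix equals 1; the all.equal check and the names m and all_ones are additions for illustration only.

```r
# df1 as defined in the question: B is an exact linear function of A (B = A + 10)
df1 <- data.frame(A = 1:10, B = 11:20)

# cor() on the whole data frame returns a 2 x 2 correlation matrix
m <- cor(df1)
print(m)

# Every entry is 1 (up to floating-point tolerance), so the matrix
# is easy to mistake for a single repeated correlation value
all_ones <- isTRUE(all.equal(unname(m), matrix(1, nrow = 2, ncol = 2)))
print(all_ones)
```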

Regardless, I think you were looking for a single correlation value rather than a correlation matrix. This can be achieved in a couple of different ways: accessing the data.frame's columns using either x[,1] & x[,2] or x[[1]] & x[[2]].

There is also a third syntax option, which likewise yields a single correlation value but, unlike the two cases above, returns it with the matrix class. This is accessing the columns using x[1] & x[2], since single brackets (with no comma) yield a one-column data.frame.

For your purposes, any of the three methods noted above should be acceptable. As long as you pass cor two objects, whether they are (atomic) numeric vectors (case [, ] and case [[ ]]) or single-column data.frames (case [ ]), the function will evaluate as cor(x, y, ...) and return a single correlation value. The (subtle) difference between the first two methods and the third is the class of the return value, numeric (atomic) for the former and matrix for the latter, but this is most likely an inconsequential detail in the big picture. (In R 4.0.0 and later, class() on a matrix returns both "matrix" and "array", so the class output shown below will look slightly different.)
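The difference between the three subsetting forms can be verified on a small data frame. This is only a sketch: the data frame d and the names r_mat and r_num are hypothetical and do not appear in the original post.

```r
# Hypothetical example data, used only for illustration
d <- data.frame(A = c(1, 2, 3, 4), B = c(2, 1, 4, 3))

# Single brackets without a comma keep the data.frame structure
is.data.frame(d[1])

# With a comma, or with double brackets, the column comes out as a plain vector
is.numeric(d[, 1])
is.numeric(d[[1]])

# Two one-column data frames in: the result is a matrix.
# Two vectors in: the result is a plain number.
r_mat <- cor(d[1], d[2])
r_num <- cor(d[[1]], d[[2]])
is.matrix(r_mat)
is.matrix(r_num)

# The correlation value itself is the same either way
all.equal(as.numeric(r_mat), r_num)
```

If the matrix form is ever a nuisance, as.numeric() (as used in the last line) drops it.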


Let me summarize this with a couple of examples, using this data:

set.seed(123)
df3 <- data.frame(
  A=rnorm(10),
  B=rnorm(10))
##
set.seed(321)
df4 <- data.frame(
  A=rnorm(10),
  B=rnorm(10))
##
dflist <- list(df3,df4)

A. Result type is a correlation matrix; result class is matrix:

R> class(cor(df3)); cor(df3)
[1] "matrix"
          A         B
A 1.0000000 0.5776151
B 0.5776151 1.0000000

B. Result type is a single correlation value; result class is matrix:

R> class(cor(df3[1],df3[2])); cor(df3[1],df3[2])
[1] "matrix"
          B
A 0.5776151

C. Result type is a single correlation value; result class is numeric:

R> class(cor(df3[,1],df3[,2])); cor(df3[,1],df3[,2])
[1] "numeric"
[1] 0.5776151

D. Result type is a single correlation value; result class is numeric:

R> class(cor(df3[[1]],df3[[2]])); cor(df3[[1]],df3[[2]])
[1] "numeric"
[1] 0.5776151

Similarly, the following four functions fA - fD correspond to the cases A - D described above:

fA <- function(y) {
  res <- lapply(y,cor)
  message(paste0("Element class: ",class(res[[1]])))
  res
}
##
fB <- function(y) {
  res <- lapply(y, function(x) {
    cor(x[1],x[2])
  })
  message(paste0("Element class: ",class(res[[1]])))
  res
}
##
fC <- function(y) {
  res <- lapply(y, function(x) {
    cor(x[,1],x[,2])
  })
  message(paste0("Element class: ",class(res[[1]])))
  res
}
##
fD <- function(y) {
  res <- lapply(y, function(x) {
    cor(x[[1]],x[[2]])
  })
  message(paste0("Element class: ",class(res[[1]])))
  res
}

Running them on the object dflist gives:

R> fA(dflist)
Element class: matrix
[[1]]
          A         B
A 1.0000000 0.5776151
B 0.5776151 1.0000000

[[2]]
           A          B
A  1.0000000 -0.1816951
B -0.1816951  1.0000000

##
R> fB(dflist)
Element class: matrix
[[1]]
          B
A 0.5776151

[[2]]
           B
A -0.1816951

##
R> fC(dflist)
Element class: numeric
[[1]]
[1] 0.5776151

[[2]]
[1] -0.1816951

##
R> fD(dflist)
Element class: numeric
[[1]]
[1] 0.5776151

[[2]]
[1] -0.1816951
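If a plain numeric vector is more convenient than a list, one option is vapply() in place of lapply(). The sketch below is not part of the original answer. It rebuilds dflist as defined above and refers to the columns by name (A and B), which assumes every data frame in the list has columns with those names; cors and same are illustrative names.

```r
# Rebuild the example list from the answer
set.seed(123)
df3 <- data.frame(A = rnorm(10), B = rnorm(10))
set.seed(321)
df4 <- data.frame(A = rnorm(10), B = rnorm(10))
dflist <- list(df3, df4)

# vapply() works like lapply() but checks that each result is a single
# numeric value and returns an atomic vector rather than a list
cors <- vapply(dflist, function(x) cor(x$A, x$B), numeric(1))
print(cors)

# Same values as the lapply() version (case D), just unlisted
same <- isTRUE(all.equal(
  cors,
  unlist(lapply(dflist, function(x) cor(x[[1]], x[[2]])))
))
print(same)
```

Because the third argument, numeric(1), declares the expected shape of each result, vapply() stops with an error if some element of the list produces anything other than one number, for example a correlation matrix.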
