R loop through columns in list of data frames

Suppose the following data frame (in reality my data frame has thousands of rows):

year<-c(2010,2010,2010,2011,2011,2011,2012,2012,2013,2013)
a1<-rnorm(10)
a2<-rnorm(10)
b1<-rnorm(10)
b2<-rnorm(10)
c1<-rnorm(10)
c2<-rnorm(10)

#combine the vectors into the data frame used below
df<-data.frame(year, a1, a2, b1, b2, c1, c2)

I used the following code to create a list consisting of multiple data frames, which splits the original data frame into subsets by year.

library(stringr) #provides str_c()

#split datasets into years
df.list<-split(df, df$year)

#Name of datasets df plus year
dfnames <- str_c("df", names(df.list))
names(df.list)<-dfnames

I want to apply the following loop to all data frames of the list:

#df_target is a new data frame that stores the results and j is the indicator for it:
df_target <- NULL
j <- 1

for(i in seq(2, 7, 2)) {
  df_target[[j]] <- (df[i]*df[i+1])/(sum(df[i+1]))
  j <- j+1
}

The code works fine for one data frame. However, I want to split the data frame into multiple data frames grouped by year and then loop through the columns of each one.

Thus, I use the following function to apply the loop mentioned above to all data frames from the list:

df_target <- NULL
j <- 1

fnc <- function(x){
  for(i in seq(2, 7, 2)) {
  df_target[[j]] <- (x[i]*x[i+1])/(sum(x[i+1]))
  j <- j+1
  }
}

sapply(df.list, fnc)

With this code, I don't get any error messages. However, the results for the data frames in the list are all NULL. What exactly am I doing wrong?

df_target should be a data frame containing columns a_new = (a1*a2)/sum(a2), b_new = (b1*b2)/sum(b2) and c_new = (c1*c2)/sum(c2), but for each year separately.

You need to define j and df_target inside the function, and specify what it should return (as written, it calculates df_target but doesn't return it):

fnc <- function(x){
  df_target <- NULL
  j <- 1
  for(i in seq(2, 7, 2)) {
  df_target[[j]] <- (x[i]*x[i+1])/(sum(x[i+1]))
  j <- j+1
  }
  return(df_target)
}
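
The reason the original version returns NULL can be seen in a stripped-down sketch. The names counter and bump are hypothetical and used only for this illustration: an assignment with <- inside a function creates a local copy, and a function whose last expression is a for loop returns NULL.

```r
# Hypothetical names (counter, bump), used only to illustrate the scoping issue.
counter <- 0

bump <- function() {
  counter <- counter + 1   # creates a local `counter`; the global one is untouched
  for (i in 1:3) {}        # a for loop evaluates to NULL, so the function returns NULL
}

result <- bump()
print(counter)          # the global value is unchanged
print(is.null(result))  # the function returned NULL
```

This is the same situation as in the question: df_target and j are modified only inside fnc, and nothing is returned.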

But keep in mind that this will output a matrix of lists: for each element of df.list that sapply passes to the function, you create a three-element list df_target, so the output will look like this in the console:

> sapply(df.list, fnc)
     df2010 df2011 df2012 df2013
[1,] List,1 List,1 List,1 List,1
[2,] List,1 List,1 List,1 List,1
[3,] List,1 List,1 List,1 List,1

But the object itself will look like this (shown as a screenshot in the original post, not reproduced here).

To get a cleaner output, we can set df_target to create a data frame with the values from each year:

fnc <- function(x){
  df_target <- as.data.frame(matrix(nrow=nrow(x), ncol=3))
  for(i in seq(2, 7, 2)) {
    df_target[,i/2] <- (x[i]*x[i+1])/(sum(x[i+1]))
  }
  return(df_target)
}

This returns a data frame per year, but if we use sapply we'll again get a matrix of lists, so it's better to define the function so that it already loops through every year:

fnc <- function(y){
  df_target.list <- list()
  k <- 1
  for(j in y){
    df_target <- as.data.frame(matrix(nrow=nrow(j), ncol=3))
    for(i in seq(2, 7, 2)) {
      df_target[,i/2] <- (j[i]*j[i+1])/(sum(j[i+1]))
    }
    df_target.list[[names(y)[k]]] <- df_target
    k <- k+1
  }
  return(df_target.list)
}

Output:

> fnc(df.list)
$df2010
           V1         V2          V3
1 -0.10971160 0.01688244 -0.16339367
2  0.05440564 0.57554210 -0.06803244
3  0.03185178 0.90598561 -0.68692401

$df2011
           V1           V2         V3
1 -0.43090055  0.007152131  0.3930606
2  0.15050644  0.329092942 -0.1367295
3  0.07336839 -0.423631930 -0.1504056

$df2012
         V1         V2         V3
1 0.5540294  0.4561862 0.09169914
2 0.1153931 -1.1311450 0.81853691

$df2013
          V1        V2        V3
1  0.4322934 0.5286973 0.2136495
2 -0.2412705 0.1316942 0.1455196
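
As an alternative sketch, the per-year function can be kept small and applied with lapply, which always returns a list and therefore avoids the matrix-of-lists simplification that sapply performs. The helper name weighted_cols and the column names a_new, b_new, c_new are assumptions for this example (the latter taken from the question's description), and the data are regenerated here with an arbitrary seed so the block runs on its own.

```r
set.seed(1)  # arbitrary seed, only to make the example reproducible
df <- data.frame(
  year = c(2010, 2010, 2010, 2011, 2011, 2011, 2012, 2012, 2013, 2013),
  a1 = rnorm(10), a2 = rnorm(10),
  b1 = rnorm(10), b2 = rnorm(10),
  c1 = rnorm(10), c2 = rnorm(10)
)
df.list <- split(df, df$year)
names(df.list) <- paste0("df", names(df.list))

# Hypothetical helper: computes (x1 * x2) / sum(x2) for each column pair
# of one year's data frame.
weighted_cols <- function(x) {
  first <- seq(2, 6, by = 2)   # positions of a1, b1, c1
  out <- lapply(first, function(i) x[[i]] * x[[i + 1]] / sum(x[[i + 1]]))
  names(out) <- c("a_new", "b_new", "c_new")
  as.data.frame(out)
}

# lapply keeps the list names and returns one data frame per year
res.list <- lapply(df.list, weighted_cols)
str(res.list, max.level = 1)
```

Using x[[i]] extracts each column as a plain numeric vector, so the result columns can be named directly instead of defaulting to V1, V2, V3.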

Here is a tidyverse solution. Try running this bit by bit so you can see what it does.

First it adds the rowid as a column so that unique rows can be identified later. Then it reshapes the data using pivot_longer to put the data into long format, and then pivot_wider to partially reverse this. Then the data are grouped by the variable letter (nameA) and the calculation is run within each group. This runs a loop internally.

library(tidyverse)
set.seed(123)
tibble(
  year = c(2010, 2010, 2010, 2011, 2011, 2011, 2012, 2012, 2013, 2013),
  a1 = rnorm(10),
  a2 = rnorm(10),
  b1 = rnorm(10),
  b2 = rnorm(10),
  c1 = rnorm(10),
  c2 = rnorm(10)
) %>% 
  rowid_to_column() %>% 
  pivot_longer(cols = -c(year, rowid), names_to = c("nameA", "name12"), names_pattern = "(\\w)(\\d)" ) %>% 
  pivot_wider(names_from = name12, values_from = value) %>% 
  group_by(nameA) %>% 
  mutate(j = `1` * `2` / (sum(`2`)))
#> # A tibble: 30 x 6
#> # Groups:   nameA [3]
#>    rowid  year nameA     `1`     `2`        j
#>    <int> <dbl> <chr>   <dbl>   <dbl>    <dbl>
#>  1     1  2010 a     -0.560   1.22   -0.329  
#>  2     1  2010 b     -1.07    0.426  -0.141  
#>  3     1  2010 c     -0.695   0.253  -0.0794 
#>  4     2  2010 a     -0.230   0.360  -0.0397 
#>  5     2  2010 b     -0.218  -0.295   0.0200 
#>  6     2  2010 c     -0.208  -0.0285  0.00268
#>  7     3  2010 a      1.56    0.401   0.299  
#>  8     3  2010 b     -1.03    0.895  -0.285  
#>  9     3  2010 c     -1.27   -0.0429  0.0245 
#> 10     4  2011 a      0.0705  0.111   0.00374
#> # … with 20 more rows

Created on 2020-10-26 by the reprex package (v0.3.0)
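
Note that the pipeline above groups by nameA only, so sum(`2`) is taken over all years. Since the question asks for the calculation per year, a possible variant is to add year to the grouping. This is a sketch under that assumption. The object names dat and res are introduced here only for illustration, and no output is shown because it was not part of the original reprex.

```r
library(dplyr)
library(tidyr)
library(tibble)

set.seed(123)
dat <- tibble(
  year = c(2010, 2010, 2010, 2011, 2011, 2011, 2012, 2012, 2013, 2013),
  a1 = rnorm(10),
  a2 = rnorm(10),
  b1 = rnorm(10),
  b2 = rnorm(10),
  c1 = rnorm(10),
  c2 = rnorm(10)
)

res <- dat %>%
  rowid_to_column() %>%
  pivot_longer(cols = -c(year, rowid),
               names_to = c("nameA", "name12"),
               names_pattern = "(\\w)(\\d)") %>%
  pivot_wider(names_from = name12, values_from = value) %>%
  group_by(year, nameA) %>%            # group by year as well as by letter
  mutate(j = `1` * `2` / sum(`2`)) %>% # sum(`2`) is now computed within each year
  ungroup()

res
```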
