简体   繁体   中英

R Aggregate over multiple columns

i´m currently working with a large dataframe of 75 columns and round about 9500 rows. This dataframe contains observations for every day from 1995-2019 for several observation points.

Edit: The print from dput(head(df))

> dput(head(df))
structure(list(date = structure(c(9131, 9132, 9133, 9134, 9135, 
9136), class = "Date"), x1 = c(50.75, 62.625, 57.25, 56.571, 
36.75, 39.125), x2 = c(62.25, 58.714, 49.875, 56.375, 43.25, 
41.625), x3 = c(90.25, NA, 70.125, 75.75, 83.286, 98.5), 
    x4 = c(60, 72, 68.375, 65.5, 63.25, 55.875), x5 = c(NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), xn = c(53.25, 
    61.143, 56.571, 58.571, 36.25, 44.375), year = c(1995, 1995, 1995, 1995, 
    1995, 1995), month = c(1, 1, 1, 1, 1, 1), day = c(1, 2, 3, 
    4, 5, 6)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", 
"data.frame"))

The dataframe looks like this sample from it:

date             x1      x2     x3       x4       x5     xn     year    month    day
  <date>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 1995-01-01    50.8    62.2    90.2    60        NA    53.2    1995      1    1
2 1999-08-02    62.6    58.7    NA      72        NA    61.1    1999      8    2
3 2001-09-03    57.2    49.9    70.1    68.4      NA    56.6    2001      9    3
4 2008-05-04    56.6    56.4    75.8    65.5      NA    58.6    2008      5    4
5 2012-04-05    36.8    43.2    83.3    63.2      NA    36.2    2012      4    5
6 2019-12-31    39.1    41.6    98.5    55.9      NA    44.4    2019      12   31
str(df)
tibble [9,131 x 75] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ date   : Date[1:9131], format: "1995-01-01" "1995-01-02" ...
 $ x1     : num [1:9131] 50.8 62.6 57.2 56.6 36.8 ...
 $ x2     : num [1:9131] 62.2 58.7 49.9 56.4 43.2 ...
   xn
 $ year   : num [1:9131] 1995 1995 1995 1995 1995 ...
 $ month  : num [1:9131] 1 1 1 1 1 1 1 1 1 1 ...
 $ day    : num [1:9131] 1 2 3 4 5 6 7 8 9 10 ...

My goal is to get for every observation point xn the count of all observations which cross a certain limit per year. So far i tried to reach this with the Aggregate function.

To get the mean of every year i used the following command:

aggregate(list(df), by=list(year=df$year), mean, na.rm=TRUE)

this works perfect, i get the mean for every year for every observation point.

To get the sum of one station i used the following code

aggregate(list(x1=df$x1), by=list(year=df$year), function(x) sum(rle(x)$values>120, na.rm=TRUE))

which results in this print:

   year      x1
1  1995      52
2  1996      43
3  1997      44
4  1998      42
5  1999      38
6  2000      76
7  2001      52
8  2002      58
9  2003     110
10 2004      34
11 2005      64
12 2006      46
13 2007      46
14 2008      17
15 2009      41
16 2010      30
17 2011      40
18 2012      47
19 2013      40
20 2014      21
21 2015      56
22 2016      27
23 2017      45
24 2018      22
25 2019      45

So far, so good. I know i could expand the code by adding (..,x2=data$x2, x3=data$x3,..xn) to the list argument in code above. which i tried and they work.

But how do I get them all at once?

I tried the following codes:

aggregate(.~(date, year, month, day), by=list(year=df$year), function(x) sum(rle(x)$values>120, na.rm=TRUE))
Fehler: Unerwartete(s) ',' in "aggregate(.~(date,"
aggregate(.~date+year+month+day, by=list(year=df$year), function(x) sum(rle(x)$values>120, na.rm=TRUE))
Fehler in as.data.frame.default(data, optional = TRUE) : 
  cannot coerce class ‘"function"’ to a data.frame
aggregate(. ~ date + year + month + day, data = df,by=list(year=df$year), function(x) sum(rle(x)$values>120, na.rm=TRUE))
Fehler in aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) : 
  Argumente müssen dieselbe Länge haben

But unfortunately none of them works. Could someone please give me a hint where my mistake is?

This should solve your problem

library(tidyverse)
library(lubridate)
df_example <- structure(list(date = structure(c(9131, 9132, 9133, 9134, 9135, 
                                                9136), class = "Date"), x1 = c(50.75, 62.625, 57.25, 56.571, 
                                                                               36.75, 39.125), x2 = c(62.25, 58.714, 49.875, 56.375, 43.25, 
                                                                                                      41.625), x3 = c(90.25, NA, 70.125, 75.75, 83.286, 98.5), 
                             x4 = c(60, 72, 68.375, 65.5, 63.25, 55.875), x5 = c(NA_real_, 
                                                                                 NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), xn = c(53.25, 
                                                                                                                                           61.143, 56.571, 58.571, 36.25, 44.375), year = c(1995, 1995, 1995, 1995, 
                                                                                                                                                                                            1995, 1995), month = c(1, 1, 1, 1, 1, 1), day = c(1, 2, 3, 
                                                                                                                                                                                                                                              4, 5, 6)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", 
                                                                                                                                                                                                                                                                                           "data.frame"))


df_example %>% 
  pivot_longer(x1:x5) %>% 
  mutate(greater_120 = value > 120) %>% 
  group_by(year(date)) %>% 
  summarise(sum_120 = sum(greater_120,na.rm = TRUE))

Here is an answer that uses base R, and since none of the data in the example data is above 120, we set a criterion of above 70.

data <- structure(
     list(
          date = structure(c(9131, 9132, 9133, 9134, 9135,
                             9136), class = "Date"),
          x1 = c(50.75, 62.625, 57.25, 56.571,
                 36.75, 39.125),
          x2 = c(62.25, 58.714, 49.875, 56.375, 43.25,
                 41.625),
          x3 = c(90.25, NA, 70.125, 75.75, 83.286, 98.5),
          x4 = c(60, 72, 68.375, 65.5, 63.25, 55.875),
          x5 = c(NA_real_,
                 NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
          xn = c(53.25,
                 61.143, 56.571, 58.571, 36.25, 44.375),
          year = c(1995, 1995, 1995, 1995,
                   1995, 1995),
          month = c(1, 1, 1, 1, 1, 1),
          day = c(1, 2, 3,
                  4, 5, 6)
     ),
     row.names = c(NA,-6L),
     class = c("tbl_df", "tbl",
               "data.frame"
     ))

First, we create a subset of the data that contains all columns containing x , and set them to TRUE or FALSE based on whether the value is greater than 70.

theCols <- data[,colnames(data)[grepl("x",colnames(data))]]

Second, we cbind() the year onto the matrix of logical values.

x_logical <- cbind(year = data$year,as.data.frame(apply(theCols,2,function(x) x > 70)))

Finally, we use aggregate across all columns other than year and sum the columns.

aggregate(x_logical[2:ncol(x_logical)],by = list(x_logical$year),sum,na.rm=TRUE)

...and the output:

  Group.1 x1 x2 x3 x4 x5 xn
1    1995  0  0  5  1  0  0
> 

Note that by using colnames() to extract the columns that start with x and nrow() in the aggregate() function, we make this a general solution that will handle a varying number of x locations.

Two tidyverse solutions

A tidyverse solution to the same problem is as follows. It includes the following steps.

  1. Use mutate() with across() to create the TRUE / FALSE versions of the x variables. Note that across() requires dplyr 1.0.0, which is currently in development but due for production release the week of May 25th.

  2. Use pivot_longer() to allow us to summarise() multiple measures without a lot of complicated code.

  3. Use pivot_wider() to convert the data back to one column for each x measurement.

...and the code is:

devtools::install_github("tidyverse/dplyr") # needed for across()
library(dplyr)
library(tidyr) 
library(lubridate) 
data %>%
     mutate(.,across(starts_with("x"),~if_else(. > 70,TRUE,FALSE))) %>%
        select(-year,-month,-day) %>% group_by(date) %>% 
        pivot_longer(starts_with("x"),names_to = "measure",values_to = "value") %>% 
        mutate(year = year(date)) %>% group_by(year,measure) %>%
        select(-date) %>% 
                summarise(value = sum(value,na.rm=TRUE)) %>%
        pivot_wider(id_cols = year,names_from = "measure",
                    values_from = value)

...and the output, which matches the Base R solution that I originally posted:

`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 1 x 7
# Groups:   year [1]
   year    x1    x2    x3    x4    x5    xn
  <dbl> <int> <int> <int> <int> <int> <int>
1  1995     0     0     5     1     0     0
> 

...and here's an edited version of the other answer that will also produce the same results as above. This solution implements pivot_longer() before creating the logical variable for exceeding the threshold, so it does not require the across() function. Also note that since this uses 120 as the threshold value and none of the data meets this threshold, the sums are all 0.

df_example %>% 
        pivot_longer(x1:x5) %>% 
        mutate(greater_120 = value > 120) %>% 
        group_by(year,name) %>% 
        summarise(sum_120 = sum(greater_120,na.rm = TRUE)) %>%
        pivot_wider(id_cols = year,names_from = "name", values_from = sum_120)

...and the output:

`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 1 x 6
# Groups:   year [1]
   year    x1    x2    x3    x4    x5
  <dbl> <int> <int> <int> <int> <int>
1  1995     0     0     0     0     0
> 

Conclusions

As usual, there are many ways to accomplish a given task in R. Depending on one's preferences, the problem can be solved with Base R or the tidyverse. One of the quirks of the tidyverse is that some operations such as summarise() are much easier to perform on narrow format tidy data than on wide format data. Therefore, it's important to be proficient with tidyr::pivot_longer() and pivot_wider() when working in the tidyverse.

That said, with the production release of dplyr 1.0.0, the team at RStudio continues to add features that facilitate working with wide format data.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM