简体   繁体   中英

R: How to aggregate a dataframe using an index column?

I have a dataframe which looks as following:

head(test_df, n =15)
# print the first 15rows of the dataframe

               value frequency                        index
1  -2.90267705917358         1                            1
2  -2.90254878997803         1                            1
3  -2.90252590179443         1                            1
4  -2.90219354629517         1                            1
5  -2.90201354026794         1                            1
6   -2.9016375541687         1                            1
7  -2.90107154846191         1                            1
8  -2.90089440345764         1                            1
9  -2.89996957778931         1                            1
10 -2.89970088005066         1                            1
11 -2.89928865432739         1                            2
12 -2.89920520782471         1                            2
13 -2.89907360076904         1                            2
14 -2.89888191223145         1                            2
15  -2.8988630771637         1                            2

The dataframe has 3columns and 61819rows. To aggregate the dataframe, I want to get the mean value for the columns 'value' and 'frequency' for all rows with the same 'index' .

I already found some useful links, see:

https://www.r-bloggers.com/2018/07/how-to-aggregate-data-in-r/

Which is the simplest way to aggregate rows (sum) by columns values the following type of data frame on R?

However, I could not solve the problem yet.

test_df_ag <- stats::aggregate(test_df[1:2], by = test_df[3], FUN = 'mean')
# aggregate the dataframe based on the 'index' column (build the mean)

   index value frequency
1      1    NA         1
2      2    NA         1
3      3    NA         1
4      4    NA         1
5      5    NA         1
6      6    NA         1
7      7    NA         1
8      8    NA         1
9      9    NA         1
10    10    NA         1
11    11    NA         1
12    12    NA         1
13    13    NA         1
14    14    NA         1
15    15    NA         1

Since I just get NA values for the column 'value', I wonder whether it might just be a data type issue?! However also when I tried to convert the data type I failed...

base::typeof(test_df$value)
# query the data type of the 'value' column
[1] "integer"

1. Here is a base R solution.

aggregate(cbind(value, frequency) ~ index, data = test_df, FUN = mean)
#  index     value frequency
#1     1 -2.901523         1
#2     2 -2.899062         1

2. And a simple dplyr solution.

library(dplyr)

test_df %>%
  group_by(index) %>%
  summarize(across(1:2, mean))
## A tibble: 2 x 3
#  index value frequency
#* <int> <dbl>     <dbl>
#1     1 -2.90         1
#2     2 -2.90         1

Data

test_df <-
structure(list(value = c(-2.90267705917358, -2.90254878997803, 
-2.90252590179443, -2.90219354629517, -2.90201354026794, -2.9016375541687, 
-2.90107154846191, -2.90089440345764, -2.89996957778931, -2.89970088005066, 
-2.89928865432739, -2.89920520782471, -2.89907360076904, -2.89888191223145, 
-2.8988630771637), frequency = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), index = c(1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15"))

Using data.table

library(data.table)
setDT(test_df)[, lapply(.SD, mean), by = index, .SDcols = 1:2]

Try tidyverse. test_summary <- test_df %>% group_by(index) %>% summarise(n=n(), mean_value=mean(value, na.rm=T),mean_frequency=mean(frequency, na.rm=T)) .

Oh, and, of course, you should make sure you've checked the quality of your data and understand the ifs and whys of any NA's in your data set.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM