I have a large data.frame that contains many values.
For example:
df <- data.frame(Company = c('A', 'A', 'B', 'C', 'A', 'B', 'B', 'C', 'C'),
Name = c("Wayne", "Duane", "William", "Rafael", "John", "Eric", "James", "Pablo", "Tammy"),
Age = c(26, 27, 28, 32, 28, 24, 34, 30, 25),
Wages = c(50000, 70000, 70000, 60000, 50000, 70000, 65000, 50000, 50000),
Education.University = c(1, 1, 1, 0, 0, 1, 1, 0, 1),
Productivity = c(100, 120, 120, 95, 88, 115, 100, 90, 120))
How can I aggregate my data.frame? I want to analyze the values for each Company. The result should look like:
Age -> average Age of all employees in Company
Wages -> average Wages of all employees in Company
Education.University -> sum of factors (1 or 0) for all employees in Company
Productivity -> average Productivity of all employees in Company
Base R
cbind(aggregate(.~Company, df[,-c(2, 5)], mean),
aggregate(Education.University~Company, df, sum)[-1])
# Company Age Wages Productivity Education.University
#1 A 27.00000 56666.67 102.6667 2
#2 B 28.66667 68333.33 111.6667 3
#3 C 29.00000 53333.33 101.6667 1
Here is the longer version that may be easier to understand
merge(x = aggregate(x = list(Age_av = df$Age,
Wages_av = df$Wages,
Productivity_av = df$Productivity),
by = list(Company = df$Company),
FUN = mean),
y = aggregate(x = list(Education.University_sum = df$Education.University),
by = list(Company = df$Company),
FUN = sum),
by = "Company")
# Company Age_av Wages_av Productivity_av Education.University_sum
#1 A 27.00000 56666.67 102.6667 2
#2 B 28.66667 68333.33 111.6667 3
#3 C 29.00000 53333.33 101.6667 1
One option is using data.table
library(data.table)
setDT(df)[, c(lapply(.SD[, c(2:3, 5), with = FALSE], mean),
.(Education.University = sum(Education.University))), by = Company]
# Company Age Wages Productivity Education.University
#1: A 27.00000 56666.67 102.6667 2
#2: B 28.66667 68333.33 111.6667 3
#3: C 29.00000 53333.33 101.6667 1
Or with dplyr
library(dplyr)
df %>%
group_by(Company) %>%
mutate(Education.University = sum(Education.University)) %>%
summarise_if(is.numeric, mean)
# A tibble: 3 x 5
# Company Age Wages Education.University Productivity
# <fctr> <dbl> <dbl> <dbl> <dbl>
#1 A 27.00000 56666.67 2 102.6667
#2 B 28.66667 68333.33 3 111.6667
#3 C 29.00000 53333.33 1 101.6667
You can do this easily with the dplyr package.
library(dplyr)
df %>%
  group_by(Company) %>%
  summarise(Age = mean(Age),
            Wages = mean(Wages),
            Education.University = sum(Education.University),
            Productivity = mean(Productivity))
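With dplyr 1.0.0 or later, the same aggregation can also be sketched with across(), which supersedes the summarise_if() family. This is only one possible way to write it; the data frame is repeated from the question so that the snippet runs on its own, and res is just an illustrative name.

```r
library(dplyr)

# Same data as in the question
df <- data.frame(
  Company = c("A", "A", "B", "C", "A", "B", "B", "C", "C"),
  Name = c("Wayne", "Duane", "William", "Rafael", "John",
           "Eric", "James", "Pablo", "Tammy"),
  Age = c(26, 27, 28, 32, 28, 24, 34, 30, 25),
  Wages = c(50000, 70000, 70000, 60000, 50000, 70000, 65000, 50000, 50000),
  Education.University = c(1, 1, 1, 0, 0, 1, 1, 0, 1),
  Productivity = c(100, 120, 120, 95, 88, 115, 100, 90, 120)
)

res <- df %>%
  group_by(Company) %>%
  summarise(
    across(c(Age, Wages, Productivity), mean),  # averaged columns, by name
    across(Education.University, sum)           # summed column, by name
  )
res
```

Selecting the columns by name keeps the call independent of the column order in df.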
The concise data.table solution already posted uses column numbers instead of column names. This is considered bad practice according to Frequently Asked Questions about data.table, section 1.1:
If your colleague comes along and reads your code later they may have to hunt around to find out which column is number 5. If you or they change the column ordering higher up in your R program, you may produce wrong results with no warning or error if you forget to change all the places in your code which refer to column number 5.
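The risk described in the FAQ can be seen in a small base R sketch. The toy data and the object names below are made up for illustration only.

```r
# Toy data, made up for illustration
df <- data.frame(
  Company = c("A", "A", "B", "B"),
  Age     = c(26, 28, 30, 34),
  Wages   = c(50000, 70000, 60000, 65000)
)

# Column 2 is Age here
by_number <- aggregate(df[2], by = df["Company"], FUN = mean)

# Someone reorders the columns higher up in the program ...
df_reordered <- df[c("Company", "Wages", "Age")]

# ... and the same index now averages Wages, with no warning or error
by_number_after <- aggregate(df_reordered[2], by = df_reordered["Company"], FUN = mean)

# Referring to the column by name is unaffected by the reordering
by_name_after <- aggregate(df_reordered["Age"], by = df_reordered["Company"], FUN = mean)

names(by_number)        # "Company" "Age"
names(by_number_after)  # "Company" "Wages"
names(by_name_after)    # "Company" "Age"
```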
So, I would like to propose alternative approaches which use column names.
library(data.table)
setDT(df)[, .(average.Age = mean(Age),
average.Wages = mean(Wages),
sum.Education.University = sum(Education.University),
average.Productivity = mean(Productivity)),
by = Company]
   Company average.Age average.Wages sum.Education.University average.Productivity
1:       A    27.00000      56666.67                        2             102.6667
2:       B    28.66667      68333.33                        3             111.6667
3:       C    29.00000      53333.33                        1             101.6667
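Here, every column is aggregated separately. Although this requires more typing, it makes explicit which function is applied to which column, and the result columns can be named and ordered directly in the call.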
If there are many columns which require the same operation, the data.table FAQ recommends using .SDcols. So, we can do
m_cols <- c("Age", "Wages", "Productivity")
s_cols <- c("Education.University")
by_cols <- c("Company")
setDT(df)[, c(.SD[, lapply(.SD, mean), .SDcols = m_cols],
.SD[, lapply(.SD, sum ), .SDcols = s_cols]),
by = by_cols]
   Company      Age    Wages Productivity Education.University
1:       A 27.00000 56666.67     102.6667                    2
2:       B 28.66667 68333.33     111.6667                    3
3:       C 29.00000 53333.33     101.6667                    1
This is similar to Akrun's answer but uses column names instead of column numbers. In addition, the column names are stored in variables, which is handy for programming.
Note that by_cols may contain additional columns for aggregation, e.g.,
by_cols <- c("Company", "Name")
If column order matters, we can use setcolorder():
result <- setDT(df)[, c(.SD[, lapply(.SD, mean), .SDcols = m_cols],
.SD[, lapply(.SD, sum ), .SDcols = s_cols]),
by = by_cols]
setcolorder(result, intersect(names(df), names(result)))
result
   Company      Age    Wages Education.University Productivity
1:       A 27.00000 56666.67                    2     102.6667
2:       B 28.66667 68333.33                    3     111.6667
3:       C 29.00000 53333.33                    1     101.6667
Likewise, the column names of the result can be amended to meet OP's requirements:
setnames(result, m_cols, paste0("average.", m_cols))
setnames(result, s_cols, paste0("sum.", s_cols))
result
   Company average.Age average.Wages sum.Education.University average.Productivity
1:       A    27.00000      56666.67                        2             102.6667
2:       B    28.66667      68333.33                        3             111.6667
3:       C    29.00000      53333.33                        1             101.6667
Note that the data.table functions setcolorder() and setnames() work in place, i.e., without copying the data.table object. This saves memory and time, which is of particular importance when dealing with large tables.
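One way to check the in-place behaviour is to compare the object's memory address before and after the calls, using data.table's address() function. The toy table and variable names below are made up for illustration.

```r
library(data.table)

# Toy table, made up for illustration
dt <- data.table(Company = c("A", "B"), Age = c(27, 29), Wages = c(56000, 68000))

addr_before <- address(dt)  # memory address of the data.table object

setnames(dt, "Age", "average.Age")                     # rename by reference
setcolorder(dt, c("Company", "Wages", "average.Age"))  # reorder by reference

names(dt)
identical(addr_before, address(dt))  # same object: modified, not copied
```

No assignment such as dt <- setnames(dt, ...) is needed, because both functions change dt itself.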
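Just use the aggregate() function. Note that this applies mean to every column, so Education.University comes out as the proportion of 1s per Company rather than the sum asked for in the question: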
aggregate(x = df[c("Age","Wages","Education.University","Productivity")], by = df[c("Company")], FUN = mean)
# Company Age Wages Education.University Productivity
#1 A 27.00000 56666.67 0.6666667 102.6667
#2 B 28.66667 68333.33 1.0000000 111.6667
#3 C 29.00000 53333.33 0.3333333 101.6667
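A base R sketch that applies mean to some columns and sum to others in a single pass could use split() and lapply(). The helper name summarise_company is hypothetical and not part of any package; the data frame is repeated from the question so that the snippet runs on its own.

```r
# Same data as in the question
df <- data.frame(
  Company = c("A", "A", "B", "C", "A", "B", "B", "C", "C"),
  Name = c("Wayne", "Duane", "William", "Rafael", "John",
           "Eric", "James", "Pablo", "Tammy"),
  Age = c(26, 27, 28, 32, 28, 24, 34, 30, 25),
  Wages = c(50000, 70000, 70000, 60000, 50000, 70000, 65000, 50000, 50000),
  Education.University = c(1, 1, 1, 0, 0, 1, 1, 0, 1),
  Productivity = c(100, 120, 120, 95, 88, 115, 100, 90, 120)
)

# Hypothetical helper: builds one summary row for the rows of one company
summarise_company <- function(d) {
  data.frame(
    Company = d$Company[1],
    Age = mean(d$Age),
    Wages = mean(d$Wages),
    Education.University = sum(d$Education.University),
    Productivity = mean(d$Productivity)
  )
}

# Split by company, summarise each piece, and stack the rows
res <- do.call(rbind, lapply(split(df, df$Company), summarise_company))
rownames(res) <- NULL
res
```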