How do I split a data frame by a specific column value, and then apply functions to columns within the data set?

I have a data frame with 3 columns describing accounts:

Age, Users, and Cost

The Age column ranges from 1 to 20. For each Age, I want to calculate the average Cost and divide it by the average Users.

For example: what is the average number of Users across accounts of Age 1, and what is the average Cost of those accounts?

The data frame is huge, and I would prefer not to type df = data[data$age_month == 1,] for each age and then apply mean to the columns one by one.

Age  Users   Cost
1     2       5
2     15      7
2     124     10
2     43      100
3     232     21212
4     234     21212 
4     12      10000 
4     10      3
5     11      89
6     4       11
6     8       12
6     10      15
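
For reproducibility, here is a sketch of the sample above as a data frame. The name dat is an assumption, chosen because the answers below use it; the real data has many more rows. The last two lines show the one-age-at-a-time approach I would like to avoid.

```r
# Sample data copied from the table above; the name `dat` is an assumption
# chosen to match the answers below.
dat <- data.frame(
  Age   = c(1, 2, 2, 2, 3, 4, 4, 4, 5, 6, 6, 6),
  Users = c(2, 15, 124, 43, 232, 234, 12, 10, 11, 4, 8, 10),
  Cost  = c(5, 7, 10, 100, 21212, 21212, 10000, 3, 89, 11, 12, 15)
)

# The manual approach for a single age: subset, then take the two means.
age1 <- dat[dat$Age == 1, ]
ratio_age1 <- mean(age1$Cost) / mean(age1$Users)
```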

So I want the mean of the Cost column where Age = 1 divided by the mean of the Users column where Age = 1, and the same for every other Age.

Thanks in advance,

Try:

CostbyAge <- with(dat, ave(Cost, Age, FUN=mean) )
UsersbyAge <- with(dat, ave(Users, Age, FUN=mean))
CostbyAge/UsersbyAge
# [1]   2.5000000   0.6428571   0.6428571   0.6428571  91.4310345 121.9335938
# [7] 121.9335938 121.9335938   8.0909091   1.7272727   1.7272727   1.7272727
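
Note that ave returns one value per row of dat, which is why the ratio above is repeated within each Age group. If one value per Age is wanted instead, a minimal sketch with base R's tapply could look like the following. It assumes dat holds the sample data from the question (rebuilt here so the snippet runs on its own); the names cost_by_age, users_by_age and cost_per_user are only illustrative.

```r
# Sample data from the question (rebuilt so this snippet is self-contained).
dat <- data.frame(
  Age   = c(1, 2, 2, 2, 3, 4, 4, 4, 5, 6, 6, 6),
  Users = c(2, 15, 124, 43, 232, 234, 12, 10, 11, 4, 8, 10),
  Cost  = c(5, 7, 10, 100, 21212, 21212, 10000, 3, 89, 11, 12, 15)
)

# tapply applies mean to each Age group and returns one value per group,
# named by the Age value.
cost_by_age  <- tapply(dat$Cost,  dat$Age, mean)
users_by_age <- tapply(dat$Users, dat$Age, mean)

# Element-wise ratio: one cost-per-user figure for each Age.
cost_per_user <- cost_by_age / users_by_age
round(cost_per_user, 2)
# Ages 1 to 6 give roughly 2.50, 0.64, 91.43, 121.93, 8.09, 1.73
```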

Here's a way using doBy::summaryBy. Assume dat is your sample data:

library(doBy)
( s <- summaryBy(Users+Cost~Age, data = dat) )
#   Age Users.mean   Cost.mean
# 1   1   2.000000     5.00000
# 2   2  60.666667    39.00000
# 3   3 232.000000 21212.00000
# 4   4  85.333333 10405.00000
# 5   5  11.000000    89.00000
# 6   6   7.333333    12.66667
s$Cost.mean / s$Users.mean
# [1]   2.5000000   0.6428571  91.4310345 121.9335938   8.0909091   1.7272727

Here's a way to do it with dplyr:

library(dplyr)

dat %>%
  group_by(Age) %>%
  summarize(count=n(),
            users_mean=round(mean(Users),2),
            cost_mean=round(mean(Cost),2),
            # divide the unrounded means; dividing the rounded ones gives 121.94 for Age 4
            cost_per_user=round(mean(Cost)/mean(Users),2))

  Age count users_mean cost_mean cost_per_user
1   1     1       2.00      5.00          2.50
2   2     3      60.67     39.00          0.64
3   3     1     232.00  21212.00         91.43
4   4     3      85.33  10405.00        121.93
5   5     1      11.00     89.00          8.09
6   6     3       7.33     12.67          1.73

A data.table solution:

library(data.table)
setDT(dat)[, list(User_mean = mean(Users), 
                  Mean_Cost = mean(Cost), 
                  Cost_per_User = mean(Cost)/mean(Users)), by = Age]

Base R, using aggregate:

aggdat <- aggregate(cbind(Users, Cost) ~ Age, dat,  mean)
aggdat$Cost_per_User <- aggdat$Cost/aggdat$Users

Since no one has mentioned it, you can also use base R's split in combination with lapply:

lapply(split(dat,dat$Age),colMeans)

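To get the result as a single table rather than a list, one additional step is required. Note that rbind on the numeric vectors returned by colMeans produces a matrix; wrap the call in as.data.frame() if you need a data frame: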

do.call(rbind,lapply(split(dat,dat$Age),colMeans))
  Age      Users        Cost
1   1   2.000000     5.00000
2   2  60.666667    39.00000
3   3 232.000000 21212.00000
4   4  85.333333 10405.00000
5   5  11.000000    89.00000
6   6   7.333333    12.66667

split takes your data frame and creates a list of data frames, one per value of Age. lapply then applies your operation to every sub-data frame at once (here colMeans is enough to compute the means). Finally, do.call(rbind, ...) takes the resulting list and binds it row by row into a single matrix, which as.data.frame() can turn back into a data frame.

The last step to get cost per user is the same as in the other solutions.
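
As a sketch of that last step for the split/lapply route, the snippet below converts the row-bound matrix to a data frame and adds the ratio column. It assumes dat holds the sample data from the question (rebuilt here so the snippet runs on its own); the names means, res and Cost_per_User are only illustrative.

```r
# Sample data from the question (rebuilt so this snippet is self-contained).
dat <- data.frame(
  Age   = c(1, 2, 2, 2, 3, 4, 4, 4, 5, 6, 6, 6),
  Users = c(2, 15, 124, 43, 232, 234, 12, 10, 11, 4, 8, 10),
  Cost  = c(5, 7, 10, 100, 21212, 21212, 10000, 3, 89, 11, 12, 15)
)

# One row of column means per Age; rbind on numeric vectors yields a matrix.
means <- do.call(rbind, lapply(split(dat, dat$Age), colMeans))

# Convert to a data frame so that a new column can be added with `$`.
res <- as.data.frame(means)
res$Cost_per_User <- res$Cost / res$Users
res
```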
