Summing across rows conditional on groups with dplyr using select, group_by, and mutate

Problem: I'm building an aggregate market share variable for a car market with 286 distinct models and a total of 501 cars sold. The group share is based only on two car characteristics, cat ("compact", "midsize", "large") and yr (77, 78, 79, 80, 81), which gives a total of 15 groups in the market. The share, s, is a small double variable.

Closest answer I've found: by mishabalyasin on community.rstudio, "Calculating rowwise totals and proportions using tidyeval?" (link to post on community.rstudio).

Applying the split-apply-combine principle, the closest I've come to the correct answer is a summary table of the 15 groups (15 x 3: cat, yr, and the summed s):

library(dplyr)

df <- blp %>%
  select(cat, yr, s) %>%
  group_by(cat, yr) %>%
  summarise(group_share = sum(s))

#in my actual data, this loop fills in group_share to get what I want, but it isn't the desired pipeline-based answer
blp$group_share=0 #initializing the group_share, the 50th col
for(i in 1:501){
  for(j in 1:15){
    if((blp[i,31]==df[j,1])&&(blp[i,3]==df[j,2])){ #if(sameCat & sameYr){blpGS=dfGS}
      blp[i,50]=df[j,3]
      }
  }
}

This works, but I know it can be done in one fell swoop. Hopefully the idea is clear from what I've described above. A simple fix may be a loop that sets the value by conditions on cat and yr, and that would help, but I'm really trying to get better at data wrangling with dplyr, so any insight toward a pipeline-based answer would be wonderful.

Example for the site: The example below doesn't work with the code I provided, but this is the "look" of my data. There is a problem with the share being a factor.

#45 obs, 3 cats, 5 yrs
cat=c( "compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large")
yr=c(77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81)
s=c(.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002)

blp=as.data.frame(cbind(unlist(lapply(cat,as.character,stringsAsFactors=FALSE)),as.numeric(yr),unlist(as.numeric(s))))

names(blp)<-c("cat","yr","s")
head(blp)

#note: one example of a group share would be summing the share from
(group_share.blp.large.81.s=(blp[cat== "large" &yr==81,]))

#works thanks to akrun: applying the code I provided for what leads to the 15 groups 
df <- blp %>% 
    select(cat,yr,s) %>%
    group_by(cat,yr) %>% 
    summarise(group_share = sum(as.numeric(as.character(s)))) 
#manually filling doesn't work, but this is what I'd want if I didn't want pipelining
blp$group_share=0
for(i in 1:45){
  for(j in 1:15){ #loop over the 15 group rows in df
    if( ((blp[i,1])==(df[j,1])) && (as.numeric(blp[i,2])==as.numeric(df[j,2]))){ #if(sameCat & sameYr){blpGS=dfGS}
      blp[i,4]=df[j,3]
    }
  }
}
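On the factor problem mentioned in the example: a likely cause is that cbind() on a mix of character and numeric vectors builds a character matrix, so as.data.frame() then returns character columns (or factors in R versions before 4.0.0). The sketch below uses small stand-in vectors (hypothetical values, not the real data) to show the difference from building the data frame directly with data.frame().

```r
# Small stand-in vectors (hypothetical values, same shape as the example above)
cat <- c("compact", "midsize", "large")
yr  <- c(77, 78, 79)
s   <- c(.001, .0005, .002)

# cbind() on mixed types builds a character matrix,
# so the numeric columns are converted to strings
m <- cbind(cat, yr, s)
is.character(m)   # TRUE

# as.data.frame() on that matrix gives character columns
# (factor columns in R versions before 4.0.0), so sum(bad$s) fails
bad <- as.data.frame(m)

# data.frame() keeps each vector's own type
good <- data.frame(cat, yr, s, stringsAsFactors = FALSE)
is.numeric(good$s)  # TRUE
sum(good$s)
```

With the columns kept numeric, the as.numeric(as.character(s)) workaround should no longer be needed.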

If I understood your problem correctly, this should help. The only difference is that instead of summarise(), which returns only the grouping columns and the summarised column, you use mutate(), which keeps all the original rows and columns and adds the aggregate as a new column.

# Sample input
## 45 obs, 3 cats, 5 yrs
cat <- c( "compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large")

yr <- c(77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81)

s <- c(.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002)

# Calculation
library(dplyr)

blp <- 
  data.frame(cat, yr, s, stringsAsFactors = FALSE) %>% # To create dataframe
  group_by(cat, yr) %>% # Grouping by category and year
  mutate(group_share = sum(s, na.rm = TRUE)) %>% # Calculating sum share per category/year 
  ungroup()
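For comparison, the loop in the question is effectively a lookup of each row's group total in the summarised table. A minimal sketch of that route in dplyr is summarise() followed by left_join(), shown here next to the grouped mutate() on a hypothetical six-row miniature of the data (blp_small and its values are made up for illustration). It assumes dplyr 1.0.0 or later for the .groups argument.

```r
library(dplyr)

# Hypothetical miniature of the example data: 2 categories x 2 years, 6 rows
blp_small <- data.frame(
  cat = c("compact", "compact", "large", "large", "compact", "large"),
  yr  = c(77, 77, 77, 78, 78, 78),
  s   = c(.001, .002, .0005, .0001, .0002, .0003),
  stringsAsFactors = FALSE
)

# Route 1: grouped mutate() keeps all 6 rows and repeats the group total on each
via_mutate <- blp_small %>%
  group_by(cat, yr) %>%
  mutate(group_share = sum(s, na.rm = TRUE)) %>%
  ungroup()

# Route 2: summarise() collapses to one row per group,
# then left_join() copies each total back onto the matching rows
group_totals <- blp_small %>%
  group_by(cat, yr) %>%
  summarise(group_share = sum(s, na.rm = TRUE), .groups = "drop")

via_join <- blp_small %>%
  left_join(group_totals, by = c("cat", "yr"))

nrow(group_totals)  # one row per (cat, yr) group
all.equal(via_mutate$group_share, via_join$group_share)
```

Both routes should give the same group_share column. The mutate() version is shorter, while the join version is handy when the group totals are also needed as a separate table.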

