
Using dplyr in R: How to summarise data on same column with different criteria

I have this data set:

user_id                 business_id             date      stars  review_length  pos_words  neg_words  net_sentiment
Xqd0DzHaiyRqVH3WRG7hzg  vcNAWiLM4dR7D2nwwJ7nCA  17/05/07      5             94          4          1              3
H1kH6QZV7Le4zqTRNxoZow  vcNAWiLM4dR7D2nwwJ7nCA  22/03/10      2            114          3          7             -4
zvJCcrpm2yOZrxKffwGQLA  vcNAWiLM4dR7D2nwwJ7nCA  14/02/12      4             55          6          0              6
KBLW4wJA_fwoWmMhiHRVOA  vcNAWiLM4dR7D2nwwJ7nCA  2/03/12       4             97          0          3             -3
zvJCcrpm2yOZrxKffwGQLA  vcNAWiLM4dR7D2nwwJ7nCA  15/05/12      4             53          1          2             -1


yelp <- read.csv("yelp_ratings.csv")
colnames(yelp)
 [1] "user_id"       "business_id"   "date"          "stars"         "review_length"
 [6] "pos_words"    "neg_words"     "net_sentiment"

I need to use dplyr to find the businesses with the best and worst ratings, as determined by the value in net_sentiment, and also the users who gave the best and worst ratings (again by net_sentiment) for each of those business IDs.

Here's what I have right now:

yelp %>%
  group_by(business_id,user_id) %>%
  summarise(net_sentiment = max(net_sentiment)) %>%
  arrange(desc(net_sentiment)) %>%
  head(n=20)

On my data set, this prints:

              business_id                user_id net_sentiment
1  -5RN56jH78MV2oquLV_G8g xNb8pFe99ENj8BeMsCBPcQ            80
2  gVYju3XRcO1R4aNk7SZJcA xNb8pFe99ENj8BeMsCBPcQ            78
3  ORiLSAAV4srZ_twFy1tWpw xNb8pFe99ENj8BeMsCBPcQ            77
4  gVYju3XRcO1R4aNk7SZJcA ULOPLvLghKZrfo3PhwbPAQ            74
5  4uGHPY-OpJN08CabtTAvNg xNb8pFe99ENj8BeMsCBPcQ            72

This shows the businesses with the highest net_sentiment scores, along with the user who gave each score.

What I want to achieve is something like this.

For the business with best rating:

            business_id    user_id_best_rating pos_net_sentiment user_id_worst_rating neg_net_sentiment
 -5RN56jH78MV2oquLV_G8g xNb8pFe99ENj8BeMsCBPcQ                80              user123               -50

For the business with worst rating:

business_id user_id_best_rating pos_net_sentiment user_id_worst_rating neg_net_sentiment
business123             user345                10              user789              -150

To clarify: using dplyr, the output should list the best businesses first, as determined by the net_sentiment score, together with the users who gave the best and worst ratings for each business. The same should then be done for the worst businesses.

Here is a single pipe that gets you the first table; re-sorting its result then gives you the second table easily. Taking the head of each result gives you the single line of desired output.

The logic is to group by business and use mutate to put the best and worst scores into their own columns. Each of those scores then serves as the join key for looking up the matching user ID (for example, user_id_best_rating). If that key returns too many matches, add the business ID as a secondary key (in effect, a composite key of score and business ID for each user ID).
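To illustrate the composite-key remark, here is a minimal sketch. The data frame `reviews` and the names `best`, `lookup`, `by_score`, and `by_both` are hypothetical and exist only for this illustration. It assumes two businesses happen to share the same top score, which is the case where a score-only key matches too many rows.

```r
library(dplyr)

# Hypothetical toy data: businesses "a" and "b" both have a review scoring 10,
# so the score alone does not identify a single review.
reviews <- data.frame(
  busiID   = c("a", "a", "b", "b"),
  userID   = c(1, 2, 3, 4),
  netSenti = c(10, -5, 10, 2)
)

# Best score per business.
best <- reviews %>%
  group_by(busiID) %>%
  summarise(pos_netSenti = max(netSenti))

# Lookup table mapping a (business, score) pair to the user who gave it.
lookup <- reviews %>%
  select(busiID, pos_netSenti = netSenti, user_id_best_rating = userID)

# Score-only key: each business matches both reviews that scored 10.
# (Recent dplyr versions may also warn about a many-to-many match here.)
by_score <- left_join(best, select(lookup, -busiID), by = "pos_netSenti")

# Composite key (business + score): one match per business.
by_both <- left_join(best, lookup, by = c("busiID", "pos_netSenti"))

nrow(by_score)  # 4
nrow(by_both)   # 2
```

The composite key still returns more than one row for a business if several users gave it the same extreme score; whether to keep all of them or pick one is a separate decision.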

The pipe below adds the user IDs for the highest and lowest reviews, trims off the extra columns and duplicate rows, and then sorts the highest rating to the top.

# simplified, portable data showing a similar pattern of overlap
library(dplyr)

busiID <- c('a','b','c','b','e')
userID <- c(1,1,1,2,1)
netSenti <- c(80,78,77,74,72)
ylp <- data.frame(busiID,userID,netSenti)

SmryYlp <- 
    ylp %>% 
    group_by(busiID) %>% 
    # best and worst score per business, repeated on every row of the group
    mutate(pos_netSenti = max(netSenti), neg_netSenti = min(netSenti)) %>% 
    # look up the user behind each score, using the score itself as the join key
    left_join(select(ylp, neg_netSenti = netSenti, user_id_worst_rating = userID),
              by = "neg_netSenti") %>% 
    left_join(select(ylp, pos_netSenti = netSenti, user_id_best_rating = userID),
              by = "pos_netSenti") %>% 
    # keep only the output columns and drop the duplicated rows
    select(busiID, user_id_best_rating, pos_netSenti, user_id_worst_rating, neg_netSenti) %>% 
    ungroup %>% distinct %>% 
    # highest positive score first
    arrange(desc(pos_netSenti))

SmryYlp
# A tibble: 4 × 5
#   busiID user_id_best_rating pos_netSenti user_id_worst_rating neg_netSenti
#   <fctr>               <dbl>        <dbl>                <dbl>        <dbl>
# 1      a                   1           80                    1           80
# 2      b                   1           78                    2           74
# 3      c                   1           77                    1           77
# 4      e                   1           72                    1           72
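
The re-sorting step mentioned earlier could look roughly like the sketch below. The table `smry` is hypothetical and built by hand, with the same columns as SmryYlp, so that the block runs on its own. Ranking the "worst" business by its lowest neg_netSenti is an assumption about what the question means.

```r
library(dplyr)

# Hypothetical summary table with the same columns as SmryYlp.
smry <- data.frame(
  busiID               = c("a", "b", "c"),
  user_id_best_rating  = c(1, 1, 2),
  pos_netSenti         = c(80, 78, 10),
  user_id_worst_rating = c(3, 2, 4),
  neg_netSenti         = c(-50, 74, -150)
)

# Best business: highest positive score on top, keep one row.
best_business <- smry %>%
  arrange(desc(pos_netSenti)) %>%
  head(1)

# Worst business (assumed to mean the lowest negative score): keep one row.
worst_business <- smry %>%
  arrange(neg_netSenti) %>%
  head(1)

best_business
worst_business
```

Using a larger `n` in `head()` gives a top-n listing instead of a single line.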

Hope this helps 🙂
