I have this data set:
user_id                business_id            date     stars review_length pos_words neg_words net_sentiment
Xqd0DzHaiyRqVH3WRG7hzg vcNAWiLM4dR7D2nwwJ7nCA 17/05/07     5            94         4         1             3
H1kH6QZV7Le4zqTRNxoZow vcNAWiLM4dR7D2nwwJ7nCA 22/03/10     2           114         3         7            -4
zvJCcrpm2yOZrxKffwGQLA vcNAWiLM4dR7D2nwwJ7nCA 14/02/12     4            55         6         0             6
KBLW4wJA_fwoWmMhiHRVOA vcNAWiLM4dR7D2nwwJ7nCA 2/03/12      4            97         0         3            -3
zvJCcrpm2yOZrxKffwGQLA vcNAWiLM4dR7D2nwwJ7nCA 15/05/12     4            53         1         2            -1
yelp <- read.csv("yelp_ratings.csv")
colnames(yelp)
[1] "user_id" "business_id" "date" "stars" "review_length"
[6] "pos_words" "neg_words" "net_sentiment"
I need to use dplyr to find the businesses with the best and worst ratings, as determined by the value in net_sentiment, and also the users who gave the best and worst ratings (again by net_sentiment) for each of those business IDs.
Here's what I have right now:
yelp %>%
  group_by(business_id, user_id) %>%
  summarise(net_sentiment = max(net_sentiment)) %>%
  arrange(desc(net_sentiment)) %>%
  head(n = 20)
For my data set, this prints:
  business_id            user_id                net_sentiment
1 -5RN56jH78MV2oquLV_G8g xNb8pFe99ENj8BeMsCBPcQ            80
2 gVYju3XRcO1R4aNk7SZJcA xNb8pFe99ENj8BeMsCBPcQ            78
3 ORiLSAAV4srZ_twFy1tWpw xNb8pFe99ENj8BeMsCBPcQ            77
4 gVYju3XRcO1R4aNk7SZJcA ULOPLvLghKZrfo3PhwbPAQ            74
5 4uGHPY-OpJN08CabtTAvNg xNb8pFe99ENj8BeMsCBPcQ            72
which shows the business with the highest net_sentiment score and also the user who gave that net_sentiment score.
What I intend to achieve is something like this.
For the business with best rating:
business_id            user_id_best_rating    pos_net_sentiment user_id_worst_rating neg_net_sentiment
-5RN56jH78MV2oquLV_G8g xNb8pFe99ENj8BeMsCBPcQ                80 user123                            -50
For the business with worst rating:
business_id            user_id_best_rating    pos_net_sentiment user_id_worst_rating neg_net_sentiment
business123            user345                               10 user789                           -150
To clarify: using dplyr, the result should list the best businesses first, as determined by the net_sentiment score, together with the users who gave the best and worst rating for each business. The same should then be done for the worst businesses.
Here is a single pipe that gets you the first table; re-sorting the result then gives you the second table very easily. If you take the head each time, you get the single line of desired output.
The logic is to group by business and use mutate to put each business's best and worst scores into their own columns; those scores then serve as the join keys for looking up user_id_best_rating and user_id_worst_rating. If that key returns too many matches (the same score occurring in more than one business), add the business ID as a secondary key, in effect using a composite key of score plus business ID for each user ID.
The pipe adds the IDs of the users behind the highest and lowest scores, trims off the extra columns and duplicate rows, and then sorts the highest rating to the top.
# simplified, portable data demonstrating a similar pattern of overlap
library(dplyr)
busiID <- c('a','b','c','b','e')
userID <- c(1,1,1,2,1)
netSenti <- c(80,78,77,74,72)
ylp <- data.frame(busiID,userID,netSenti)
SmryYlp <-
  ylp %>%
  group_by(busiID) %>%
  # best and worst score per business, repeated on every row of the group
  mutate(pos_netSenti = max(netSenti), neg_netSenti = min(netSenti)) %>%
  # look up the users who gave those scores, using the score as the join key
  left_join(select(ylp, neg_netSenti = netSenti, user_id_worst_rating = userID), by = "neg_netSenti") %>%
  left_join(select(ylp, pos_netSenti = netSenti, user_id_best_rating = userID), by = "pos_netSenti") %>%
  select(busiID, user_id_best_rating, pos_netSenti, user_id_worst_rating, neg_netSenti) %>%
  # drop the repeated rows, then put the highest rating first
  ungroup() %>% distinct() %>%
  arrange(desc(pos_netSenti))
SmryYlp
## A tibble: 4 × 5
#   busiID user_id_best_rating pos_netSenti user_id_worst_rating neg_netSenti
#   <fctr>               <dbl>        <dbl>                <dbl>        <dbl>
# 1      a                   1           80                    1           80
# 2      b                   1           78                    2           74
# 3      c                   1           77                    1           77
# 4      e                   1           72                    1           72
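For the composite-key variant and the re-sorting step mentioned above, here is a minimal sketch. The toy data and the names ylp2, extremes, best, worst, SmryYlp2, best_first and worst_first are made up for illustration; they are not from the original data set. It also assumes summarise is acceptable in place of mutate plus distinct. Businesses 'a' and 'b' both contain a score of 50, which is the case where joining on the score alone would pull in a user from the wrong business.

```r
library(dplyr)

# Hypothetical toy data: the score 50 occurs in both business 'a' and 'b'.
ylp2 <- data.frame(
  busiID   = c("a", "a", "b", "b", "c"),
  userID   = c(1, 2, 3, 4, 5),
  netSenti = c(80, 50, 50, -10, 20),
  stringsAsFactors = FALSE
)

# One row per business holding its best and worst score.
extremes <- ylp2 %>%
  group_by(busiID) %>%
  summarise(pos_netSenti = max(netSenti),
            neg_netSenti = min(netSenti))

# Lookup tables keyed on the composite key (business ID, score).
best  <- select(ylp2, busiID, pos_netSenti = netSenti, user_id_best_rating = userID)
worst <- select(ylp2, busiID, neg_netSenti = netSenti, user_id_worst_rating = userID)

SmryYlp2 <- extremes %>%
  left_join(best,  by = c("busiID", "pos_netSenti")) %>%
  left_join(worst, by = c("busiID", "neg_netSenti")) %>%
  select(busiID, user_id_best_rating, pos_netSenti,
         user_id_worst_rating, neg_netSenti)

# Best businesses first; head() keeps only the top one.
best_first <- arrange(SmryYlp2, desc(pos_netSenti))
head(best_first, 1)

# Worst businesses first; head() keeps only the bottom one.
worst_first <- arrange(SmryYlp2, neg_netSenti)
head(worst_first, 1)
```

If several users share a business's best or worst score, the joins return one row per tied user, so you may want to decide how to break ties before taking the head.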
Hope this helps 🙂