
R join two data frames, group by column and calculate mean

I've Googled around, but I can't find a solution to the problem I'm having. I have two data frames. The first holds movies by ID and contains their ratings:

> summary(ratings)
    movieId        mean_rating      rating_count    
 Min.   :     1   Min.   : 1.000   Min.   :    1.0  
 1st Qu.:  6796   1st Qu.: 5.600   1st Qu.:    3.0  
 Median : 65880   Median : 6.471   Median :   18.0  
 Mean   : 58790   Mean   : 6.266   Mean   :  747.8  
 3rd Qu.: 99110   3rd Qu.: 7.130   3rd Qu.:  205.0  
 Max.   :131262   Max.   :10.000   Max.   :67310.0  
      rn           
 Length:26744      
 Class :character  
 Mode  :character  

The other is a collection of user-defined tags that have been added to these movies. It also has a column called movieId that corresponds to movieId in the first data frame.

> summary(tags)
     userId          movieId           tag           
 Min.   :    18   Min.   :     1   Length:465564     
 1st Qu.: 28780   1st Qu.:  2571   Class :character  
 Median : 70201   Median :  7373   Mode  :character  
 Mean   : 68712   Mean   : 32628                     
 3rd Qu.:107322   3rd Qu.: 62235                     
 Max.   :138472   Max.   :131258                     
   timestamp               rn           
 Min.   :1135429210   Length:465564     
 1st Qu.:1245007262   Class :character  
 Median :1302291181   Mode  :character  
 Mean   :1298711076                     
 3rd Qu.:1366217861                     
 Max.   :1427771352  

What I want to do is get the mean movie rating for each tag. Basically, the equivalent of this SQL query:

SELECT t.tag, AVG(r.mean_rating) FROM movielens_tags t RIGHT JOIN movielens_ratings r ON t.movieId = r.movieId GROUP BY t.tag;

I just need 2 columns in the output:

tag         mean_rating
sci_fi      6.23
bollywood   7.45
action      5.75

However, this SQL query never finishes, which is why I want to do it in R. Can anyone suggest how I should approach this?

Here is the dplyr translation of your SQL code (the dplyr package must be installed):

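library(dplyr)

# movielens_tags and movielens_ratings mirror the SQL table names;
# in the question's R session the data frames are called tags and ratings.
movielens_tags %>%
  right_join(movielens_ratings, by = "movieId") %>%  # keep every rated movie
  group_by(tag) %>%                                   # one group per tag
  summarise(mean_rating = mean(mean_rating))          # average rating per tag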
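
To see what the pipeline above returns, here is a minimal sketch on toy data. The two small data frames are made-up stand-ins for the question's `ratings` and `tags` (only the column names `movieId`, `mean_rating` and `tag` come from the question). It also shows one thing worth knowing about the right join: a movie with no tags ends up in a group whose `tag` is `NA`. Swapping in `inner_join()` is one possible way to drop that group, assuming untagged movies are not wanted in the result.

```r
library(dplyr)

# Toy stand-ins for the question's data frames (values invented for illustration)
ratings <- data.frame(
  movieId     = c(1, 2, 3),
  mean_rating = c(6.0, 8.0, 5.0)
)
tags <- data.frame(
  movieId = c(1, 1, 2),
  tag     = c("sci_fi", "action", "sci_fi")
)

# Same pipeline as the answer: movie 3 has no tag,
# so right_join() keeps it with tag = NA and it forms its own group.
by_tag <- tags %>%
  right_join(ratings, by = "movieId") %>%
  group_by(tag) %>%
  summarise(mean_rating = mean(mean_rating))

# Variant: inner_join() keeps only movies that have at least one tag,
# so no NA group appears in the result.
by_tag_tagged_only <- tags %>%
  inner_join(ratings, by = "movieId") %>%
  group_by(tag) %>%
  summarise(mean_rating = mean(mean_rating))

print(by_tag)
print(by_tag_tagged_only)
```

Note that a movie carrying the same tag from several users contributes one row per tagging, so it weighs more heavily in that tag's mean. This matches what the SQL query does.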
