
Calculate association score using two columns in pandas

I have a pandas dataframe where each row is a user and each column is a movie. Each cell holds the rating that user gave the movie. Some users did not rate certain movies, so those values are NaN.

The dataframe converted to a dict (for easy copy and paste):

{'User': {0: 755,
  1: 5277,
  2: 1577,
  3: 4388,
  4: 1202,
  5: 3823,
  6: 5448,
  7: 5347,
  8: 4117,
  9: 2765,
  10: 5450,
  11: 139,
  12: 1940,
  13: 3118,
  14: 4656,
  15: 4796,
  16: 6037,
  17: 3048,
  18: 4790,
  19: 4489},
 'Gender (1 =F, 0=M)': {0: 0,
  1: 0,
  2: 1,
  3: 0,
  4: 1,
  5: 1,
  6: 0,
  7: 0,
  8: 1,
  9: 0,
  10: 1,
  11: 0,
  12: 0,
  13: 1,
  14: 1,
  15: 1,
  16: 0,
  17: 1,
  18: 0,
  19: 0},
 '260: Star Wars: Episode IV - A New Hope (1977)': {0: 1.0,
  1: 5.0,
  2: nan,
  3: nan,
  4: 4.0,
  5: 2.0,
  6: nan,
  7: 4.0,
  8: 5.0,
  9: 4.0,
  10: 2.0,
  11: 3.0,
  12: 2.0,
  13: 3.0,
  14: 4.0,
  15: nan,
  16: nan,
  17: 4.0,
  18: 5.0,
  19: 1.0},
 '1210: Star Wars: Episode VI - Return of the Jedi (1983)': {0: 5.0,
  1: 3.0,
  2: nan,
  3: 3.0,
  4: 3.0,
  5: 4.0,
  6: nan,
  7: nan,
  8: 1.0,
  9: 2.0,
  10: 1.0,
  11: 5.0,
  12: 3.0,
  13: nan,
  14: 4.0,
  15: nan,
  16: nan,
  17: 5.0,
  18: 1.0,
  19: 2.0},
 '356: Forrest Gump (1994)': {0: 2.0,
  1: nan,
  2: nan,
  3: nan,
  4: 4.0,
  5: 4.0,
  6: 3.0,
  7: nan,
  8: nan,
  9: nan,
  10: 5.0,
  11: 2.0,
  12: nan,
  13: 3.0,
  14: nan,
  15: 1.0,
  16: nan,
  17: 1.0,
  18: nan,
  19: 2.0},
 '318: Shawshank Redemption, The (1994)': {0: nan,
  1: 2.0,
  2: 5.0,
  3: nan,
  4: 1.0,
  5: 4.0,
  6: 1.0,
  7: nan,
  8: 4.0,
  9: 5.0,
  10: nan,
  11: nan,
  12: 5.0,
  13: nan,
  14: nan,
  15: nan,
  16: nan,
  17: 5.0,
  18: nan,
  19: 4.0},
 '593: Silence of the Lambs, The (1991)': {0: 4.0,
  1: 4.0,
  2: 2.0,
  3: nan,
  4: 4.0,
  5: nan,
  6: 1.0,
  7: 3.0,
  8: 2.0,
  9: 3.0,
  10: nan,
  11: 2.0,
  12: 4.0,
  13: 2.0,
  14: 5.0,
  15: 3.0,
  16: 4.0,
  17: 1.0,
  18: nan,
  19: 5.0},
 '3578: Gladiator (2000)': {0: 4.0,
  1: 2.0,
  2: nan,
  3: 1.0,
  4: 1.0,
  5: nan,
  6: 4.0,
  7: 2.0,
  8: 4.0,
  9: nan,
  10: 5.0,
  11: nan,
  12: nan,
  13: nan,
  14: 5.0,
  15: 2.0,
  16: nan,
  17: 1.0,
  18: 4.0,
  19: nan},
 '1: Toy Story (1995)': {0: 2.0,
  1: 1.0,
  2: 4.0,
  3: 2.0,
  4: nan,
  5: 3.0,
  6: nan,
  7: 2.0,
  8: 4.0,
  9: 4.0,
  10: 5.0,
  11: 2.0,
  12: 4.0,
  13: 3.0,
  14: 2.0,
  15: nan,
  16: 2.0,
  17: 4.0,
  18: 2.0,
  19: 2.0},
 '2028: Saving Private Ryan (1998)': {0: 2.0,
  1: nan,
  2: nan,
  3: 3.0,
  4: 4.0,
  5: 1.0,
  6: 5.0,
  7: nan,
  8: 4.0,
  9: 3.0,
  10: nan,
  11: nan,
  12: 5.0,
  13: nan,
  14: nan,
  15: 2.0,
  16: nan,
  17: nan,
  18: 1.0,
  19: 3.0},
 '296: Pulp Fiction (1994)': {0: nan,
  1: nan,
  2: nan,
  3: 4.0,
  4: nan,
  5: 4.0,
  6: 2.0,
  7: 3.0,
  8: nan,
  9: 4.0,
  10: nan,
  11: 1.0,
  12: nan,
  13: nan,
  14: 3.0,
  15: nan,
  16: 2.0,
  17: 5.0,
  18: 3.0,
  19: 2.0},
 '1259: Stand by Me (1986)': {0: 3.0,
  1: 4.0,
  2: 1.0,
  3: nan,
  4: 1.0,
  5: 4.0,
  6: nan,
  7: nan,
  8: 1.0,
  9: nan,
  10: nan,
  11: nan,
  12: nan,
  13: 4.0,
  14: 5.0,
  15: 1.0,
  16: nan,
  17: nan,
  18: 3.0,
  19: 2.0},
 '2396: Shakespeare in Love (1998)': {0: 2.0,
  1: 3.0,
  2: nan,
  3: nan,
  4: 5.0,
  5: 5.0,
  6: 1.0,
  7: nan,
  8: 2.0,
  9: nan,
  10: nan,
  11: 3.0,
  12: nan,
  13: nan,
  14: nan,
  15: 5.0,
  16: 2.0,
  17: nan,
  18: 3.0,
  19: 1.0},
 '2916: Total Recall (1990)': {0: nan,
  1: 2.0,
  2: 1.0,
  3: 4.0,
  4: 1.0,
  5: 2.0,
  6: nan,
  7: 2.0,
  8: 3.0,
  9: nan,
  10: 3.0,
  11: nan,
  12: 2.0,
  13: 1.0,
  14: 1.0,
  15: nan,
  16: nan,
  17: nan,
  18: 1.0,
  19: nan},
 '780: Independence Day (ID4) (1996)': {0: 5.0,
  1: 2.0,
  2: 4.0,
  3: 1.0,
  4: nan,
  5: 4.0,
  6: nan,
  7: 3.0,
  8: 1.0,
  9: 2.0,
  10: 2.0,
  11: 3.0,
  12: 4.0,
  13: 2.0,
  14: 3.0,
  15: nan,
  16: nan,
  17: nan,
  18: nan,
  19: nan},
 '541: Blade Runner (1982)': {0: 2.0,
  1: nan,
  2: 4.0,
  3: 3.0,
  4: 4.0,
  5: nan,
  6: 3.0,
  7: 2.0,
  8: nan,
  9: nan,
  10: nan,
  11: nan,
  12: nan,
  13: 2.0,
  14: nan,
  15: nan,
  16: nan,
  17: 4.0,
  18: nan,
  19: 5.0},
 '1265: Groundhog Day (1993)': {0: nan,
  1: 2.0,
  2: 1.0,
  3: 5.0,
  4: nan,
  5: 1.0,
  6: nan,
  7: 4.0,
  8: 5.0,
  9: nan,
  10: nan,
  11: 2.0,
  12: 3.0,
  13: 3.0,
  14: 2.0,
  15: 5.0,
  16: nan,
  17: nan,
  18: nan,
  19: 5.0},
 '2571: Matrix, The (1999)': {0: 4.0,
  1: nan,
  2: 1.0,
  3: nan,
  4: 3.0,
  5: nan,
  6: 1.0,
  7: nan,
  8: nan,
  9: 2.0,
  10: 1.0,
  11: 5.0,
  12: nan,
  13: 5.0,
  14: nan,
  15: 2.0,
  16: 4.0,
  17: nan,
  18: 2.0,
  19: 4.0},
 "527: Schindler's List (1993)": {0: 2.0,
  1: 5.0,
  2: 2.0,
  3: 5.0,
  4: 5.0,
  5: nan,
  6: nan,
  7: 1.0,
  8: nan,
  9: 5.0,
  10: nan,
  11: nan,
  12: nan,
  13: 1.0,
  14: 3.0,
  15: 2.0,
  16: nan,
  17: 2.0,
  18: nan,
  19: 3.0},
 '2762: Sixth Sense, The (1999)': {0: 5.0,
  1: 1.0,
  2: 3.0,
  3: 1.0,
  4: 5.0,
  5: 3.0,
  6: nan,
  7: 3.0,
  8: nan,
  9: 1.0,
  10: 2.0,
  11: nan,
  12: nan,
  13: nan,
  14: nan,
  15: 4.0,
  16: nan,
  17: 1.0,
  18: nan,
  19: 5.0},
 '1198: Raiders of the Lost Ark (1981)': {0: nan,
  1: 3.0,
  2: 1.0,
  3: 1.0,
  4: nan,
  5: nan,
  6: 5.0,
  7: 5.0,
  8: nan,
  9: nan,
  10: 1.0,
  11: nan,
  12: 5.0,
  13: nan,
  14: 3.0,
  15: 3.0,
  16: nan,
  17: 2.0,
  18: nan,
  19: 3.0},
 '34: Babe (1995)': {0: nan,
  1: nan,
  2: 3.0,
  3: 2.0,
  4: nan,
  5: 2.0,
  6: 2.0,
  7: nan,
  8: 5.0,
  9: nan,
  10: 4.0,
  11: 2.0,
  12: nan,
  13: nan,
  14: 1.0,
  15: 4.0,
  16: nan,
  17: 5.0,
  18: nan,
  19: nan}}

I want to find the movies that most often occur together with movie 1 (Toy Story). In other words, for each movie, I want to calculate the percentage of Toy Story raters who also rated that movie. If there are ties, I want the lowest-numbered movie to take the higher rank: if movies 541 and 318 are tied, then 318 gets the higher rank.

I have tried to do this with a subset of the dataframe in which Toy Story has no null ratings, data_subset = data[data['1: Toy Story (1995)'].notnull()], and then attempted to get the percentages via ((data_subset.count() + data_subset['1: Toy Story (1995)'].count()) / data_subset['1: Toy Story (1995)'].count()).sort_values(ascending=False). The ranking seems to be correct, but the percentage values do not.
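
The stray + in that expression adds the Toy Story count to every column's count before dividing, which shifts every percentage up by 1; dropping it yields the intended ratio. A minimal sketch of the corrected computation, assuming the dataframe is named data as in the attempt (ties by movie number are not broken here):

import pandas as pd

# Keep only the users who rated Toy Story.
data_subset = data[data['1: Toy Story (1995)'].notnull()]

# Within the subset, count() gives each column's number of non-null ratings;
# dividing by the number of Toy Story raters gives the share of Toy Story
# raters who also rated that movie.
n_toy_story = data_subset['1: Toy Story (1995)'].count()
pct = (data_subset.drop(columns=['User', 'Gender (1 =F, 0=M)']).count()
       / n_toy_story).sort_values(ascending=False)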

I am not sure I fully understand your question.

What I have is the following:

For each movie, find the most-associated viewed movie. In the case of a tie, follow your logic.

I am assuming your dataframe is named df and the nan values are actually `np.nan`.

import pandas as pd
import numpy as np
import re  # regular expressions, used to pull the movie number out of the column name

# Indicator dataframe over the movie columns only: 1 where a rating exists, 0 where it was nan.
df2 = (df.drop(columns=['User', 'Gender (1 =F, 0=M)']) > 0.0).astype(int)

dfs = []  # collects a one-row result per movie

for movie in df2.columns:  # iterate over all columns (i.e. movies)

    # Group by whether each user rated this movie; the column means within group 1
    # are the fractions of this movie's raters who also rated each other movie.
    tmp = df2.groupby(movie).mean().T[[1]].reset_index()

    # Clean the returned dataframe and apply the tie-breaking rule:
    # sort by frequency (descending), then by movie number (ascending), and keep the top row.
    tmp['Movie_num'] = [int(re.sub(r'(^\d+).*', r'\1', el)) for el in tmp['index']]
    tmp = tmp.sort_values([1, 'Movie_num'], ascending=[False, True]).head(1)[['index', 1]]
    tmp.columns.name = None  # drop the leftover groupby column name
    tmp.index = [movie]
    tmp.index.name = 'Movie'
    tmp.columns = ['Frequent_Movie', 'Frequency']
    dfs.append(tmp)

df_final = pd.concat(dfs).reset_index()
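
A quick way to reproduce df before running the loop, assuming the dict pasted in the question is bound to a name such as data_dict (a name introduced here, not in the original):

import pandas as pd
from numpy import nan  # the pasted dict literal contains bare `nan`, so this binding is needed

df = pd.DataFrame(data_dict)  # data_dict = the dict copied from the question

df_final then has one row per movie: Frequent_Movie is the movie most often rated by that movie's raters, and Frequency is the corresponding fraction.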

