Calculate association score using two columns in pandas
I have a pandas dataframe where each row is a user and each column is a movie. Each cell holds the rating the user gave the movie. Some users did not rate certain movies, so those values are NaN.

The dataframe converted to a dict (for easy copy and paste):
{'User': {0: 755,
1: 5277,
2: 1577,
3: 4388,
4: 1202,
5: 3823,
6: 5448,
7: 5347,
8: 4117,
9: 2765,
10: 5450,
11: 139,
12: 1940,
13: 3118,
14: 4656,
15: 4796,
16: 6037,
17: 3048,
18: 4790,
19: 4489},
'Gender (1 =F, 0=M)': {0: 0,
1: 0,
2: 1,
3: 0,
4: 1,
5: 1,
6: 0,
7: 0,
8: 1,
9: 0,
10: 1,
11: 0,
12: 0,
13: 1,
14: 1,
15: 1,
16: 0,
17: 1,
18: 0,
19: 0},
'260: Star Wars: Episode IV - A New Hope (1977)': {0: 1.0,
1: 5.0,
2: nan,
3: nan,
4: 4.0,
5: 2.0,
6: nan,
7: 4.0,
8: 5.0,
9: 4.0,
10: 2.0,
11: 3.0,
12: 2.0,
13: 3.0,
14: 4.0,
15: nan,
16: nan,
17: 4.0,
18: 5.0,
19: 1.0},
'1210: Star Wars: Episode VI - Return of the Jedi (1983)': {0: 5.0,
1: 3.0,
2: nan,
3: 3.0,
4: 3.0,
5: 4.0,
6: nan,
7: nan,
8: 1.0,
9: 2.0,
10: 1.0,
11: 5.0,
12: 3.0,
13: nan,
14: 4.0,
15: nan,
16: nan,
17: 5.0,
18: 1.0,
19: 2.0},
'356: Forrest Gump (1994)': {0: 2.0,
1: nan,
2: nan,
3: nan,
4: 4.0,
5: 4.0,
6: 3.0,
7: nan,
8: nan,
9: nan,
10: 5.0,
11: 2.0,
12: nan,
13: 3.0,
14: nan,
15: 1.0,
16: nan,
17: 1.0,
18: nan,
19: 2.0},
'318: Shawshank Redemption, The (1994)': {0: nan,
1: 2.0,
2: 5.0,
3: nan,
4: 1.0,
5: 4.0,
6: 1.0,
7: nan,
8: 4.0,
9: 5.0,
10: nan,
11: nan,
12: 5.0,
13: nan,
14: nan,
15: nan,
16: nan,
17: 5.0,
18: nan,
19: 4.0},
'593: Silence of the Lambs, The (1991)': {0: 4.0,
1: 4.0,
2: 2.0,
3: nan,
4: 4.0,
5: nan,
6: 1.0,
7: 3.0,
8: 2.0,
9: 3.0,
10: nan,
11: 2.0,
12: 4.0,
13: 2.0,
14: 5.0,
15: 3.0,
16: 4.0,
17: 1.0,
18: nan,
19: 5.0},
'3578: Gladiator (2000)': {0: 4.0,
1: 2.0,
2: nan,
3: 1.0,
4: 1.0,
5: nan,
6: 4.0,
7: 2.0,
8: 4.0,
9: nan,
10: 5.0,
11: nan,
12: nan,
13: nan,
14: 5.0,
15: 2.0,
16: nan,
17: 1.0,
18: 4.0,
19: nan},
'1: Toy Story (1995)': {0: 2.0,
1: 1.0,
2: 4.0,
3: 2.0,
4: nan,
5: 3.0,
6: nan,
7: 2.0,
8: 4.0,
9: 4.0,
10: 5.0,
11: 2.0,
12: 4.0,
13: 3.0,
14: 2.0,
15: nan,
16: 2.0,
17: 4.0,
18: 2.0,
19: 2.0},
'2028: Saving Private Ryan (1998)': {0: 2.0,
1: nan,
2: nan,
3: 3.0,
4: 4.0,
5: 1.0,
6: 5.0,
7: nan,
8: 4.0,
9: 3.0,
10: nan,
11: nan,
12: 5.0,
13: nan,
14: nan,
15: 2.0,
16: nan,
17: nan,
18: 1.0,
19: 3.0},
'296: Pulp Fiction (1994)': {0: nan,
1: nan,
2: nan,
3: 4.0,
4: nan,
5: 4.0,
6: 2.0,
7: 3.0,
8: nan,
9: 4.0,
10: nan,
11: 1.0,
12: nan,
13: nan,
14: 3.0,
15: nan,
16: 2.0,
17: 5.0,
18: 3.0,
19: 2.0},
'1259: Stand by Me (1986)': {0: 3.0,
1: 4.0,
2: 1.0,
3: nan,
4: 1.0,
5: 4.0,
6: nan,
7: nan,
8: 1.0,
9: nan,
10: nan,
11: nan,
12: nan,
13: 4.0,
14: 5.0,
15: 1.0,
16: nan,
17: nan,
18: 3.0,
19: 2.0},
'2396: Shakespeare in Love (1998)': {0: 2.0,
1: 3.0,
2: nan,
3: nan,
4: 5.0,
5: 5.0,
6: 1.0,
7: nan,
8: 2.0,
9: nan,
10: nan,
11: 3.0,
12: nan,
13: nan,
14: nan,
15: 5.0,
16: 2.0,
17: nan,
18: 3.0,
19: 1.0},
'2916: Total Recall (1990)': {0: nan,
1: 2.0,
2: 1.0,
3: 4.0,
4: 1.0,
5: 2.0,
6: nan,
7: 2.0,
8: 3.0,
9: nan,
10: 3.0,
11: nan,
12: 2.0,
13: 1.0,
14: 1.0,
15: nan,
16: nan,
17: nan,
18: 1.0,
19: nan},
'780: Independence Day (ID4) (1996)': {0: 5.0,
1: 2.0,
2: 4.0,
3: 1.0,
4: nan,
5: 4.0,
6: nan,
7: 3.0,
8: 1.0,
9: 2.0,
10: 2.0,
11: 3.0,
12: 4.0,
13: 2.0,
14: 3.0,
15: nan,
16: nan,
17: nan,
18: nan,
19: nan},
'541: Blade Runner (1982)': {0: 2.0,
1: nan,
2: 4.0,
3: 3.0,
4: 4.0,
5: nan,
6: 3.0,
7: 2.0,
8: nan,
9: nan,
10: nan,
11: nan,
12: nan,
13: 2.0,
14: nan,
15: nan,
16: nan,
17: 4.0,
18: nan,
19: 5.0},
'1265: Groundhog Day (1993)': {0: nan,
1: 2.0,
2: 1.0,
3: 5.0,
4: nan,
5: 1.0,
6: nan,
7: 4.0,
8: 5.0,
9: nan,
10: nan,
11: 2.0,
12: 3.0,
13: 3.0,
14: 2.0,
15: 5.0,
16: nan,
17: nan,
18: nan,
19: 5.0},
'2571: Matrix, The (1999)': {0: 4.0,
1: nan,
2: 1.0,
3: nan,
4: 3.0,
5: nan,
6: 1.0,
7: nan,
8: nan,
9: 2.0,
10: 1.0,
11: 5.0,
12: nan,
13: 5.0,
14: nan,
15: 2.0,
16: 4.0,
17: nan,
18: 2.0,
19: 4.0},
"527: Schindler's List (1993)": {0: 2.0,
1: 5.0,
2: 2.0,
3: 5.0,
4: 5.0,
5: nan,
6: nan,
7: 1.0,
8: nan,
9: 5.0,
10: nan,
11: nan,
12: nan,
13: 1.0,
14: 3.0,
15: 2.0,
16: nan,
17: 2.0,
18: nan,
19: 3.0},
'2762: Sixth Sense, The (1999)': {0: 5.0,
1: 1.0,
2: 3.0,
3: 1.0,
4: 5.0,
5: 3.0,
6: nan,
7: 3.0,
8: nan,
9: 1.0,
10: 2.0,
11: nan,
12: nan,
13: nan,
14: nan,
15: 4.0,
16: nan,
17: 1.0,
18: nan,
19: 5.0},
'1198: Raiders of the Lost Ark (1981)': {0: nan,
1: 3.0,
2: 1.0,
3: 1.0,
4: nan,
5: nan,
6: 5.0,
7: 5.0,
8: nan,
9: nan,
10: 1.0,
11: nan,
12: 5.0,
13: nan,
14: 3.0,
15: 3.0,
16: nan,
17: 2.0,
18: nan,
19: 3.0},
'34: Babe (1995)': {0: nan,
1: nan,
2: 3.0,
3: 2.0,
4: nan,
5: 2.0,
6: 2.0,
7: nan,
8: 5.0,
9: nan,
10: 4.0,
11: 2.0,
12: nan,
13: nan,
14: 1.0,
15: 4.0,
16: nan,
17: 5.0,
18: nan,
19: nan}}
I want to find the movies that most often occur with movie 1 (Toy Story). In other words, for each movie, I want to calculate the percentage of Toy Story raters who also rated that movie. If there are ties, I want the lowest-numbered movie to take the higher rank. For example, if movies 541 and 318 are tied, then 318 gets the higher rank.
I have tried to do this with a subset of the dataframe in which Toy Story has no null ratings:

data_subset = data[data['1: Toy Story (1995)'].notnull()]

and then attempted to get the percentage via:

((data_subset.count() + data_subset['1: Toy Story (1995)'].count()) / data_subset['1: Toy Story (1995)'].count()).sort_values(ascending=False)

The ranking seems to be correct, but the percentage values do not.
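The `+` in the numerator adds the Toy Story count to every column's count, which inflates every percentage; dividing each column's count directly by the number of Toy Story raters gives the intended ratio. A minimal sketch on a hypothetical four-user frame (not the full data above):

```python
import numpy as np
import pandas as pd

# Hypothetical ratings for four users and three movies.
data = pd.DataFrame({
    '1: Toy Story (1995)': [2.0, 1.0, np.nan, 4.0],
    '318: Shawshank Redemption, The (1994)': [np.nan, 2.0, 5.0, 1.0],
    '541: Blade Runner (1982)': [2.0, np.nan, 4.0, 3.0],
})

# Keep only the rows where Toy Story was rated.
subset = data[data['1: Toy Story (1995)'].notnull()]

# For each movie: non-null ratings among Toy Story raters,
# divided by the number of Toy Story raters.
pct = (subset.count() / subset['1: Toy Story (1995)'].count()).sort_values(ascending=False)
print(pct)
```

Here three users rated Toy Story; two of them also rated each of the other movies, so Toy Story scores 1.0 and the other two movies score 2/3.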
I am not sure I fully understand your question. What I have is the following: for each movie, find the most associated viewed movie; in the case of a tie, follow your logic.

I am assuming your dataframe name is df and the nan values are actually np.nan.
import re  # for extracting the leading movie number

import numpy as np
import pandas as pd

# Indicator frame over the movie columns only: 1 where a rating exists, 0 where it was NaN.
df2 = df.drop(columns=['User', 'Gender (1 =F, 0=M)']).notna().astype(int)

dfs = []  # one single-row result frame per movie
for movie in df2.columns:  # iterate over all movies
    # Group by whether each user saw `movie`; the mean of the other indicator
    # columns in group 1 is the fraction of those users who saw each other movie.
    tmp = df2.groupby(movie).mean().T[[1]].reset_index()
    # Extract the leading movie number for tie-breaking.
    tmp['Movie_num'] = [int(re.sub(r'(^\d+).*', r'\1', el)) for el in tmp['index']]
    # Highest fraction first; among ties, the lowest movie number wins.
    tmp = tmp.sort_values([1, 'Movie_num'], ascending=[False, True]).head(1)[['index', 1]]
    tmp.columns.name = None
    tmp.index = [movie]
    tmp.index.name = 'Movie'
    tmp.columns = ['Frequent_Movie', 'Frequency']
    dfs.append(tmp)

df_final = pd.concat(dfs).reset_index()
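The tie-breaking step in the loop can be checked in isolation. A minimal sketch with two hypothetical tied movies, showing that the lower-numbered movie ends up ranked first:

```python
import re

import pandas as pd

# Two movies with identical (tied) frequencies.
tmp = pd.DataFrame({
    'title': ['541: Blade Runner (1982)',
              '318: Shawshank Redemption, The (1994)'],
    'freq': [0.5, 0.5],
})

# The number before the colon is the movie id used for tie-breaking.
tmp['movie_num'] = [int(re.sub(r'(^\d+).*', r'\1', t)) for t in tmp['title']]

# Highest frequency first; among ties, the lowest movie number wins.
tmp = tmp.sort_values(['freq', 'movie_num'], ascending=[False, True])
winner = tmp.iloc[0]['title']
print(winner)
```

Since both frequencies are 0.5, the secondary ascending sort on `movie_num` puts movie 318 ahead of 541, matching the rule stated in the question.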