How to fetch data from an Excel sheet and get the output in a set format?
I'm making a movie recommendation system. I need Python code that converts the data imported from an Excel sheet to a set format (the nested dictionary shown below).
Code to import data from the Excel sheet:
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
# `sheetname` was renamed to `sheet_name` in later pandas releases
df = pd.read_excel('project.xlsx', sheet_name='Sheet1')
df.head(40)
Output I get:
USER MOVIE RATINGS
0 Julia Roberts Shrek 2.5
1 NaN V for Vendetta 3.5
2 NaN Pretty Woman 3.0
3 NaN Star Wars 3.5
4 NaN While You Were Sleeping 2.5
5 NaN Phone Booth 3.0
6 Drew Barrymore Shrek 3.0
7 NaN V for Vendetta 3.5
8 NaN Pretty Woman 1.5
9 NaN Star Wars 5.0
10 NaN Phone Booth 3.0
11 NaN While You Were Sleeping 3.5
12 Kate Winslet Shrek 2.5
13 NaN V for Vendetta 3.0
14 NaN Star Wars 3.5
15 NaN Phone Booth 4.0
16 Tom Hanks While You Were Sleeping 2.5
17 NaN V for Vendetta 3.5
18 NaN Pretty Woman 3.0
19 NaN Star Wars 4.0
20 NaN Phone Booth 4.5
...
From here I need to have an output like this:
dataset = {
    'Julia Roberts': {
        'Shrek': 2.5,
        'I am Legend': 3.0,
        'V for Vendetta': 3.5,
        'Pretty Woman': 0,
        "My Sister's Keeper": 5.0,
        'Star Wars': 3.5,
        'Me Before You': 3.0,
        'While You Were Sleeping': 2.5,
        'Phone Booth': 3.0,
    },
    'Drew Barrymore': {
        'Shrek': 3.0,
        'V for Vendetta': 3.5,
        'Pretty Woman': 1.5,
        "My Sister's Keeper": 4.0,
        'Star Wars': 5.0,
        'Phone Booth': 3.0,
        'While You Were Sleeping': 3.5,
    },
    'Tom Hanks': {
        'V for Vendetta': 3.5,
        'Pretty Woman': 3.0,
        'Phone Booth': 4.5,
        'Star Wars': 4.0,
        'While You Were Sleeping': 2.5,
        'I am Legend': 3.5,
    },
    'Sandra Bullock': {
        'Shrek': 3.0,
        'V for Vendetta': 4.0,
        'Pretty Woman': 2.0,
        'Star Wars': 3.0,
        'I am Legend': 4.5,
        "My Sister's Keeper": 3.5,
        'Phone Booth': 3.0,
        'While You Were Sleeping': 2.0,
    },
}
Code I am using (but showing error):
max_nb_row = 0
for sheet in df.sheets():
    max_nb_row = max(max_nb_row, sheet.nrows)
for row in range(max_nb_row):
    for sheet in df.sheets():
        if row < sheet.nrows:
            print(sheet.row(row))
You can use this incomprehensible one-liner:
df.ffill().groupby('user').apply(lambda x: dict(zip(x['movie'], x['ratings']))).to_dict()
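To visualize what's happening, we'll use this smaller dataframe. Its column names are lowercase; the sheet in the question uses USER, MOVIE and RATINGS, so adjust the names in the one-liner accordingly.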
>>> df
user movie ratings
0 Julia Roberts Shrek 2.5
1 NaN V for Vendetta 3.5
2 NaN Pretty Woman 3.0
3 Drew Barrymore Shrek 3.0
4 NaN V for Vendetta 3.5
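To follow along without the Excel file, the smaller dataframe can be built by hand. This is only a sketch: the blank cells of the sheet are written as NaN explicitly, which is an assumption about how they arrive after `read_excel`.

```python
import numpy as np
import pandas as pd

# Blank "user" cells from the sheet are represented here as NaN (assumption).
df = pd.DataFrame({
    "user": ["Julia Roberts", np.nan, np.nan, "Drew Barrymore", np.nan],
    "movie": ["Shrek", "V for Vendetta", "Pretty Woman", "Shrek", "V for Vendetta"],
    "ratings": [2.5, 3.5, 3.0, 3.0, 3.5],
})

print(df)
```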
Step by step, this is what happens:
Use ffill to replace the NaN values in the user column with the name above.
user movie ratings
0 Julia Roberts Shrek 2.5
1 Julia Roberts V for Vendetta 3.5
2 Julia Roberts Pretty Woman 3.0
3 Drew Barrymore Shrek 3.0
4 Drew Barrymore V for Vendetta 3.5
Use groupby('user') to group the data by user.
Use apply(lambda x: dict(zip(x['movie'], x['ratings']))) to create dicts of {movie: rating} pairs.
user
Drew Barrymore {'Shrek': 3.0, 'V for Vendetta': 3.5}
Julia Roberts {'Shrek': 2.5, 'V for Vendetta': 3.5, 'Pretty ...
dtype: object
Call to_dict() on the resulting Series to get the desired result.
{'Drew Barrymore': {'Shrek': 3.0, 'V for Vendetta': 3.5}, 'Julia Roberts': {'Pretty Woman': 3.0, 'Shrek': 2.5, 'V for Vendetta': 3.5}}
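For readers who prefer something less dense than the one-liner, the same steps can be spelled out as a small function. This is a sketch, not the answer's own code: `ratings_to_dict` is a hypothetical helper name, it swaps `apply` for a dict comprehension over the groups, it forward-fills only the user column, and its default column names assume the uppercase headers shown in the question's output.

```python
import numpy as np
import pandas as pd


def ratings_to_dict(df, user_col="USER", movie_col="MOVIE", rating_col="RATINGS"):
    """Hypothetical helper: build {user: {movie: rating}} from a sheet-like frame."""
    filled = df.copy()
    # Forward-fill only the user column so gaps in other columns stay visible.
    filled[user_col] = filled[user_col].ffill()
    # sort=False keeps users in the order they first appear in the sheet.
    return {
        user: dict(zip(group[movie_col], group[rating_col]))
        for user, group in filled.groupby(user_col, sort=False)
    }


# A few rows in the shape of the question's output, with its uppercase headers.
df = pd.DataFrame({
    "USER": ["Julia Roberts", np.nan, np.nan, "Drew Barrymore", np.nan],
    "MOVIE": ["Shrek", "V for Vendetta", "Pretty Woman", "Shrek", "V for Vendetta"],
    "RATINGS": [2.5, 3.5, 3.0, 3.0, 3.5],
})

dataset = ratings_to_dict(df)
print(dataset)
```

If a user rated the same movie twice, the later row overwrites the earlier one, which is also how `dict(zip(...))` behaves in the one-liner.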