I have two dataframes. The first one has one row per customer, with a column holding a list of every restaurant he/she visited.
In [1]: df_customers
Out[1]:
Document Restaurants
0 '000000984 [20504916171, 20504916171, 20499859164]
1 '000010076 [20505918674, 20505918674, 20505918674]
2 '000010319 [20253346711, 20524403863, 20508246677]
3 '000018468 [20253346711, 20538456226, 20505918674]
4 '000024409 [20553255881, 20553596441, 20553255881]
5 '000025944 [20492255719, 20600654226]
6 '000031162 [20600351398, 20408462399, 20499859164]
7 '000055177 [20524403863, 20524403863]
8 '000058303 [20600997239, 20524403863, 20600997239]
9 '000074791 [20517920178, 20517920178, 20517920178]
In my other dataframe I have a column with the restaurants and another with a given value for each:
In [2]: df_rest
Out[2]:
Restaurant Points
0 10026575473 1
1 10037003331 1
2 10072208299 1
3 10179698400 2
4 10214262750 1
I need to create a column in my customers dataframe with the sum of the points given to each restaurant he/she visited.
I tried something like this:
df_customers["Sum"]=df_rest.loc[df_rest["Restaurant"].isin(df_customers["Restaurants"]),"Points"].sum()
But I'm getting this error:
TypeError: unhashable type: 'list'
I'm trying not to iterate over my customers dataframe because it takes too long. Any help?
Aim not to use lists within Pandas series. Using lists removes the possibility of vectorised operations. It is more efficient to expand your jagged array of restaurant lists into a single dataframe, then map to points via a dictionary and sum.
Here's a minimal example:
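import pandas as pd

df1 = pd.DataFrame({'Document': [1, 2],
                    'Restaurants': [[20504916171, 20504916171, 20499859164],
                                    [20505918674, 20505918674]]})
df2 = pd.DataFrame({'Restaurant': [20504916171, 20504916171, 20499859164,
                                   20505918674, 20505918674],
                    'Points': [1, 2, 1, 3, 2]})
# Restaurant -> points lookup; for duplicated ids the last row wins
ratmap = df2.set_index('Restaurant')['Points'].to_dict()
# Expand the lists into a rectangular frame (short rows are padded with NaN),
# look up each id, treat missing values as 0 and sum across each row.
# On pandas >= 2.1, DataFrame.map replaces the deprecated applymap.
df1['score'] = pd.DataFrame(df1['Restaurants'].values.tolist())\
                 .applymap(ratmap.get).fillna(0).sum(1).astype(int)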
print(df1)
Document Restaurants score
0 1 [20504916171, 20504916171, 20499859164] 5
1 2 [20505918674, 20505918674] 4
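On pandas 0.25 or later, the same expand / map / sum idea can also be sketched with `Series.explode`, which keeps the original row label on every expanded value. This is an alternative to the approach above, not a replacement for it, and it assumes `df2` holds one row per restaurant (the ids and points below are made up to reproduce the same scores):

```python
import pandas as pd

df1 = pd.DataFrame({
    'Document': [1, 2],
    'Restaurants': [[20504916171, 20504916171, 20499859164],
                    [20505918674, 20505918674]],
})
# Assumption: unique restaurant ids, so the lookup is unambiguous.
df2 = pd.DataFrame({'Restaurant': [20504916171, 20499859164, 20505918674],
                    'Points': [2, 1, 2]})

points = df2.set_index('Restaurant')['Points']

# One row per visit; the index still holds the label of the source row.
exploded = df1['Restaurants'].explode()

df1['score'] = (exploded.map(points)        # points for each visit
                        .fillna(0)          # unknown restaurants count as 0
                        .groupby(level=0)   # regroup by original row label
                        .sum()
                        .astype(int))
print(df1)
```

Because the result is indexed by the original row labels, the assignment aligns with `df1` without a merge.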
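I would first expand the df into one row per restaurant visit:
# Repeat each column's values once per restaurant in that row's list
d = {c: df_customers[c].values.repeat(df_customers.Restaurants.str.len(), axis=0) for c in df_customers.columns}
# Overwrite the repeated lists with the flattened restaurant ids
d['Restaurants'] = [i for sub in df_customers.Restaurants for i in sub]
df3 = pd.DataFrame(d)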
Document Restaurants
0 000000984 20504916171
1 000000984 20504916171
2 000000984 20499859164
3 000010076 20505918674
4 000010076 20505918674
5 000010076 20505918674
6 000010319 20253346711
7 000010319 20524403863
Then map each restaurant to its points (0 if it is not in df_rest):
df3['Point'] = df3.Restaurants.map(df_rest.set_index('Restaurant').Points).fillna(0)
Document Restaurants Point
0 000000984a 20504916171 1
1 000000984a 20504916171 1
2 000000984a 20499859164 0
3 000010076a 20505918674 0
4 000010076a 20505918674 0
5 000010076a 20505918674 0
Then group by Document and sum:
df3.groupby('Document').sum()
Restaurants Point
Document
000000984 61509691506 2.0
000010076 61517756022 0.0
000010319 61285997251 0.0
000018468 61297721611 0.0
Values are mocked, because no restaurant id from your df_customers is present in your df_rest in the example you provided.
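To get the totals back into the customers dataframe, which is what the question asks for, the steps above could be finished roughly as follows. This is a minimal sketch on made-up ids and points; the column name `Sum` comes from the question, and summing only the `Point` column avoids adding up the restaurant ids as well:

```python
import pandas as pd

# Mock data: ids and points are invented for illustration only.
df_customers = pd.DataFrame({
    'Document': ['000000984', '000010076'],
    'Restaurants': [[20504916171, 20504916171, 20499859164],
                    [20505918674, 20505918674, 20505918674]],
})
df_rest = pd.DataFrame({'Restaurant': [20504916171, 20499859164],
                        'Points': [1, 3]})

# Expand to one row per (customer, restaurant) visit.
lengths = df_customers['Restaurants'].str.len()
df3 = pd.DataFrame({
    'Document': df_customers['Document'].values.repeat(lengths),
    'Restaurants': [r for sub in df_customers['Restaurants'] for r in sub],
})

# Look up points; restaurants missing from df_rest count as 0.
lookup = df_rest.set_index('Restaurant')['Points']
df3['Point'] = df3['Restaurants'].map(lookup).fillna(0)

# Sum only the Point column, then attach the totals by Document.
totals = df3.groupby('Document')['Point'].sum()
df_customers['Sum'] = df_customers['Document'].map(totals)
print(df_customers)
```

Mapping on `Document` assumes each customer appears once in `df_customers`; with repeated documents, the visits of all their rows would be pooled into one total.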