
Giving a value to a dataframe column depending on another dataframe's values in Python

I have two dataframes. In the first one I have the customers and a column with a list of every restaurant he/she visited.

In [1]: df_customers
Out[1]:

              Document   Restaurants
    0        '000000984  [20504916171, 20504916171, 20499859164]
    1        '000010076  [20505918674, 20505918674, 20505918674]
    2        '000010319  [20253346711, 20524403863, 20508246677]
    3        '000018468  [20253346711, 20538456226, 20505918674]
    4        '000024409  [20553255881, 20553596441, 20553255881]
    5        '000025944  [20492255719, 20600654226]
    6        '000031162  [20600351398, 20408462399, 20499859164]
    7        '000055177  [20524403863, 20524403863]
    8        '000058303  [20600997239, 20524403863, 20600997239]
    9        '000074791  [20517920178, 20517920178, 20517920178]

In my other dataframe I have a column with the restaurants and another with a given value for each:

In [2]: df_rest
Out[2]:

   Restaurant     Points
0  10026575473    1
1  10037003331    1
2  10072208299    1
3  10179698400    2
4  10214262750    1

I need to create a column in my customers dataframe with the sum of the points given to each restaurant he/she visited.

I tried something like this:

df_customers["Sum"]=df_rest.loc[df_rest["Restaurant"].isin(df_customers["Restaurants"]),"Points"].sum()

But I'm getting this error:

TypeError: unhashable type: 'list'

I'm trying to avoid iterating over my customers dataframe, since it takes too long. Any help?
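For context, the error comes from isin itself: it builds a hash table from the values passed to it, and the elements of df_customers["Restaurants"] are Python lists, which are not hashable. A minimal reproduction with made-up ids:

import pandas as pd

restaurants = pd.Series([20504916171, 20253346711])
visits = pd.Series([[20504916171, 20499859164], [20253346711]])   # a column of lists

try:
    restaurants.isin(visits)
except TypeError as exc:
    print(exc)   # unhashable type: 'list'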

Aim not to use lists within pandas Series. Using lists removes the possibility of vectorised operations. It is more efficient to expand your jagged array of restaurant lists into a single dataframe, then map to points via a dictionary and sum.

Here's a minimal example:

import pandas as pd

df1 = pd.DataFrame({'Document': [1, 2],
                    'Restaurants': [[20504916171, 20504916171, 20499859164],
                                    [20505918674, 20505918674]]})

df2 = pd.DataFrame({'Restaurant': [20504916171, 20504916171, 20499859164,
                                   20505918674, 20505918674],
                    'Points': [1, 2, 1, 3, 2]})

# Map restaurant id -> points (with duplicate ids, the last value wins)
ratmap = df2.set_index('Restaurant')['Points'].to_dict()

# Expand each restaurant list into its own column, look up points, then sum per row
df1['score'] = pd.DataFrame(df1['Restaurants'].values.tolist())\
                 .applymap(ratmap.get).fillna(0).sum(1).astype(int)

print(df1)

   Document                              Restaurants  score
0         1  [20504916171, 20504916171, 20499859164]      5
1         2               [20505918674, 20505918674]      4
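As a side note, newer pandas (0.25 and later) can do the "expand the jagged array" step directly with DataFrame.explode. A minimal sketch, reusing the df1 and ratmap names from the example above (the expanded and pts names are just for illustration):

# explode turns each list element into its own row, keeping the original index
expanded = df1.explode('Restaurants')
expanded['pts'] = expanded['Restaurants'].map(ratmap).fillna(0)
df1['score'] = expanded.groupby(level=0)['pts'].sum().astype(int)   # same result: 5 and 4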

I would first expand the df into:

# Repeat each customer row once per restaurant visited, then flatten the lists of ids
d = {c: df_customers[c].values.repeat(df_customers.Restaurants.str.len(), axis=0)
     for c in df_customers.columns}
d['Restaurants'] = [i for sub in df_customers.Restaurants for i in sub]
df3 = pd.DataFrame(d)

    Document    Restaurants
0   000000984   20504916171
1   000000984   20504916171
2   000000984   20499859164
3   000010076   20505918674
4   000010076   20505918674
5   000010076   20505918674
6   000010319   20253346711
7   000010319   20524403863

Then map:

# Look up each restaurant's points; restaurants missing from df_rest get 0
df3['Point'] = df3.Restaurants.map(df_rest.set_index('Restaurant').Points).fillna(0)


    Document    Restaurants Point
0   000000984   20504916171     1
1   000000984   20504916171     1
2   000000984   20499859164     0
3   000010076   20505918674     0
4   000010076   20505918674     0
5   000010076   20505918674     0

Then group by Document and sum:

df3.groupby('Document').sum() 

            Restaurants Point
Document        
000000984   61509691506 2.0
000010076   61517756022 0.0
000010319   61285997251 0.0
000018468   61297721611 0.0

Values are mocked, because no restaurant id from your df_customers is present in your df_rest in the example you provided.
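If you then want this total as a Sum column on df_customers, as in your original attempt, one option is to map the grouped result back by Document. A sketch assuming the df3 built above:

# Sum points per customer, then align the totals back to df_customers by Document
points_per_doc = df3.groupby('Document')['Point'].sum()
df_customers['Sum'] = df_customers['Document'].map(points_per_doc).fillna(0)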
