
Sum rows from two different data frames based on the value of columns

I have two data frames:

df1

            ID  Year Primary_Location Secondary_Location  Sales
0           11  2023          NewYork            Chicago    100
1           11  2023             Lyon      Chicago,Paris    200
2           11  2023           Berlin              Paris    300
3           12  2022          Newyork            Chicago    150
4           12  2022             Lyon      Chicago,Paris    250
5           12  2022           Berlin              Paris    400

df2

            ID  Year Primary_Location  Sales
0           11  2023          Chicago    150
1           11  2023            Paris    200
2           12  2022          Chicago    300
3           12  2022            Paris    350

For each group with the same ID and Year, I would like to add Sales from df2 to Sales in df1 wherever Primary_Location in df2 appears in (is contained in) Secondary_Location in df1.

For example, for ID=11 and Year=2023, the Sales for Lyon would be increased by the df2 Sales for Chicago and for Paris.

The new Sales for that row (Lyon) would be 200 + 150 + 200 = 550.

The expected output would be:

df_primary_output



            ID  Year Primary_Location Secondary_Location  Sales
0           11  2023          NewYork            Chicago    250
1           11  2023             Lyon      Chicago,Paris    550
2           11  2023           Berlin              Paris    500
3           12  2022          Newyork            Chicago    400
4           12  2022             Lyon      Chicago,Paris    900
5           12  2022           Berlin              Paris    750

Here are the dataframes to start with:

import pandas as pd

df1 = pd.DataFrame({'ID': [11, 11, 11, 12, 12, 12],
                   'Year': [2023, 2023, 2023, 2022, 2022, 2022],
                   'Primary_Location': ['NewYork', 'Lyon', 'Berlin', 'Newyork', 'Lyon', 'Berlin'],
                   'Secondary_Location': ['Chicago', 'Chicago,Paris', 'Paris', 'Chicago', 'Chicago,Paris', 'Paris'],
                   'Sales': [100, 200, 300, 150, 250, 400]
                   })

df2 = pd.DataFrame({'ID': [11, 11, 12, 12],
                   'Year': [2023, 2023, 2022, 2022],
                   'Primary_Location': ['Chicago', 'Paris', 'Chicago', 'Paris'],
                   'Sales': [150, 200, 300, 350]
                   })

EDIT: pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects

It would be great if the solution could work for these inputs as well:

df1

       Day  ID  Year Primary_Location Secondary_Location  Sales
0       1   11  2023          NewYork            Chicago    100
1       1   11  2023           Berlin            Chicago    300
2       1   11  2022          Newyork            Chicago    150
3       1   11  2022           Berlin            Chicago    400

df2

     Day    ID  Year Primary_Location  Sales
0     1     11  2023          Chicago    150
1     1     11  2022          Chicago    300

The expected output would be:

df_primary_output



       Day  ID  Year Primary_Location Secondary_Location  Sales
0       1   11  2023          NewYork            Chicago    250
1       1   11  2023           Berlin            Chicago    450
2       1   11  2022          Newyork            Chicago    450
3       1   11  2022           Berlin            Chicago    700

Your question is not so easy...

Proposed script:

import pandas as pd

df1 = pd.DataFrame({'ID': [11, 11, 11, 12, 12, 12],
                   'Year': [2023, 2023, 2023, 2022, 2022, 2022],
                   'Primary_Location': ['NewYork', 'Lyon', 'Berlin', 'Newyork', 'Lyon', 'Berlin'],
                   'Secondary_Location': ['Chicago', 'Chicago,Paris', 'Paris', 'Chicago', 'Chicago,Paris', 'Paris'],
                   'Sales': [100, 200, 300, 150, 250, 400]
                   })

df2 = pd.DataFrame({'ID': [11, 11, 12, 12],
                   'Year': [2023, 2023, 2022, 2022],
                   'Primary_Location': ['Chicago', 'Paris', 'Chicago', 'Paris'],
                   'Sales': [150, 200, 300, 350]
                   })

tot = []
def func(g, iterdf, len_df1, i = 0):
    global tot
    kv = {g['Primary_Location'].iloc[i]:g['Sales'].iloc[i] for i in range(len(g))}
    while i < len_df1:
        row = next(iterdf)[1]
        # Select the df1 rows to modify: same ID and Year as this df2 group
        # (iloc[0] rather than iloc[1], so single-row groups also work)
        if g['ID'].iloc[0]==row['ID'] and g['Year'].iloc[0]==row['Year']:
            tot.append(row['Sales'] + sum([kv[town] for town in row['Secondary_Location'].split(',') if town in kv]))
        i+=1

df2.groupby(['ID', 'Year']).apply(lambda g: func(g, df1.iterrows(), len(df1)))
df1['Sales'] = tot
print(df1)

Result:

   ID  Year Primary_Location Secondary_Location  Sales
0  11  2023          NewYork            Chicago    250
1  11  2023             Lyon      Chicago,Paris    550
2  11  2023           Berlin              Paris    500
3  12  2022          Newyork            Chicago    450
4  12  2022             Lyon      Chicago,Paris    900
5  12  2022           Berlin              Paris    750

kv holds a dictionary like this for each (ID, Year) group of df2:

call 1 - {'Chicago': 150, 'Paris': 200}
call 2 - {'Chicago': 300, 'Paris': 350}
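The per-group lookup can also be built without indexing by position. This is only a minimal sketch, assuming the `df2` from the question; the list name `kvs` is introduced here for illustration.

```python
import pandas as pd

df2 = pd.DataFrame({'ID': [11, 11, 12, 12],
                    'Year': [2023, 2023, 2022, 2022],
                    'Primary_Location': ['Chicago', 'Paris', 'Chicago', 'Paris'],
                    'Sales': [150, 200, 300, 350]})

kvs = []  # illustrative name: one lookup dict per (ID, Year) group
for (id_, year), g in df2.groupby(['ID', 'Year']):
    # Map each df2 location in this group to its Sales value (as a plain int)
    kv = {loc: int(sales) for loc, sales in zip(g['Primary_Location'], g['Sales'])}
    kvs.append(kv)
    print(id_, year, kv)
```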

Are you sure about the expected result in row 3? My script finds 450 (150 + 300), not 400.

This should work:

s = 'Secondary_Location'
(df1.assign(Secondary_Location = lambda x: x[s].str.split(','))
.explode(s)
.join(df2.set_index(['ID','Year','Primary_Location'])['Sales'].rename('Sales_2'),on = ['ID','Year',s])
.groupby(level=0)['Sales_2'].sum()
.add(df1['Sales']))
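The expression above returns a Series aligned with the index of df1, not a full data frame. A minimal sketch of writing it back, assuming the `df1` and `df2` from the question (`df_primary_output` is the name used in the question's expected output; `new_sales` is introduced here for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [11, 11, 11, 12, 12, 12],
                    'Year': [2023, 2023, 2023, 2022, 2022, 2022],
                    'Primary_Location': ['NewYork', 'Lyon', 'Berlin', 'Newyork', 'Lyon', 'Berlin'],
                    'Secondary_Location': ['Chicago', 'Chicago,Paris', 'Paris', 'Chicago', 'Chicago,Paris', 'Paris'],
                    'Sales': [100, 200, 300, 150, 250, 400]})

df2 = pd.DataFrame({'ID': [11, 11, 12, 12],
                    'Year': [2023, 2023, 2022, 2022],
                    'Primary_Location': ['Chicago', 'Paris', 'Chicago', 'Paris'],
                    'Sales': [150, 200, 300, 350]})

s = 'Secondary_Location'
new_sales = (df1.assign(Secondary_Location=lambda x: x[s].str.split(','))
             .explode(s)                       # one row per secondary location
             .join(df2.set_index(['ID', 'Year', 'Primary_Location'])['Sales'].rename('Sales_2'),
                   on=['ID', 'Year', s])       # look up the df2 Sales for that location
             .groupby(level=0)['Sales_2'].sum()  # collapse back to one value per df1 row
             .add(df1['Sales']))

# Replace the Sales column without modifying df1 itself
df_primary_output = df1.assign(Sales=new_sales.astype(int))
print(df_primary_output)
```

Secondary locations with no match in df2 contribute 0, because `sum()` skips missing values.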

or

# Split Secondary_Location into a list and explode it so each row has one value
df3 = (df1.assign(Secondary_Location = df1['Secondary_Location'].str.split(','))
.explode('Secondary_Location'))


(df3[['ID','Year','Secondary_Location']]
# Combine ID, Year and Secondary_Location into a tuple so it can be mapped onto the lookup series below
.apply(tuple,axis=1)
# Lookup series: Sales from df2, indexed by (ID, Year, Primary_Location)
.map(df2.set_index(['ID','Year','Primary_Location'])['Sales'])
# Exploding repeated the index, so groupby(level=0).sum() collapses back to one row per original row
.groupby(level=0).sum()
# Add the original Sales column
.add(df1['Sales']))

Original Answer:

s = 'Secondary_Location'
(df1.assign(Secondary_Location = lambda x: x[s].str.split(','))
.explode(s)
.join(df2.set_index(['ID','Year','Primary_Location'])['Sales'].rename('Sales_2'),on = ['ID','Year',s])
.groupby(level=0)
.agg({**dict.fromkeys(df1,'first'),**{s:','.join,'Sales_2':'sum'}})
.assign(Sales = lambda x: x['Sales'] + x['Sales_2'])
.drop('Sales_2',axis=1))

Output:

   ID  Year Primary_Location Secondary_Location  Sales
0  11  2023          NewYork            Chicago    250
1  11  2023             Lyon      Chicago,Paris    550
2  11  2023           Berlin              Paris    500
3  12  2022          Newyork            Chicago    450
4  12  2022             Lyon      Chicago,Paris    900
5  12  2022           Berlin              Paris    750
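For the second set of inputs in the question, the lookup also needs the Day column. The following is only a sketch under assumptions: the helper name `add_secondary_sales` and its `keys` parameter are hypothetical, and df1 is assumed to have a default, unnamed index. It first sums duplicate rows in df2, on the assumption (not confirmed by the question) that repeated lookup keys are what triggers the InvalidIndexError.

```python
import pandas as pd

def add_secondary_sales(df1, df2, keys):
    """Hypothetical helper: add df2 Sales to df1 Sales for each listed secondary location."""
    # Collapse duplicate (keys, location) rows in df2 so every lookup key is unique
    lookup = (df2.groupby(keys + ['Primary_Location'], as_index=False)['Sales'].sum()
                 .rename(columns={'Primary_Location': 'Secondary_Location',
                                  'Sales': 'Sales_2'}))
    # One row per (df1 row, secondary location); reset_index keeps the row label in 'index'
    exploded = (df1.assign(Secondary_Location=df1['Secondary_Location'].str.split(','))
                   .explode('Secondary_Location')
                   .reset_index())
    merged = exploded.merge(lookup, on=keys + ['Secondary_Location'], how='left')
    # Sum the matched df2 Sales per original df1 row; unmatched locations count as 0
    extra = merged.groupby('index')['Sales_2'].sum()
    extra = extra.reindex(df1.index, fill_value=0).astype(int)
    return df1.assign(Sales=df1['Sales'] + extra)

df1 = pd.DataFrame({'Day': [1, 1, 1, 1],
                    'ID': [11, 11, 11, 11],
                    'Year': [2023, 2023, 2022, 2022],
                    'Primary_Location': ['NewYork', 'Berlin', 'Newyork', 'Berlin'],
                    'Secondary_Location': ['Chicago', 'Chicago', 'Chicago', 'Chicago'],
                    'Sales': [100, 300, 150, 400]})

df2 = pd.DataFrame({'Day': [1, 1],
                    'ID': [11, 11],
                    'Year': [2023, 2022],
                    'Primary_Location': ['Chicago', 'Chicago'],
                    'Sales': [150, 300]})

df_primary_output = add_secondary_sales(df1, df2, keys=['Day', 'ID', 'Year'])
print(df_primary_output)
```

Passing `keys=['ID', 'Year']` applies the same helper to the first pair of data frames. Merging on columns rather than joining on an index keeps the row order of df1, whatever order the groups appear in.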
