[英]sum rows from two different data frames based on the value of columns
我有两个数据框
df1
ID Year Primary_Location Secondary_Location Sales
0 11 2023 NewYork Chicago 100
1 11 2023 Lyon Chicago,Paris 200
2 11 2023 Berlin Paris 300
3 12 2022 Newyork Chicago 150
4 12 2022 Lyon Chicago,Paris 250
5 12 2022 Berlin Paris 400
df2
ID Year Primary_Location Sales
0 11 2023 Chicago 150
1 11 2023 Paris 200
2 12 2022 Chicago 300
3 12 2022 Paris 350
我希望每个组具有相同的ID
和Year
:将df2
中的Sales
列添加到df1
中的Sales
中,其中df2
中的Primary_Location
出现(包含)在df1
中的Secondary_Location
中。
例如:对于ID=11
& Year=2023
, Lyon
的Sales
将添加到Sales
的Chicago
Sales
和Paris
df_2
。
该行的Paris
新Sales
为 200+150+200=550。
预期的 output 将是:
df_primary_output
ID Year Primary_Location Secondary_Location Sales
0 11 2023 NewYork Chicago 250
1 11 2023 Lyon Chicago,Paris 550
2 11 2023 Berlin Paris 500
3 12 2022 Newyork Chicago 400
4 12 2022 Lyon Chicago,Paris 900
5 12 2022 Berlin Paris 750
以下是开始的数据帧:
import pandas as pd
df1 = pd.DataFrame({'ID': [11, 11, 11, 12, 12, 12],
'Year': [2023, 2023, 2023, 2022, 2022, 2022],
'Primary_Location': ['NewYork', 'Lyon', 'Berlin', 'Newyork', 'Lyon', 'Berlin'],
'Secondary_Location': ['Chicago', 'Chicago,Paris', 'Paris', 'Chicago', 'Chicago,Paris', 'Paris'],
'Sales': [100, 200, 300, 150, 250, 400]
})
df2 = pd.DataFrame({'ID': [11, 11, 12, 12],
'Year': [2023, 2023, 2022, 2022],
'Primary_Location': ['Chicago', 'Paris', 'Chicago', 'Paris'],
'Sales': [150, 200, 300, 350]
})
编辑:pandas.errors.InvalidIndexError:重新索引仅对具有唯一值的索引对象有效
如果解决方案也适用于这些输入,那就太好了:
df1
Day ID Year Primary_Location Secondary_Location Sales
0 1 11 2023 NewYork Chicago 100
1 1 11 2023 Berlin Chicago 300
2 1 11 2022 Newyork Chicago 150
3 1 11 2022 Berlin Chicago 400
df2
Day ID Year Primary_Location Sales
0 1 11 2023 Chicago 150
1 1 11 2022 Chicago 300
预期的 output 将是:
df_primary_output
Day ID Year Primary_Location Secondary_Location Sales
0 1 11 2023 NewYork Chicago 250
1 1 11 2023 Berlin Chicago 450
2 1 11 2022 Newyork Chicago 450
3 1 11 2022 Berlin Chicago 700
没那么简单你的问题...
建议脚本
import pandas as pd
df1 = pd.DataFrame({'ID': [11, 11, 11, 12, 12, 12],
'Year': [2023, 2023, 2023, 2022, 2022, 2022],
'Primary_Location': ['NewYork', 'Lyon', 'Berlin', 'Newyork', 'Lyon', 'Berlin'],
'Secondary_Location': ['Chicago', 'Chicago,Paris', 'Paris', 'Chicago', 'Chicago,Paris', 'Paris'],
'Sales': [100, 200, 300, 150, 250, 400]
})
df2 = pd.DataFrame({'ID': [11, 11, 12, 12],
'Year': [2023, 2023, 2022, 2022],
'Primary_Location': ['Chicago', 'Paris', 'Chicago', 'Paris'],
'Sales': [150, 200, 300, 350]
})
tot = []
def func(g, iterdf, len_df1, i = 0):
global tot
kv = {g['Primary_Location'].iloc[i]:g['Sales'].iloc[i] for i in range(len(g))}
while i < len_df1:
row = next(iterdf)[1]
# Select specific df1 rows to modify by ID and Year criteria
if g['ID'].iloc[1]==row['ID'] and g['Year'].iloc[1]==row['Year']:
tot.append(row['Sales'] + sum([kv[town] for town in row['Secondary_Location'].split(',') if town in kv]))
i+=1
df2.groupby(['ID', 'Year']).apply(lambda g: func(g, df1.iterrows(), len(df1)))
df1['Sales'] = tot
print(df1)
结果:
ID Year Primary_Location Secondary_Location Sales
0 11 2023 NewYork Chicago 250
1 11 2023 Lyon Chicago,Paris 550
2 11 2023 Berlin Paris 500
3 12 2022 Newyork Chicago 450
4 12 2022 Lyon Chicago,Paris 900
5 12 2022 Berlin Paris 750
kv
返回这样的字典:
call 1 - {'Chicago': 100, 'Paris': 200}
call 2 - {'Chicago': 300, 'Paris': 350}
您确定第 3 行的结果,我的脚本找到 450 而不是 400?
这应该工作:
s = 'Secondary_Location'
(df1.assign(Secondary_Location = lambda x: x[s].str.split(','))
.explode(s)
.join(df2.set_index(['ID','Year','Primary_Location'])['Sales'].rename('Sales_2'),on = ['ID','Year',s])
.groupby(level=0)['Sales_2'].sum()
.add(df1['Sales']))
或者
df3 = (df1.assign(Secondary_Location = df1['Secondary_Location'].str.split(',')) #split Secondary_Location column into list and explode it so each row has one value
.explode('Secondary_Location'))
(df3[['ID','Year','Secondary_Location']].apply(tuple,axis=1) #create a series where ID, Year and Secondary_Location are a combined into a tuple so we can map our series created below to bring in the values needed.
.map(df2.set_index(['ID','Year','Primary_Location'])['Sales']) #create a series with lookup values in index, and make a series by selecting Sales column
.groupby(level=0).sum() #when exploding the column above, the index was repeated, so groupby(level=0).sum() will combine back to original form.
.add(df1['Sales'])) #add in original sales column
原答案:
s = 'Secondary_Location'
(df.assign(Secondary_Location = lambda x: x[s].str.split(','))
.explode(s)
.join(df2.set_index(['ID','Year','Primary_Location'])['Sales'].rename('Sales_2'),on = ['ID','Year',s])
.groupby(level=0)
.agg({**dict.fromkeys(df,'first'),**{s:','.join,'Sales_2':'sum'}})
.assign(Sales = lambda x: x['Sales'] + x['Sales_2'])
.drop('Sales_2',axis=1))
Output:
ID Year Primary_Location Secondary_Location Sales
0 11 2023 NewYork Chicago 250
1 11 2023 Lyon Chicago,Paris 550
2 11 2023 Berlin Paris 500
3 12 2022 Newyork Chicago 450
4 12 2022 Lyon Chicago,Paris 900
5 12 2022 Berlin Paris 750
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.