[英]Pandas: Add a new column in a data frame based on a value in another data frame
I have two data frames one with a userId, and gender and another data frame that has online activities of these users. 我有两个数据框,其中一个具有userId和gender,另一个数据框具有这些用户的在线活动。
First Data Frame (df1)
第一个数据帧(df1)
userId, gender
001, F
002, M
003, F
004, M
005, M
006, M
Second data frame (df2)
第二个数据帧(df2)
userId, itemClicked, ItemBought, date
001, 123182, 123212, 02/02/2016
003, 234256, 123182, 05/02/2016
005, 986834, 234256, 04/19/2016
004, 787663, 787663, 05/12/2016
020, 465738, 465738, 03/20/2016
004, 787223, 787663, 07/12/2016
I want to add gender column to the second data frame by looking up the first data frame based on the userId. 我想通过基于userId查找第一个数据框将性别列添加到第二个数据框。 df2 might have multiple rows per user since its a click data where same user may have click multiple items. df2每个用户可能有多行,因为df2是一个点击数据,同一用户可能有多个项目。
This is very easy to do in MySql but I am trying to figure out to do it in pandas. 这在MySql中很容易做到,但是我试图在熊猫中做到这一点。
for index, row in df2.iterrows():
user_id = row['userId']
if user_id in df1['userId']:
t = df1.loc[df1['userId'] == user_id]
pdb.set_trace()
Is this the pandas way to so such a task? 这是熊猫的方式吗?
print (df1)
userId gender
0 1 F
1 2 M
2 3 F
3 4 M
4 5 M
5 6 M
print (df2)
userId itemClicked ItemBought date
0 1 123182 123212 02/02/2016
1 3 234256 123182 05/02/2016
2 5 986834 234256 04/19/2016
3 4 787663 787663 05/12/2016
4 20 465738 465738 03/20/2016
5 4 787223 787663 07/12/2016
df2['gender'] = df2.userId.map(df1.set_index('userId')['gender'].to_dict())
print (df2)
userId itemClicked ItemBought date gender
0 1 123182 123212 02/02/2016 F
1 3 234256 123182 05/02/2016 F
2 5 986834 234256 04/19/2016 M
3 4 787663 787663 05/12/2016 M
4 20 465738 465738 03/20/2016 NaN
5 4 787223 787663 07/12/2016 M
Another solution with merge
and left join, parameter on
can be omit if only column gender
is same in both DataFrames: 用另一种解决方案merge
和左的连接,参数on
可以忽略,如果只列gender
在两个DataFrames相同:
df = pd.merge(df2, df1, how='left')
print (df)
userId itemClicked ItemBought date gender
0 1 123182 123212 02/02/2016 F
1 3 234256 123182 05/02/2016 F
2 5 986834 234256 04/19/2016 M
3 4 787663 787663 05/12/2016 M
4 20 465738 465738 03/20/2016 NaN
5 4 787223 787663 07/12/2016 M
Timings : 时间 :
#len(df2) = 600k
df2 = pd.concat([df2]*100000).reset_index(drop=True)
def f(df1,df2):
df2['gender'] = df2.userId.map(df1.set_index('userId')['gender'].to_dict())
return df2
In [43]: %timeit f(df1,df2)
10 loops, best of 3: 34.2 ms per loop
In [44]: %timeit (pd.merge(df2, df1, how='left'))
10 loops, best of 3: 102 ms per loop
如果user_id是索引,则可以使用:
df2.join(df1)
You can try this: 您可以尝试以下方法:
for index, row in df1.iterrows():
for ind,r in df2.iterrows():
if r['userId']==row['userId']:
df2.set_value(ind,'Gender',row['gender'])
break
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.