Efficiently add column to Pandas DataFrame with values from another DataFrame
I have a simple database consisting of two tables (say, Items and Users). Users has a column User_ID; Items has a column Item_ID and another column, Its_User_ID, that is a foreign key to User_ID. For instance:
Items
Item_ID  Value_A  Its_User_ID  ...
1        35       1
2        991      1
3        20       2

Users
User_ID  Name   ...
1        Alice
2        John
Imagine I want to denormalize this database, i.e. add the values of the Name column from the Users table to the Items table, for performance reasons when querying the data. My current solution is the following:
items['User_Name'] = pd.Series([users.loc[users['User_ID']==x, 'Name'].iloc[0]
for x in items['Its_User_ID']])
That is, I'm adding the column as a Pandas Series built from a list comprehension, which uses .loc[] to retrieve the names of the users with a specific ID and .iloc[0] to take the first element of the selection (which is the only one, because user IDs are unique).
But this solution is really slow for large sets of items. I did the following tests:
Because this approach is column-wise, its execution time is multiplied by the number of columns for which I repeat this process, and it becomes too expensive. I haven't tried using for loops to fill the new Series row by row, but I expect that to be much more costly. Are there other approaches that I'm overlooking? Is there a solution that takes a few minutes instead of a few hours?
I think it would be more straightforward to use a table merge.
items.merge(users[['User_ID', 'Name']], left_on='Its_User_ID', right_on='User_ID', how='left')
This adds the Name column to the new DataFrame, and you can of course rename it later. It will be much more efficient than doing the operation column by column with a for loop.
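To make the merge concrete, here is a minimal sketch that rebuilds the two sample tables from the question and produces the User_Name column. The variable names merged, denormalized, name_by_id and items_mapped are only illustrative, and the Series.map variant at the end is an alternative that the answer above does not mention; it assumes User_ID is unique, as stated in the question.

```python
import pandas as pd

# Sample data copied from the tables in the question.
items = pd.DataFrame({
    "Item_ID": [1, 2, 3],
    "Value_A": [35, 991, 20],
    "Its_User_ID": [1, 1, 2],
})
users = pd.DataFrame({
    "User_ID": [1, 2],
    "Name": ["Alice", "John"],
})

# A left merge keeps every item; an item whose user ID had no match
# in users would get NaN in the Name column.
merged = items.merge(
    users[["User_ID", "Name"]],
    left_on="Its_User_ID",
    right_on="User_ID",
    how="left",
)

# The merge keeps both key columns, so drop the redundant User_ID
# and rename Name to the column name used in the question.
denormalized = merged.drop(columns="User_ID").rename(columns={"Name": "User_Name"})

# Alternative (not part of the answer above): look the names up through
# a Series indexed by User_ID. This relies on User_ID being unique.
name_by_id = users.set_index("User_ID")["Name"]
items_mapped = items.assign(User_Name=items["Its_User_ID"].map(name_by_id))

print(denormalized)
```

If several columns from users are needed, they can be listed together in the column selection passed to merge, so the lookup is done once rather than once per column.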