Efficiently adding a column to a Pandas DataFrame using values from another DataFrame

Question

I have a simple database consisting of two tables (say, Items and Users). Users has a User_ID column, and Items has an Item_ID column plus another column that is a foreign key to User_ID, for instance:

Items                                       Users
Item_ID  Value_A  Its_User_ID ...           User_ID  Name  ...
1        35       1                         1        Alice
2        991      1                         2        John
3        20       2

Imagine I want to denormalize this database, i.e. add the values of the Name column from the Users table to the Items table, for performance reasons when querying the data. My current solution is the following:

items['User_Name'] = pd.Series([users.loc[users['User_ID']==x, 'Name'].iloc[0] 
                     for x in items['Its_User_ID']])

That is, I'm adding the column as a Pandas Series built from a list comprehension, which uses .loc[] to retrieve the name of the user with a specific ID and .iloc[0] to take the first element of the selection (the only one, because user IDs are unique).

But this solution is really slow for large sets of items. I ran the following tests:

For 1000 items and ~200K users: 20 seconds.
For ~400K items and ~200K users (the real data size): 2.5 hours.

Because this approach is column-wise, its execution time is multiplied by the number of columns I process this way, and it becomes too expensive. I haven't tried using for loops to fill the new Series row by row, but I expect that to be even more costly. Are there other approaches that I'm overlooking? Is there a solution that takes a few minutes instead of a few hours?

Answer 1

I think it would be more straightforward to use a table merge.

items.merge(users[['User_ID', 'Name']], left_on='Its_User_ID', right_on='User_ID', how='left')

This adds the Name column to the new dataset, which you can of course rename later. It will be much more efficient than doing the operation column by column with a for loop.
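
To make the merge above concrete, here is a minimal sketch on the three-row example from the question. The toy data and the final tidy-up step (dropping the duplicated User_ID key and renaming Name to User_Name) are my own assumptions about what the asker wants, not part of the original answer.

```python
import pandas as pd

# Toy versions of the two tables from the question.
items = pd.DataFrame({
    "Item_ID": [1, 2, 3],
    "Value_A": [35, 991, 20],
    "Its_User_ID": [1, 1, 2],
})
users = pd.DataFrame({
    "User_ID": [1, 2],
    "Name": ["Alice", "John"],
})

# Left join: every row of items is kept, and Name is looked up by key.
merged = items.merge(
    users[["User_ID", "Name"]],
    left_on="Its_User_ID",
    right_on="User_ID",
    how="left",
)

# Assumed tidy-up: drop the duplicated key column and rename the new one.
merged = merged.drop(columns="User_ID").rename(columns={"Name": "User_Name"})
print(merged)
```

Note that merge returns a new DataFrame rather than modifying items in place, so the result has to be assigned to a name.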

Answer 2

Use the high-performance database-style operations provided by pandas, see here.

For example:

pd.merge(items, users, left_on='Its_User_ID', right_on='User_ID')
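
One thing worth checking with the call above: pd.merge defaults to an inner join, so items whose Its_User_ID has no match in users are dropped. The sketch below illustrates the difference from how='left'. The fourth item with user ID 99 is a hypothetical row I added for illustration, and validate='many_to_one' is an optional check (not in the original answer) that raises if User_ID is not unique in users.

```python
import pandas as pd

items = pd.DataFrame({
    "Item_ID": [1, 2, 3, 4],
    "Value_A": [35, 991, 20, 7],
    "Its_User_ID": [1, 1, 2, 99],  # 99 is a hypothetical ID with no matching user
})
users = pd.DataFrame({"User_ID": [1, 2], "Name": ["Alice", "John"]})

# Default (inner) join: the unmatched item is silently dropped.
inner = pd.merge(items, users, left_on="Its_User_ID", right_on="User_ID")

# Left join: the unmatched item is kept, with NaN in the user columns.
left = pd.merge(
    items,
    users,
    left_on="Its_User_ID",
    right_on="User_ID",
    how="left",
    validate="many_to_one",  # fails loudly if users has duplicate User_ID values
)

print(len(inner), len(left))
```

If the foreign key is guaranteed to be valid, both joins give the same rows; the left join is the safer choice when that guarantee is uncertain.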
