
Efficiently add column to Pandas DataFrame with values from another DataFrame

I have a simple database consisting of two tables (say, Items and Users). Users has a User_ID column, Items has an Item_ID column, and another column of Items is a foreign key to User_ID. For instance:

Items                                       Users
Item_ID  Value_A  Its_User_ID ...           User_ID  Name  ...
1        35       1                         1        Alice
2        991      1                         2        John
3        20       2  

Imagine I want to denormalize this database, i.e. add the value of column Name from table Users to table Items, for performance reasons when querying the data. My current solution is the following:

items['User_Name'] = pd.Series([users.loc[users['User_ID']==x, 'Name'].iloc[0] 
                     for x in items['Its_User_ID']])

That is, I'm adding the column as a Pandas Series constructed from a list comprehension, which uses .loc[] to retrieve the names of the users with a specific ID, and .iloc[0] to get the first element of the selection (which is the only one because user IDs are unique).

But this solution is really slow for large sets of items. I did the following tests:

  • For 1000 items and ~200K users: 20 seconds.
  • For ~400K items and ~200K users: 2.5 hours (and this is the real data size).

Because this approach is column-wise, its execution time is multiplied by the number of columns for which I'm doing this process, and it becomes too expensive. While I haven't tried using for loops to fill the new Series row by row, I expect that would be much more costly. Are there other approaches that I'm overlooking? Is there a possible solution that takes a few minutes instead of a few hours?

I think it would be more straightforward if you used table merges.

items.merge(users[['User_ID', 'Name']], left_on='Its_User_ID', right_on='User_ID', how='left')

This will add the column Name to the new dataset, which you can of course rename later. This will be much more efficient than doing the operation via a for loop column-wise.
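As a minimal sketch, here is that merge applied to the sample data from the question. The `User_Name` column name is taken from the question; dropping the duplicated `User_ID` key afterwards is an assumption about the desired result, not something the answer prescribes.

```python
import pandas as pd

# Sample tables copied from the question.
items = pd.DataFrame({
    "Item_ID": [1, 2, 3],
    "Value_A": [35, 991, 20],
    "Its_User_ID": [1, 1, 2],
})
users = pd.DataFrame({"User_ID": [1, 2], "Name": ["Alice", "John"]})

# Left join: every item is kept and the matching user's Name is attached.
merged = items.merge(
    users[["User_ID", "Name"]],
    left_on="Its_User_ID",
    right_on="User_ID",
    how="left",
)

# Drop the redundant key column and rename Name to User_Name.
items = merged.drop(columns="User_ID").rename(columns={"Name": "User_Name"})
print(items)
```

To pull in several columns from Users in one pass, add them to the list selected from `users`; the merge is then done once rather than once per column.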

Use the high-performance database-style operations provided by pandas.

For example:

pd.merge(items, users, left_on='Its_User_ID', right_on='User_ID')
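One thing worth checking with this form: `pd.merge` defaults to `how='inner'`, so items whose `Its_User_ID` has no match in Users are dropped from the result. The sketch below illustrates the difference; the fourth item and its ID 99 are made up for the illustration and are not part of the question's data.

```python
import pandas as pd

items = pd.DataFrame({
    "Item_ID": [1, 2, 3, 4],
    "Value_A": [35, 991, 20, 7],
    "Its_User_ID": [1, 1, 2, 99],  # 99 is a hypothetical ID with no matching user
})
users = pd.DataFrame({"User_ID": [1, 2], "Name": ["Alice", "John"]})

# Default inner join: only items with a matching user survive.
inner = pd.merge(items, users, left_on="Its_User_ID", right_on="User_ID")

# Left join: all items survive; unmatched ones get NaN in the user columns.
left = pd.merge(items, users, left_on="Its_User_ID", right_on="User_ID", how="left")

print(len(inner), len(left))
print(left[left["Name"].isna()])
```

If every item is guaranteed to reference an existing user, both joins give the same rows.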
