
Efficiently add column to Pandas DataFrame with values from another DataFrame

I have a simple database consisting of two tables (say, Items and Users). Users has a User_ID column; Items has an Item_ID column and another column that is a foreign key to User_ID, for instance:

Items                                       Users
Item_ID  Value_A  Its_User_ID ...           User_ID  Name  ...
1        35       1                         1        Alice
2        991      1                         2        John
3        20       2  
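
For anyone who wants to reproduce this, the two tables above could be built as DataFrames roughly like this. It is only a sketch: the columns elided with "..." are left out, and the variable names items and users are the ones used in the code further down.

```python
import pandas as pd

# Sample rows copied from the tables above; the "..." columns are omitted.
items = pd.DataFrame({
    "Item_ID": [1, 2, 3],
    "Value_A": [35, 991, 20],
    "Its_User_ID": [1, 1, 2],  # foreign key into users["User_ID"]
})

users = pd.DataFrame({
    "User_ID": [1, 2],
    "Name": ["Alice", "John"],
})

print(items)
print(users)
```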

Imagine I want to denormalize this database, i.e. I want to add the values of the Name column from the Users table to the Items table, for performance reasons when querying the data. My current solution is the following:

items['User_Name'] = pd.Series([users.loc[users['User_ID']==x, 'Name'].iloc[0] 
                     for x in items['Its_User_ID']])

That is, I'm adding the column as a Pandas Series built from a list comprehension, which uses .loc[] to retrieve the name of the user with a specific ID, and .iloc[0] to get the first element of the selection (which is the only one, because user IDs are unique).

But this solution is really slow for large sets of items. I did the following tests:

  • For 1000 items and ~200K users: 20 seconds.
  • For ~400K items and ~200K users: 2.5 hours (this is the real data size).

Because this approach works one column at a time, its execution time is multiplied by the number of columns I apply it to, and it becomes too expensive. I haven't tried using for loops to fill the new Series row by row, but I expect that would be even more costly. Are there other approaches that I'm overlooking? Is there a solution that takes a few minutes instead of a few hours?

I think it would be more straightforward to use a table merge.

items.merge(users[['User_ID', 'Name']], left_on='Its_User_ID', right_on='User_ID', how='left')

This will add the column Name to the new dataset, which you can of course rename later. This will be much more efficient than doing the operation column-wise with a for loop.
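
As a rough end-to-end sketch of the merge above, using the sample rows from the question: the rename to User_Name matches the column name the question uses, and dropping the duplicated User_ID key is my own assumption about what you would want in the final table.

```python
import pandas as pd

# Sample data from the question (the "..." columns are omitted).
items = pd.DataFrame({
    "Item_ID": [1, 2, 3],
    "Value_A": [35, 991, 20],
    "Its_User_ID": [1, 1, 2],
})
users = pd.DataFrame({
    "User_ID": [1, 2],
    "Name": ["Alice", "John"],
})

# A left join keeps every row of items, even if a user ID has no match.
merged = items.merge(
    users[["User_ID", "Name"]],
    left_on="Its_User_ID",
    right_on="User_ID",
    how="left",
)

# merge returns a new DataFrame; items itself is not modified.
# Drop the redundant key column and rename Name to the desired column name.
result = merged.drop(columns="User_ID").rename(columns={"Name": "User_Name"})
print(result)
```

Note that the result has to be assigned to something (here result), since the merge does not change items in place.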

Use the high-performance database-style operations provided by pandas; see here.

For example:

pd.merge(items, users, left_on='Its_User_ID', right_on='User_ID')
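
One thing to be aware of with the call above: pd.merge defaults to how='inner', so items whose Its_User_ID has no matching user are dropped, and passing the whole users table brings along all of its columns. The sketch below shows the row-count difference; item 4 with user ID 3 is a hypothetical row I added purely for illustration and is not in the question's data.

```python
import pandas as pd

items = pd.DataFrame({
    "Item_ID": [1, 2, 3, 4],
    "Value_A": [35, 991, 20, 7],
    "Its_User_ID": [1, 1, 2, 3],  # user 3 does not exist (hypothetical row)
})
users = pd.DataFrame({
    "User_ID": [1, 2],
    "Name": ["Alice", "John"],
})

# Default inner join: the unmatched item is silently dropped.
inner = pd.merge(items, users, left_on="Its_User_ID", right_on="User_ID")

# Left join: every item is kept, and the unmatched one gets NaN for Name.
left = pd.merge(items, users, left_on="Its_User_ID", right_on="User_ID", how="left")

print(len(inner), len(left))
print(left["Name"].isna().sum())
```

If every item is guaranteed to reference an existing user, as in the question, both variants return the same rows.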

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint this content, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 