简体   繁体   中英

fastest way for searching and updating values for every rows in Pandas dataframe

I have a dataframe consists of transaction id, customer name and money spent, which looks like this:

id |  name      |    amount 
1  | Jennifer   |     598
2  | Jennifer   |     765
3  |  Matt      |     134
4  |  George    |     390
5  |  Jennifer  |     554
6  |  Matt      |     75
7  |  Matt      |     830
8  |  Matt      |     20
9  |  Bob       |     786
10 |  Bob       |     280
11 |  Sam       |     236
12 |  Sam       |     226
13 |  Bob       |     720
14 |  Bob       |     431
15 |  Jennifer  |     802
16 |  Ann       |     668
17 |  Sam       |     376
18 |  Jennifer  |     891
19 |  Ann       |     569
20 |  Jennifer  |     452

Now I want to make a new column called "amount1", which is the amount of money each custom spent last time he made a purchase, and the result will look like this:

id  | name     |   amount   |     amount1
1   | Jennifer |    598     |
2   | Jennifer |    765     |      598
3   | Matt     |    134     |
4   | George   |    390     |
5   | Jennifer |    554     |      765
6   | Matt     |    75      |      134
7   | Matt     |    830     |      75
8   | Matt     |    20      |      830
9   | Bob      |    786     |   
10  | Bob      |    280     |      786
11  | Sam      |    236     |   
12  | Sam      |    226     |      236
13  | Bob      |    720     |      786
14  | Bob      |    431     |      720
15  | Jennifer |    802     |      554
16  | Ann      |    668     |   
17  | Sam      |    376     |      226
18  | Jennifer |    891     |      802
19  | Ann      |    569     |      668
20  | Jennifer |    452     |      891

It is just iterating every row and search for all previous purchase record and update 'amount1' with the most recent purchase record.

I have tried with the code below, but i have about 200k rows of data, and it takes few hours to run. What is the most efficient way of doing this task?

df['amount1'] = np.nan 

for index, row in df.iterrows():

  purchase_id = row['id']
  customer_name = row['name']
  amt = df.query('id<@purchase_id and name==@customer')['amount'].values

  if len(amt)>0:
    df.loc[index,'amount1'] = amt[-1]

Use DataFrameGroupBy.shift :

df['amount1'] = df.groupby(['name'])['amount'].shift()
print (df)
    id      name  amount  amount1
0    1  Jennifer     598      NaN
1    2  Jennifer     765    598.0
2    3      Matt     134      NaN
3    4    George     390      NaN
4    5  Jennifer     554    765.0
5    6      Matt      75    134.0
6    7      Matt     830     75.0
7    8      Matt      20    830.0
8    9       Bob     786      NaN
9   10       Bob     280    786.0
10  11       Sam     236      NaN
11  12       Sam     226    236.0
12  13       Bob     720    280.0
13  14       Bob     431    720.0
14  15  Jennifer     802    554.0
15  16       Ann     668      NaN
16  17       Sam     376    226.0
17  18  Jennifer     891    802.0
18  19       Ann     569    668.0
19  20  Jennifer     452    891.0

If need shift only positive amount values use:

s = df['amount'].where(df['amount'] > 0)
df['amount1'] = s.groupby(df['name']).shift()

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM