
Replace null with last non-null value in pandas dataframe

I know this question has been asked several times before, but I am seeing some strange behaviour, hence this question.

Input df

   A         B  C
USA 21-07-2018  
USA 22-07-2018  
USA 23-07-2018  1
USA 24-07-2018  1
USA 25-07-2018  1
USA 26-07-2018  1
USA 27-07-2018  1
USA 28-07-2018  
USA 29-07-2018  
USA 30-07-2018  1
USA 31-07-2018  1
USA 01-08-2018  1
USA 02-08-2018  1
USA 03-08-2018  1
USA 04-08-2018  
USA 05-08-2018  
USA 06-08-2018  1
USA 07-08-2018  1
USA 08-08-2018  1
USA 09-08-2018  1
USA 10-08-2018  1
USA 11-08-2018  
USA 12-08-2018  
USA 13-08-2018  1
USA 14-08-2018  1
USA 15-08-2018  1
USA 16-08-2018  1
USA 17-08-2018  1
USA 18-08-2018  
USA 19-08-2018

I tried the following two methods:

1st Method

df['C'] = df['C'].fillna(method='ffill')

2nd Method

df['C'] = df['C'].ffill()

Both of them produced the same dataframe (Output_df):

  A          B  C
USA 21-07-2017  1
USA 22-07-2017  3010.77
USA 23-07-2017  3010.77
USA 24-07-2017  1
USA 25-07-2017  1
USA 26-07-2017  1
USA 27-07-2017  1
USA 28-07-2017  1
USA 29-07-2017  2995.23
USA 30-07-2017  2995.23
USA 31-07-2017  1
USA 01-08-2017  1
USA 02-08-2017  1
USA 03-08-2017  1
USA 04-08-2017  1
USA 05-08-2017  2974.39
USA 06-08-2017  2974.39
USA 07-08-2017  1
USA 08-08-2017  1
USA 09-08-2017  1
USA 10-08-2017  1
USA 11-08-2017  1

Why am I getting values like 3010.77, 2974.39, etc.? Is something being averaged somewhere (the input df is quite large, >25k rows)?

What I expected it to be (Expected_df):

  A          B  C
USA 21-07-2018  1
USA 22-07-2018  1
USA 23-07-2018  1
USA 24-07-2018  1
USA 25-07-2018  1
USA 26-07-2018  1
USA 27-07-2018  1
USA 28-07-2018  1
USA 29-07-2018  1
USA 30-07-2018  1
USA 31-07-2018  1
USA 01-08-2018  1
USA 02-08-2018  1
USA 03-08-2018  1
USA 04-08-2018  1
USA 05-08-2018  1
USA 06-08-2018  1
USA 07-08-2018  1
USA 08-08-2018  1
USA 09-08-2018  1
USA 10-08-2018  1
USA 11-08-2018  1
USA 12-08-2018  1
USA 13-08-2018  1
USA 14-08-2018  1
USA 15-08-2018  1
USA 16-08-2018  1
USA 17-08-2018  1
USA 18-08-2018  1
USA 19-08-2018  1

Just to give another example of my expected output

Input df

  A          B         C
AUS 21-07-2017  1.262584
AUS 22-07-2017  
AUS 23-07-2017  
AUS 24-07-2017  1.258671
AUS 25-07-2017  1.256456
AUS 26-07-2017  1.263913
AUS 27-07-2017  1.249957
AUS 28-07-2017  1.256032
AUS 29-07-2017  
AUS 30-07-2017  
AUS 31-07-2017  1.254626
AUS 01-08-2017  1.254064
AUS 02-08-2017  1.255136
AUS 03-08-2017  1.259949
AUS 04-08-2017  1.254466
AUS 05-08-2017  
AUS 06-08-2017  
AUS 07-08-2017  1.263796
AUS 08-08-2017  1.259692
AUS 09-08-2017  1.268349
AUS 10-08-2017  1.269008
AUS 11-08-2017  1.271738

(Expected) Output df

  A          B         C
AUS 21-07-2017  1.262584
AUS 22-07-2017  1.262584
AUS 23-07-2017  1.262584
AUS 24-07-2017  1.258671
AUS 25-07-2017  1.256456
AUS 26-07-2017  1.263913
AUS 27-07-2017  1.249957
AUS 28-07-2017  1.256032
AUS 29-07-2017  1.256032
AUS 30-07-2017  1.256032
AUS 31-07-2017  1.254626
AUS 01-08-2017  1.254064
AUS 02-08-2017  1.255136
AUS 03-08-2017  1.259949
AUS 04-08-2017  1.254466
AUS 05-08-2017  1.254466
AUS 06-08-2017  1.254466
AUS 07-08-2017  1.263796
AUS 08-08-2017  1.259692
AUS 09-08-2017  1.268349
AUS 10-08-2017  1.269008
AUS 11-08-2017  1.271738

I think the blank cells in your column contain whitespace rather than real nulls, so ffill() has nothing to fill. You need to replace them with numpy.nan first.
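One way to check this before changing anything is to count how many cells are real nulls and how many are blank strings. The snippet below is only a sketch on a small made-up frame (the sample values and the names n_missing and n_blank are assumptions for illustration, not taken from the question's data):

```python
import pandas as pd

# Hypothetical sample: the "empty" cells are whitespace-only strings, not NaN.
df = pd.DataFrame({
    "A": ["USA"] * 4,
    "B": ["21-07-2018", "22-07-2018", "23-07-2018", "24-07-2018"],
    "C": [" ", "", 1, 1],
})

# isna() only counts real missing values, so blank strings are not included.
n_missing = int(df["C"].isna().sum())

# Count cells that are empty or whitespace-only strings.
is_blank = df["C"].map(lambda v: isinstance(v, str) and v.strip() == "")
n_blank = int(is_blank.sum())

print(df["C"].dtype)        # a column mixing strings and numbers is stored as object
print(n_missing, n_blank)   # 0 2
```

If n_missing is 0 while n_blank is not, the column has no nulls as far as pandas is concerned, which would explain why ffill() leaves the blanks untouched.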

If you are unsure how many whitespace characters each blank cell contains, a regex that matches any whitespace-only string covers all cases:

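import numpy as np
# Replace empty or whitespace-only strings with NaN and assign the result back
# (inplace=True on a column selection may not update df in newer pandas versions)
df['C'] = df['C'].replace(r'^\s*$', np.nan, regex=True)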

Then use ffill() to get the expected behaviour:

df['C'] = df['C'].ffill()
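Putting the two steps together, here is a minimal end-to-end sketch. It assumes the column was loaded as strings with whitespace-only blanks (the sample rows are a shortened, made-up version of the AUS example), and the pd.to_numeric call is an extra step not in the answer above, added on the assumption that you want a float column afterwards:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: values arrive as strings, blanks as "" or " ".
df = pd.DataFrame({
    "A": ["AUS"] * 5,
    "B": ["21-07-2017", "22-07-2017", "23-07-2017", "24-07-2017", "25-07-2017"],
    "C": ["1.262584", " ", "", "1.258671", "1.256456"],
})

# Step 1: turn empty / whitespace-only strings into real missing values.
df["C"] = df["C"].replace(r"^\s*$", np.nan, regex=True)

# Optional step (an addition here): convert the remaining strings to floats.
df["C"] = pd.to_numeric(df["C"])

# Step 2: forward-fill, so each missing cell takes the last non-null value above it.
df["C"] = df["C"].ffill()

print(df)
```

After this, the rows for 22-07-2017 and 23-07-2017 carry the 21-07-2017 value forward. If the frame holds several countries in column A, a plain ffill() will also carry values across country boundaries; df.groupby('A')['C'].ffill() is one way to avoid that, assuming that is the behaviour you want.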


 