Citizens of StackOverflow,
I am currently iterating over a dataframe that can be millions of rows long. Each row has leading NaNs (desired), followed by values. I want to keep only the first X values in each row and replace everything after that with NaN. Effectively, I want a window of X values beginning at the first non-NaN; every other position in the row should be NaN.
My solution is very slow. Additionally, the similar questions I found weren't sufficiently helpful (most concern only the first/last NaN).
An example where the window size is 3:
import pandas as pd
import numpy as np

x = 3
data = {'2018Q3': [0, np.nan, np.nan, np.nan, np.nan],
        '2018Q4': [1, np.nan, np.nan, np.nan, 10],
        '2019Q1': [2, 3, np.nan, np.nan, 12],
        '2019Q2': [3, 4, np.nan, 8, 14],
        '2019Q3': [4, 5, np.nan, 9, 22]}
df = pd.DataFrame.from_dict(data)
print(df)
2018Q3 2018Q4 2019Q1 2019Q2 2019Q3
0 0.0 1.0 2.0 3.0 4.0
1 NaN NaN 3.0 4.0 5.0
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN 8.0 9.0
4 NaN 10.0 12.0 14.0 22.0
Results should look like this:
2018Q3 2018Q4 2019Q1 2019Q2 2019Q3
0 0.0 1.0 2.0 NaN NaN
1 NaN NaN 3.0 4.0 5.0
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN 8.0 9.0
4 NaN 10.0 12.0 14.0 NaN
MY SOLUTION:
def cut_excess_forecast(num_x, dataf):
    Total_Col = len(dataf.columns.values)   # total columns
    df_NEW = pd.DataFrame()
    for index, row in dataf.iterrows():
        nas = row.isnull().sum()            # number of nulls in this row
        good_data = nas + num_x             # number of columns that should be untouched
        if good_data >= Total_Col:          # if columns to leave untouched >= available columns, pass
            pass                            # all data available is needed
        else:
            cutoff = Total_Col - good_data
            row[-cutoff:] = np.nan          # change excess columns in this row to NaN
        df_NEW = df_NEW.append(row.copy())  # append changed row to new dataframe
    df_NEW.index = dataf.index              # move the original index over to the new dataframe
    return df_NEW.copy()

df2 = cut_excess_forecast(x, df)
print(df2)
Sorting is allowed, as long as the index is untouched. Cheers, and thanks in advance.
Try:
df.where(df.notna().cumsum(1)<4)
Output:
2018Q3 2018Q4 2019Q1 2019Q2 2019Q3
0 0.0 1.0 2.0 NaN NaN
1 NaN NaN 3.0 4.0 5.0
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN 8.0 9.0
4 NaN 10.0 12.0 14.0 NaN
Explanation:

df.notna() marks NaN values as False and non-NaN values as True:

  2018Q3 2018Q4 2019Q1 2019Q2 2019Q3
0   True   True   True   True   True
1  False  False   True   True   True
2  False  False  False  False  False
3  False  False  False   True   True
4  False   True   True   True   True

cumsum(1) counts the non-NaN values in each row from left to right:

   2018Q3  2018Q4  2019Q1  2019Q2  2019Q3
0       1       2       3       4       5
1       0       0       1       2       3
2       0       0       0       0       0
3       0       0       0       1       2
4       0       1       2       3       4

< 4 marks the positions where the running count has passed the window size of 3 (i.e., where the count is 4 or more) as False:

  2018Q3 2018Q4 2019Q1 2019Q2 2019Q3
0   True   True   True  False  False
1   True   True   True   True   True
2   True   True   True   True   True
3   True   True   True   True   True
4   True   True   True   True  False

.where then replaces the cells that are False with np.nan.
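Since the threshold 4 is just the window size plus one, the same idea can be written directly in terms of the window variable x from the question, using `<= x` instead of the hard-coded `< 4`. A minimal self-contained sketch:

```python
import numpy as np
import pandas as pd

x = 3  # window size: keep only the first x non-NaN values per row

data = {'2018Q3': [0, np.nan, np.nan, np.nan, np.nan],
        '2018Q4': [1, np.nan, np.nan, np.nan, 10],
        '2019Q1': [2, 3, np.nan, np.nan, 12],
        '2019Q2': [3, 4, np.nan, 8, 14],
        '2019Q3': [4, 5, np.nan, 9, 22]}
df = pd.DataFrame(data)

# notna() flags values, cumsum(axis=1) numbers them left to right within each
# row, and where() keeps only cells whose running count is within the window
result = df.where(df.notna().cumsum(axis=1) <= x)
print(result)
```

Because everything is vectorized, this avoids the per-row iterrows() loop entirely and should scale to millions of rows.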