How can I drop NaN values as well as nearby non-NaN values from a df?
I have large CSVs (~100k rows x 30 cols). Occasionally the data has sections of nan values which span sections of the df of various sizes. I need to drop the nans, but also ~3 data points either side, because the non-nan data on either side is borked.
One could drop any row containing a nan, but this would throw away more data than needs to be.
How can I do this with Python? The data has been loaded into a df.
Use:
df = pd.DataFrame({'col':['a','b','c', np.nan, 'd','e',np.nan, 's','r'],
'col1':4})
print (df)
col col1
0 a 4
1 b 4
2 c 4
3 NaN 4
4 d 4
5 e 4
6 NaN 4
7 s 4
8 r 4
#test for at least one missing value per row
m = df.isna().any(axis=1)
#test the rows above and below against the mask, chained by | for bitwise OR
#filter by the inverted mask ~ in boolean indexing
df = df[~(m | m.shift(fill_value=False) | m.shift(-1, fill_value=False))]
print (df)
col col1
0 a 4
1 b 4
8 r 4
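The shifts above only drop one row on each side of a nan. Since the question asks for ~3 rows either side, here is a sketch of the same idea generalized to a configurable width (k is an assumed parameter name, not from the answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['a', 'b', 'c', np.nan, 'd', 'e', np.nan, 's', 'r'],
                   'col1': 4})

k = 1  # rows to drop on each side of a NaN row (use k=3 for the question's data)
m = df.isna().any(axis=1)
# OR together the mask shifted by -k..k, so each NaN row also flags its neighbours
bad = m.copy()
for i in range(1, k + 1):
    bad |= m.shift(i, fill_value=False) | m.shift(-i, fill_value=False)
print(df[~bad])
```

With k=1 this reproduces the output above (rows 0, 1 and 8); raising k simply widens the excluded band around each nan.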
Alternative solution:
m = df.notna().all(axis=1)
df = df[(m & m.shift(fill_value=True) & m.shift(-1, fill_value=True))]
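The chain of & and shift can also be written with a centred rolling window, which avoids spelling out one shift per offset when the width grows. A sketch under the same setup (min_periods=1 keeps the edge rows from becoming NaN):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['a', 'b', 'c', np.nan, 'd', 'e', np.nan, 's', 'r'],
                   'col1': 4})

k = 1  # a row is kept only if every row within k of it is complete
m = df.notna().all(axis=1)
# rolling min over a window of 2*k+1 rows is 0 wherever any neighbour has a NaN
keep = m.rolling(2 * k + 1, center=True, min_periods=1).min().astype(bool)
print(df[keep])
```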
Here is another way, in case the number of rows to look above and below might change.
l = 1
# widen the NaN mask by l rows in each direction via replace's fill method,
# then invert it (note: the method parameter of replace is deprecated in pandas >= 2.1)
(df.loc[~df.isna().any(axis=1)
          .replace(False, None, method='ffill', limit=l)
          .replace(False, None, method='bfill', limit=l)])
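Because the method argument of replace is deprecated in recent pandas, here is a sketch of the same mask-widening written with where/ffill/bfill instead, which should behave identically on this example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['a', 'b', 'c', np.nan, 'd', 'e', np.nan, 's', 'r'],
                   'col1': 4})

l = 1
m = df.isna().any(axis=1)
# keep only the True entries, then spread them up to l rows down and up
wide = m.where(m).ffill(limit=l).bfill(limit=l).fillna(False).astype(bool)
print(df.loc[~wide])
```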