Pandas: find first NaN value by rows and return column name
I have a dataframe like this:
>>> import pandas as pd
>>> df1 = pd.DataFrame({'A': ['1', '2', '3', '4', '5'],
...                     'B': ['1', '1', '1', '1', '1'],
...                     'C': ['c', 'A1', None, 'c3', None],
...                     'D': ['d0', 'B1', 'B2', None, 'B4'],
...                     'E': ['A', None, 'S', None, 'S'],
...                     'F': ['3', '4', '5', '6', '7'],
...                     'G': ['2', '2', None, '2', '2']})
>>> df1
A B C D E F G
0 1 1 c d0 A 3 2
1 2 1 A1 B1 None 4 2
2 3 1 None B2 S 5 None
3 4 1 c3 None None 6 2
4 5 1 None B4 S 7 2
and I drop the rows that contain NaN values with df2 = df1.dropna(). The rows that get removed are:
A B C D E F G
1 2 1 A1 B1 None 4 2
2 3 1 None B2 S 5 None
3 4 1 c3 None None 6 2
4 5 1 None B4 S 7 2
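As context for the table above, the removed rows can also be reproduced with a boolean mask instead of comparing against the dropna result (a minimal sketch using the question's data):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['1', '2', '3', '4', '5'],
                    'B': ['1', '1', '1', '1', '1'],
                    'C': ['c', 'A1', None, 'c3', None],
                    'D': ['d0', 'B1', 'B2', None, 'B4'],
                    'E': ['A', None, 'S', None, 'S'],
                    'F': ['3', '4', '5', '6', '7'],
                    'G': ['2', '2', None, '2', '2']})

mask = df1.isnull().any(axis=1)  # True for rows dropna would remove
dropped = df1[mask]              # the four rows shown above
kept = df1.dropna()              # only row 0 survives

print(list(dropped.index))  # [1, 2, 3, 4]
print(list(kept.index))     # [0]
```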
These are the rows that were dropped because they contain NaN values. However, I want to know why each one was dropped: which column holds the first NaN value that caused the row to be removed? I need the drop reason for a report.
The output should be:
['E','C','D','C']
I know I can run dropna on each column and record the column as the reason, but that is really inefficient.
Is there a more efficient way to solve this problem? Thank you.
I think you can create a boolean DataFrame with DataFrame.isnull, then filter it with boolean indexing using a mask of rows that contain at least one True (via any), and finally apply idxmax, which gives you the column name of the first True value in each row:
booldf = df1.isnull()
print (booldf)
A B C D E F G
0 False False False False False False False
1 False False False False True False False
2 False False True False False False True
3 False False False True True False False
4 False False True False False False False
print (booldf.any(axis=1))
0 False
1 True
2 True
3 True
4 True
dtype: bool
print (booldf[booldf.any(axis=1)].idxmax(axis=1))
1 E
2 C
3 D
4 C
dtype: object
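Putting the steps above together, the whole lookup collapses into two lines (a self-contained sketch reproducing the question's data):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['1', '2', '3', '4', '5'],
                    'B': ['1', '1', '1', '1', '1'],
                    'C': ['c', 'A1', None, 'c3', None],
                    'D': ['d0', 'B1', 'B2', None, 'B4'],
                    'E': ['A', None, 'S', None, 'S'],
                    'F': ['3', '4', '5', '6', '7'],
                    'G': ['2', '2', None, '2', '2']})

booldf = df1.isnull()                                 # True wherever a value is missing
reasons = booldf[booldf.any(axis=1)].idxmax(axis=1)   # first True column per affected row

print(reasons.tolist())  # ['E', 'C', 'D', 'C']
```

idxmax works here because True > False, so the first True in a row is its first maximum; the boolean mask excludes row 0, which has no missing values at all.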
I would use a combination of itertools and numpy.where, along with pd.DataFrame.isnull:
>>> df1.isnull()
A B C D E F G
0 False False False False False False False
1 False False False False True False False
2 False False True False False False True
3 False False False True True False False
4 False False True False False False False
>>> import numpy as np
>>> from itertools import *
>>> r,c = np.where(df1.isnull().values)
>>> first_cols = [next(g)[1] for _, g in groupby(izip(r,c), lambda t:t[0])]
>>> df1.columns[first_cols]
Index([u'E', u'C', u'D', u'C'], dtype='object')
>>>
For Python 2, use izip from itertools; in Python 3, simply use the built-in zip.
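For completeness, here is the same approach as a runnable Python 3 script (zip instead of izip), again using the question's sample data:

```python
import numpy as np
import pandas as pd
from itertools import groupby

df1 = pd.DataFrame({'A': ['1', '2', '3', '4', '5'],
                    'B': ['1', '1', '1', '1', '1'],
                    'C': ['c', 'A1', None, 'c3', None],
                    'D': ['d0', 'B1', 'B2', None, 'B4'],
                    'E': ['A', None, 'S', None, 'S'],
                    'F': ['3', '4', '5', '6', '7'],
                    'G': ['2', '2', None, '2', '2']})

# Row and column indices of every missing value, in row-major order.
r, c = np.where(df1.isnull().values)

# Group the (row, col) pairs by row and keep the first column index of each group.
first_cols = [next(g)[1] for _, g in groupby(zip(r, c), lambda t: t[0])]

print(list(df1.columns[first_cols]))  # ['E', 'C', 'D', 'C']
```

Because np.where scans the array row by row, groupby sees each affected row's missing-value columns in order, and next(g) pulls out the first one.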