
Pandas: find the first NaN value in each row and return the column name

I have a DataFrame like this:

>>> import pandas as pd
>>> df1 = pd.DataFrame({'A': ['1', '2', '3', '4', '5'],
...                     'B': ['1', '1', '1', '1', '1'],
...                     'C': ['c', 'A1', None, 'c3', None],
...                     'D': ['d0', 'B1', 'B2', None, 'B4'],
...                     'E': ['A', None, 'S', None, 'S'],
...                     'F': ['3', '4', '5', '6', '7'],
...                     'G': ['2', '2', None, '2', '2']})
>>> df1

   A  B     C     D     E  F     G
0  1  1     c    d0     A  3     2
1  2  1    A1    B1  None  4     2
2  3  1  None    B2     S  5  None
3  4  1    c3  None  None  6     2
4  5  1  None    B4     S  7     2

I then drop the rows that contain NaN values with df2 = df1.dropna(). The dropped rows are:

   A  B     C     D     E  F     G   
1  2  1    A1    B1  None  4     2
2  3  1  None    B2     S  5  None
3  4  1    c3  None  None  6     2
4  5  1  None    B4     S  7     2

Each of these rows was dropped because it contains at least one NaN value. However, I want to know why each row was dropped: which column holds the first NaN value that caused the row to be dropped? I need a drop reason for a report.

The output should be:

['E','C','D','C']

I know I can call dropna on each column in turn and record that column as the reason, but that is really inefficient.

Is there a more efficient way to solve this problem? Thank you.

I think you can create a boolean DataFrame with DataFrame.isnull, then use boolean indexing with a mask built by any to keep only the rows that have at least one True, and finally call idxmax, which returns the column name of the first True value in each row:

booldf = df1.isnull()
print (booldf)
       A      B      C      D      E      F      G
0  False  False  False  False  False  False  False
1  False  False  False  False   True  False  False
2  False  False   True  False  False  False   True
3  False  False  False   True   True  False  False
4  False  False   True  False  False  False  False

print (booldf.any(axis=1))
0    False
1     True
2     True
3     True
4     True
dtype: bool

print (booldf[booldf.any(axis=1)].idxmax(axis=1))
1    E
2    C
3    D
4    C
dtype: object

I would use a combination of itertools and numpy.where, along with pd.DataFrame.isnull:

>>> df1.isnull()
       A      B      C      D      E      F      G
0  False  False  False  False  False  False  False
1  False  False  False  False   True  False  False
2  False  False   True  False  False  False   True
3  False  False  False   True   True  False  False
4  False  False   True  False  False  False  False
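>>> import numpy as np
>>> from itertools import groupby
>>> # np.where lists positions in row-major order, so the first pair
>>> # in each row group is that row's first NaN
>>> r, c = np.where(df1.isnull().values)
>>> first_cols = [next(g)[1] for _, g in groupby(zip(r, c), lambda t: t[0])]
>>> df1.columns[first_cols]
Index(['E', 'C', 'D', 'C'], dtype='object')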

In Python 2, use izip from itertools; in Python 3, use the built-in zip instead.
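
A related variant, not taken from either answer above, avoids the Python-level grouping by calling argmax on the boolean NumPy array. This is only a sketch: it assumes Python 3 and a pandas version that provides DataFrame.to_numpy, and it relies on argmax returning the position of the first True in each row.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': ['1', '2', '3', '4', '5'],
                    'B': ['1', '1', '1', '1', '1'],
                    'C': ['c', 'A1', None, 'c3', None],
                    'D': ['d0', 'B1', 'B2', None, 'B4'],
                    'E': ['A', None, 'S', None, 'S'],
                    'F': ['3', '4', '5', '6', '7'],
                    'G': ['2', '2', None, '2', '2']})

mask = df1.isnull().to_numpy()       # 2-D boolean array, True where a value is missing
rows_with_nan = mask.any(axis=1)     # keep only rows that contain a missing value
# argmax on a boolean row gives the position of its first True
first_pos = mask[rows_with_nan].argmax(axis=1)

first_cols = df1.columns[first_pos].tolist()
print(first_cols)  # ['E', 'C', 'D', 'C']
```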
