
First non-null value per row from a list of Pandas columns

If I've got a DataFrame in pandas which looks something like:

    A   B   C
0   1 NaN   2
1 NaN   3 NaN
2 NaN   4   5
3 NaN NaN NaN

How can I get the first non-null value from each row? E.g. for the above, I'd like to get: [1, 3, 4, None] (or an equivalent Series).

Back-fill the NaNs along each row (so every gap takes the next valid value to its right), then take the leftmost column:

# df.fillna(method='bfill', axis=1) is deprecated in recent pandas; df.bfill is the equivalent
df.bfill(axis=1).iloc[:, 0]
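Since the question asks about a list of columns, here is a minimal sketch of the same back-fill idea restricted to a chosen column order. The `cols` list is a hypothetical priority order made up for illustration, not something from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, np.nan, np.nan, np.nan],
    "B": [np.nan, 3, 4, np.nan],
    "C": [2, np.nan, 5, np.nan],
})

# Hypothetical priority order: check C first, then A, then B.
cols = ["C", "A", "B"]

# Back-fill along each row of the reordered columns, then keep the leftmost one.
first = df[cols].bfill(axis=1).iloc[:, 0]
print(first.tolist())  # [2.0, 3.0, 5.0, nan]
```

Reordering the columns before the back-fill is what decides which value counts as "first"; rows that are entirely NaN stay NaN.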

This is a really messy way to do this: first use first_valid_index to get the valid columns, convert the returned Series to a DataFrame so we can call apply row-wise, and use this to index back into the original df:

In [160]:
def func(x):
    # x holds the first valid column label for this row (None if the row is all NaN)
    if x.values[0] is None:
        return None
    else:
        return df.loc[x.name, x.values[0]]
pd.DataFrame(df.apply(lambda x: x.first_valid_index(), axis=1)).apply(func,axis=1)
0     1
1     3
2     4
3   NaN
dtype: float64


A slightly cleaner way:

In [12]:
def func(x):
    # x is one row of df; first_valid_index() is None when the row is all NaN
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]
df.apply(func, axis=1)

0     1
1     3
2     4
3   NaN
dtype: float64

I'm going to weigh in here as I think this is a good deal faster than any of the proposed methods. argmin gives the index of the first False value in each row of the result of np.isnan in a vectorized way, which is the hard part. It still relies on a Python loop to extract the values, but the lookup is very quick:

def get_first_non_null(df):
    a = df.values
    col_index = np.isnan(a).argmin(axis=1)
    return [a[row, col] for row, col in enumerate(col_index)]
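To make the argmin step concrete, here is a small sketch on the example frame from the question. It assumes all columns are float, since np.isnan does not accept object arrays; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, np.nan, np.nan, np.nan],
    "B": [np.nan, 3, 4, np.nan],
    "C": [2, np.nan, 5, np.nan],
})

a = df.to_numpy()
mask = np.isnan(a)               # True where a value is missing

# On a boolean array, argmin returns the position of the first False,
# i.e. the first non-NaN column. An all-True row yields 0.
col_index = mask.argmin(axis=1)
print(col_index)                 # [0 1 1 0]

picked = [a[row, col] for row, col in enumerate(col_index)]
print(picked)
```

The last row is all NaN, so argmin falls back to column 0 and the value picked there is itself NaN, which is why no special case is needed.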

EDIT: Here's a fully vectorized solution which can be a good deal faster again, depending on the shape of the input. Updated benchmarking below.

def get_first_non_null_vec(df):
    a = df.values
    n_rows, n_cols = a.shape
    col_index = np.isnan(a).argmin(axis=1)
    flat_index = n_cols * np.arange(n_rows) + col_index
    return a.ravel()[flat_index]
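The flat-index arithmetic may be easier to follow with numbers. The sketch below works through it on the 4x3 example frame and compares it against plain pair-wise indexing; the names `via_flat` and `via_pairs` are made up for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, np.nan, np.nan, np.nan],
    "B": [np.nan, 3, 4, np.nan],
    "C": [2, np.nan, 5, np.nan],
})

a = df.to_numpy()
n_rows, n_cols = a.shape                  # 4 rows, 3 columns
col_index = np.isnan(a).argmin(axis=1)    # [0 1 1 0]

# Element (row, col) sits at position row * n_cols + col in the
# row-major flattened array returned by ravel().
flat_index = n_cols * np.arange(n_rows) + col_index
print(flat_index)                         # [0 4 7 9]

via_flat = a.ravel()[flat_index]
via_pairs = a[np.arange(n_rows), col_index]
print(via_flat, via_pairs)
```

Both forms select the same elements; which one is faster seems to depend on the shape of the input, as the timings in this thread suggest.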

If a row is completely null then the corresponding value will be null also. Here's some benchmarking against unutbu's solution:

df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 220 ms per loop
100 loops, best of 3: 16.2 ms per loop
100 loops, best of 3: 12.6 ms per loop
In [109]:

df = pd.DataFrame(np.random.choice([1, np.nan], (100000, 150), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 246 ms per loop
10 loops, best of 3: 48.2 ms per loop
100 loops, best of 3: 15.7 ms per loop

df = pd.DataFrame(np.random.choice([1, np.nan], (1000000, 15), p=(0.01, 0.99)))
%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 326 ms per loop
1 loops, best of 3: 326 ms per loop
10 loops, best of 3: 35.7 ms per loop

Here is another way to do it:

In [183]: df.stack().groupby(level=0).first().reindex(df.index)
0     1
1     3
2     4
3   NaN
dtype: float64

The idea here is to use stack to move the columns into a row index level:

In [184]: df.stack()
0  A    1
   C    2
1  B    3
2  B    4
   C    5
dtype: float64

Now, if you group by the first row level (i.e. the original index) and take the first value from each group, you essentially get the desired result:

In [185]: df.stack().groupby(level=0).first()
0    1
1    3
2    4
dtype: float64

All we need to do is reindex the result (using the original index) so as to include rows that are completely NaN:
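The three steps described above can be sketched end to end as follows. This is a self-contained illustration on the example frame; the intermediate names are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, np.nan, np.nan, np.nan],
    "B": [np.nan, 3, 4, np.nan],
    "C": [2, np.nan, 5, np.nan],
})

stacked = df.stack()                       # columns become an inner index level
first = stacked.groupby(level=0).first()   # first non-null value per original row
result = first.reindex(df.index)           # restore rows that had no valid value
print(result)
```

groupby(...).first() skips nulls by itself, so the reindex only matters for rows that drop out of the stacked Series entirely.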


This is nothing new, but it's a combination of the best bits of @yangie's approach with a list comprehension, and @EdChum's df.apply approach, which I think is easiest to understand.

First, which columns do we want to pick our values from?

In [95]: pick_cols = df.apply(pd.Series.first_valid_index, axis=1)

In [96]: pick_cols
0       A
1       B
2       B
3    None
dtype: object

Now how do we pick the values?

In [100]: [df.loc[k, v] if v is not None else None
    ....:     for k, v in pick_cols.items()]
Out[100]: [1.0, 3.0, 4.0, None]

This is OK, but we really want the index to match that of the original DataFrame:

In [98]: pd.Series({k:df.loc[k, v] if v is not None else None
   ....:     for k, v in pick_cols.items()})
0     1
1     3
2     4
3   NaN
dtype: float64

groupby in axis=1

If we pass a callable that returns the same value for every label, we group all columns together. This allows us to use groupby.agg, which gives us the first method that makes this easy:

df.groupby(lambda x: 'Z', axis=1).first()

     Z
0  1.0
1  3.0
2  4.0
3  NaN

This returns a DataFrame whose column name is whatever the callable returned.
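Grouping along axis=1 is deprecated in recent pandas releases, so here is a hedged sketch of an equivalent that transposes first and groups the (former) column labels as rows. It is a substitute for the call above, not the original author's code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, np.nan, np.nan, np.nan],
    "B": [np.nan, 3, 4, np.nan],
    "C": [2, np.nan, 5, np.nan],
})

# After transposing, the column labels A, B, C are the row index.
# The callable maps every label to 'Z', so all of them land in one group,
# and first() takes the first non-null value per original row.
out = df.T.groupby(lambda label: "Z").first().T
print(out)
```

The transpose copies the data, so on large frames this is likely slower than the NumPy-based answers in this thread.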

lookup, notna, and idxmax

df.lookup(df.index, df.notna().idxmax(1))

array([ 1.,  3.,  4., nan])
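DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so the line above only runs on older versions. A possible replacement, sketched here with illustrative variable names, converts the labels from idxmax into positions and indexes the underlying array:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, np.nan, np.nan, np.nan],
    "B": [np.nan, 3, 4, np.nan],
    "C": [2, np.nan, 5, np.nan],
})

# Label of the first non-null column per row; an all-null row gives the first column.
labels = df.notna().idxmax(axis=1)

# Translate column labels into integer positions, then pick one value per row.
col_pos = df.columns.get_indexer(labels)
values = df.to_numpy()[np.arange(len(df)), col_pos]
print(values)
```

For an all-null row, idxmax falls back to the first column, and the value found there is NaN, so the result matches the lookup output shown above.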

argmin and slicing

v = df.values
v[np.arange(len(df)), np.isnan(v).argmin(1)]

array([ 1.,  3.,  4., nan])

Here is a one-line solution:

[row[row.first_valid_index()] if row.first_valid_index() is not None else None for _, row in df.iterrows()]


This solution iterates over the rows of df. row.first_valid_index() returns the label of the first non-NA/null value, which is then used as the index to get the first non-null item in each row.

If there is no non-null value in the row, row.first_valid_index() is None and cannot be used as an index, so I need an if-else expression.

I packed everything into a list comprehension for brevity.
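One detail worth illustrating: the None check should be an explicit comparison rather than a truthiness test, because a valid column label can be falsy. The sketch below uses a made-up frame with default integer column labels 0 and 1; the helper name `first_non_null` is hypothetical.

```python
import numpy as np
import pandas as pd

# Default integer column labels: 0 and 1.
df0 = pd.DataFrame([[1.0, np.nan], [np.nan, 2.0], [np.nan, np.nan]])

def first_non_null(row):
    label = row.first_valid_index()
    # Compare with None explicitly: the label 0 is valid but falsy.
    return row.loc[label] if label is not None else None

result = [first_non_null(row) for _, row in df0.iterrows()]
print(result)  # [1.0, 2.0, None]
```

With a bare `if label` test, the first row would wrongly come back as None, because its first valid label is 0.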

JoeCondron's answer (EDIT: before his last edit!) is cool, but there is margin for significant improvement by avoiding the non-vectorized enumeration:

def get_first_non_null_vect(df):
    a = df.values
    col_index = np.isnan(a).argmin(axis=1)
    return a[np.arange(a.shape[0]), col_index]

The improvement is small if the DataFrame is relatively flat:

In [4]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))

In [5]: %timeit get_first_non_null(df)
10 loops, best of 3: 34.9 ms per loop

In [6]: %timeit get_first_non_null_vect(df)
10 loops, best of 3: 31.6 ms per loop

... but can be relevant on slim DataFrames:

In [7]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 15), p=(0.1, 0.9)))

In [8]: %timeit get_first_non_null(df)
100 loops, best of 3: 3.75 ms per loop

In [9]: %timeit get_first_non_null_vect(df)
1000 loops, best of 3: 718 µs per loop

Compared to JoeCondron's vectorized version, the runtime is very similar (this is still slightly quicker for slim DataFrames, and slightly slower for large ones).

df=pandas.DataFrame({'A':[1, numpy.nan, numpy.nan, numpy.nan], 'B':[numpy.nan, 3, 4, numpy.nan], 'C':[2, numpy.nan, 5, numpy.nan]})

     A    B    C
0  1.0  NaN  2.0
1  NaN  3.0  NaN
2  NaN  4.0  5.0
3  NaN  NaN  NaN

df.apply(lambda x: numpy.nan if all(x.isnull()) else x[x.first_valid_index()], axis=1).tolist()
[1.0, 3.0, 4.0, nan]
