Find the set of column indices for non-zero values in each row of a pandas data frame

Question

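Is there a good way to find the set of column indices for the non-zero values in each row of a pandas data frame? Do I have to traverse the data frame row by row?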

For example, the data frame is

c1  c2  c3  c4 c5 c6 c7 c8  c9
 1   1   0   0  0  0  0  0   0
 1   0   0   0  0  0  0  0   0
 0   1   0   0  0  0  0  0   0
 1   0   0   0  0  0  0  0   0
 0   1   0   0  0  0  0  0   0
 0   0   0   0  0  0  0  0   0
 0   2   1   1  1  1  1  0   2
 1   5   5   0  0  1  0  4   6
 4   3   0   1  1  1  1  5  10
 3   5   2   4  1  2  2  1   3
 6   4   0   1  0  0  0  0   0
 3   9   1   0  1  0  2  1   0

The expected output is

['c1','c2']
['c1']
['c2']
...
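
For anyone who wants to try the answers below, here is a minimal sketch that rebuilds the sample frame from the table above. The values are copied from the question; the variable names `columns` and `rows` are only illustrative, and the default integer row index is an assumption.

```python
import pandas as pd

# Column names and values copied from the table in the question.
columns = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9"]
rows = [
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 2, 1, 1, 1, 1, 1, 0, 2],
    [1, 5, 5, 0, 0, 1, 0, 4, 6],
    [4, 3, 0, 1, 1, 1, 1, 5, 10],
    [3, 5, 2, 4, 1, 2, 2, 1, 3],
    [6, 4, 0, 1, 0, 0, 0, 0, 0],
    [3, 9, 1, 0, 1, 0, 2, 1, 0],
]

# Assumption: the rows carry the default 0..11 integer index.
df = pd.DataFrame(rows, columns=columns)
print(df.shape)
```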

Answer 1

It seems you do have to traverse the DataFrame row by row.

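cols = df.columns
# boolean frame: True where the value is positive (use df != 0 if negative values can occur)
bt = df.apply(lambda x: x > 0)
# for each row, keep the column labels where the mask is True
bt.apply(lambda x: list(cols[x.values]), axis=1)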

You will get:

0                                 [c1, c2]
1                                     [c1]
2                                     [c2]
3                                     [c1]
4                                     [c2]
5                                       []
6             [c2, c3, c4, c5, c6, c7, c9]
7                 [c1, c2, c3, c6, c8, c9]
8         [c1, c2, c4, c5, c6, c7, c8, c9]
9     [c1, c2, c3, c4, c5, c6, c7, c8, c9]
10                            [c1, c2, c4]
11                [c1, c2, c3, c5, c7, c8]
dtype: object
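
A variation on the same idea, offered only as a sketch: build the mask with `df.ne(0)` so that negative values also count as non-zero, then index the column labels once per row of the mask. The three-row frame here is hypothetical and just stands in for the question's data.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the question's data.
df = pd.DataFrame({"c1": [1, 0, 0], "c2": [1, -2, 0], "c3": [0, 3, 0]})

cols = df.columns
mask = df.ne(0).to_numpy()  # True wherever the value is non-zero

# One list of column labels per row; all-zero rows give an empty list.
nonzero_cols = pd.Series([list(cols[row]) for row in mask], index=df.index)
print(nonzero_cols)
```

This still visits every row, but it does so in a plain list comprehension rather than through a second `apply`.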

If performance matters, try passing raw=True when creating the boolean DataFrame, like this:

%timeit df.apply(lambda x: x > 0, raw=True).apply(lambda x: list(cols[x.values]), axis=1)
1000 loops, best of 3: 812 µs per loop

That gives a noticeable speedup. Here is the result with raw=False (the default):

%timeit df.apply(lambda x: x > 0).apply(lambda x: list(cols[x.values]), axis=1)
100 loops, best of 3: 2.59 ms per loop
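
The `%timeit` lines above only work inside IPython. A rough equivalent with the standard `timeit` module might look like the sketch below. The small frame and the helper names `with_raw` and `without_raw` are hypothetical, and the absolute numbers will differ with data size, machine, and pandas version.

```python
import timeit

import pandas as pd

# Hypothetical small frame; timings depend on data size and pandas version.
df = pd.DataFrame({"c1": [1, 0, 4], "c2": [1, 0, 3], "c3": [0, 0, 0]})
cols = df.columns


def with_raw():
    # raw=True hands each column to the lambda as a NumPy array
    return df.apply(lambda x: x > 0, raw=True).apply(
        lambda x: list(cols[x.values]), axis=1
    )


def without_raw():
    # default raw=False hands each column to the lambda as a Series
    return df.apply(lambda x: x > 0).apply(
        lambda x: list(cols[x.values]), axis=1
    )


# Both variants should produce the same lists; only the run time differs.
assert with_raw().tolist() == without_raw().tolist()

n = 50
print("raw=True :", timeit.timeit(with_raw, number=n) / n, "s per call")
print("raw=False:", timeit.timeit(without_raw, number=n) / n, "s per call")
```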

Answer 2

A potentially better data structure (rather than a Series of lists) is the stacked form:

In [11]: res = df[df!=0].stack()

In [12]: res
Out[12]:
0   c1     1
    c2     1
1   c1     1
2   c2     1
3   c1     1
...

You can then iterate over the original rows, for example:

In [13]: res.loc[0]
Out[13]:
c1    1
c2    1
dtype: float64

In [14]: res.loc[0].index
Out[14]: Index(['c1', 'c2'], dtype='object')
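
If the per-row lists are still wanted, the (row, column) pairs in the stacked index can be collected into a dict. This is only a sketch on a hypothetical small frame; the name `cols_by_row` is made up for illustration, and the explicit `dropna()` is there because, depending on the pandas version, `stack()` may keep the NaN entries.

```python
import pandas as pd

# Hypothetical small frame standing in for the question's data.
df = pd.DataFrame({"c1": [1, 0, 0], "c2": [1, 0, 2], "c3": [0, 0, 5]})

# Mask zeros as NaN, stack, and drop the NaN entries explicitly
# (depending on the pandas version, stack() may not drop them itself).
res = df[df != 0].stack().dropna()

# The index of res holds (row label, column label) pairs.
cols_by_row = {}
for row, col in res.index:
    cols_by_row.setdefault(row, []).append(col)

print(cols_by_row)
```

Rows with no non-zero values do not appear in the stacked result at all, so they are missing from the dict rather than mapped to an empty list.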

Note: I think you used to be able to return a list from an apply (to create a DataFrame with list elements), but that no longer seems to be the case.

Answer 3

How about this approach?

import numpy as np

# create a True / False data frame
df_boolean = df > 0

# a little helper that uses boolean indexing internally
def bar(x, columns):
    return ','.join(columns[x.values])

# apply the helper to each row (axis=1)
df_boolean['result'] = df_boolean.apply(lambda x: bar(x, df_boolean.columns), axis=1)

# filter out the empty "rows" and grab the result column
df_result = df_boolean[df_boolean['result'] != '']['result']

# append an axis, just so each line will be output as a list
lst_result = df_result.values[:, np.newaxis]

print('\n'.join(str(myelement) for myelement in lst_result))

This produces:

['c1,c2']
['c1']
['c2']
['c1']
['c2']
['c2,c3,c4,c5,c6,c7,c9']
['c1,c2,c3,c6,c8,c9']
['c1,c2,c4,c5,c6,c7,c8,c9']
['c1,c2,c3,c4,c5,c6,c7,c8,c9']
['c1,c2,c4']
['c1,c2,c3,c5,c7,c8']

Answer 4

If you only want to locate the non-zero values, numpy.argwhere() and nonzero() are both one-liners.

import numpy as np

nzero = np.argwhere(df.to_numpy())
# nzero is an array of two-element arrays [irow, icol]
nz = df.to_numpy().nonzero()
# Alternatively, nz is a tuple of NumPy 1-D arrays of corresponding indices
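
To make the two return shapes concrete, here is a tiny sketch on a hypothetical 2x3 array (not the question's data):

```python
import numpy as np

# Hypothetical 2x3 array used only to show the two return shapes.
a = np.array([[1, 0, 2],
              [0, 0, 3]])

pairs = np.argwhere(a)          # one [irow, icol] pair per non-zero entry
row_idx, col_idx = a.nonzero()  # two parallel 1-D index arrays

print(pairs.tolist())    # [[0, 0], [0, 2], [1, 2]]
print(row_idx.tolist())  # [0, 0, 1]
print(col_idx.tolist())  # [0, 2, 2]
```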

But to get the per-row output that was asked for, I can't think of a way to avoid a loop over the rows. The accepted answer is much shorter.

pairit = iter(nzero)
pair = next(pairit, None)  # None if the frame has no non-zero values at all
for irow in range(len(df)):
    # want one list for each row
    cols = []
    # consume every [irow, icol] pair that belongs to the current row
    while pair is not None and pair[0] == irow:
        cols.append(df.columns[pair[1]])
        pair = next(pairit, None)
    print(irow, cols)
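
One possible way to shrink the explicit loop, offered as a sketch rather than a recommendation: count the non-zero entries per row with `np.bincount` and cut the flat array of column labels at the cumulative counts with `np.split`. The small frame and the names `labels`, `counts`, `chunks`, and `cols_per_row` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for the question's data.
df = pd.DataFrame({"c1": [1, 0, 0], "c2": [1, 0, 2], "c3": [0, 0, 5]})

row_idx, col_idx = df.to_numpy().nonzero()  # indices come back in row-major order
labels = df.columns.to_numpy()[col_idx]     # column label of each non-zero entry

# Count the non-zero entries per row, then cut the flat label array at
# the cumulative counts so that each row gets its own chunk.
counts = np.bincount(row_idx, minlength=len(df))
chunks = np.split(labels, np.cumsum(counts)[:-1])

cols_per_row = [chunk.tolist() for chunk in chunks]
print(cols_per_row)
```

The list comprehension at the end still touches every row, so this moves the grouping into NumPy rather than removing the loop entirely.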

4 answers

Answer 1: score 10, accepted, 2015-09-24 19:50:38
Answer 2: score 3, 2015-09-24 19:02:25
Answer 3: score 2, 2015-09-24 20:07:49
Answer 4: score 0, 2022-08-05 18:43:24
