Compare cell values of the first row of a DataFrame with those of the other rows

Question

I have a DataFrame with 50 columns and more than 200 rows of binary values:

a1  a2  a3  a4  ….. a50
0   1   0   1   ….. 1
1   0   0   1   ….  0
0   1   1   0   ….  0
1   1   1   0   ….  1

I would like to compare the cell values of the first row with those of every other row, one row at a time, and create a 51st column that lists the non-matching columns, as shown below. (Since the first row is not compared with any row, it gets a NaN value.)

a51
NaN
a1,a2,…,a50
a3,a4…,a50
a1,a3,a4,…

I am not sure how to do this efficiently. I have not found any answer similar to this question. Sorry if this is a duplicate. Thank you in advance!

Answer 1

Setup

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(2,size=(200,50)),
                  columns =[f'a{i}' for i in range(1,51)])

`DataFrame.dot` + `DataFrame.add_suffix` and `Series.str.rstrip`

df['a51']=df.iloc[1:].ne(df.iloc[0]).dot(df.add_suffix(', ').columns).str.rstrip(', ')
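
To see why the one-liner works, it may help to run it step by step on a small frame. The sketch below assumes a four-row, five-column sample (`a1` to `a4` plus `a50`) instead of the full 50 columns; the intermediate names `mismatch`, `labels`, and `joined` are illustrative and not part of the original answer. The idea is that multiplying `True` by a string keeps the string, multiplying `False` by it gives an empty string, and the dot product concatenates the results across each row.

```python
import pandas as pd

# Small sample frame (assumed for illustration).
df = pd.DataFrame(
    [[0, 1, 0, 1, 1],
     [1, 0, 0, 1, 0],
     [0, 1, 1, 0, 0],
     [1, 1, 1, 0, 1]],
    columns=['a1', 'a2', 'a3', 'a4', 'a50'],
)

# Boolean frame: True where a row differs from the first row.
mismatch = df.iloc[1:].ne(df.iloc[0])

# Column labels with a trailing separator: 'a1, ', 'a2, ', ...
labels = df.add_suffix(', ').columns

# True * 'a1, ' -> 'a1, ' and False * 'a1, ' -> '', summed (concatenated) per row.
joined = mismatch.dot(labels)

# Drop the trailing ', ' and assign; row 0 is absent from `joined`, so it becomes NaN.
df['a51'] = joined.str.rstrip(', ')
print(df)
```

Because the assignment aligns on the index, the first row receives NaN without any special handling.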

Timing comparison for 50 columns and 200 rows

%%timeit
df['a51'] = df.iloc[1:].ne(df.iloc[0]).dot(df.add_suffix(', ').columns).str.rstrip(', ')
25.4 ms ± 681 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


%%timeit
a = df.to_numpy()
m = np.where(a[0,:] != a[1:,None], df.columns, np.nan)
pd.DataFrame(m.squeeze()).stack().groupby(level=0).agg(', '.join)
41.1 ms ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


%%timeit
df.iloc[1:].apply(lambda row: df.columns[df.iloc[0] != row].values, axis=1)
147 ms ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
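
`%%timeit` is an IPython cell magic, so the measurements above only run inside IPython or Jupyter. A rough equivalent for a plain Python script can be sketched with the standard library's `timeit` module. The helper names `with_dot` and `with_apply` are hypothetical, the `apply` variant joins the names into a string so both functions return the same kind of result, and the numbers will differ from those above depending on machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Random binary frame of the same shape as in the setup (seeded for repeatability).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 2, size=(200, 50)),
                  columns=[f'a{i}' for i in range(1, 51)])


def with_dot(frame):
    # Dot-product approach; returns the result without modifying the frame.
    mismatch = frame.iloc[1:].ne(frame.iloc[0])
    return mismatch.dot(frame.add_suffix(', ').columns).str.rstrip(', ')


def with_apply(frame):
    # Row-wise apply approach, joined into the same string format.
    first = frame.iloc[0]
    return frame.iloc[1:].apply(
        lambda row: ', '.join(frame.columns[(first != row).to_numpy()]),
        axis=1,
    )


for fn in (with_dot, with_apply):
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(lambda: fn(df), number=10, repeat=3)) / 10
    print(f'{fn.__name__}: {best * 1e3:.2f} ms per call')
```

Neither function assigns the result back to `df`, so every timed call sees the same 50 columns.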

Answer 2

Here's one approach:

import numpy as np
import pandas as pd

a = df.to_numpy()
m = np.where(a[0,:] != a[1:,None], df.columns, np.nan)
pd.DataFrame(m.squeeze()).stack().groupby(level=0).agg(', '.join)

0    a1, a2, a50
1    a3, a4, a50
2     a1, a3, a4
dtype: object

Input data:

print(df)

   a1  a2  a3  a4  a50
0   0   1   0   1    1
1   1   0   0   1    0
2   0   1   1   0    0
3   1   1   1   0    1
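
The broadcast comparison in this answer can also be written without the extra axis and `squeeze`, since NumPy broadcasts a 1-D row against a 2-D block directly. The following is a minimal sketch of that variant, not the answer's own code; the names `cols`, `mask`, `names`, and `result` are illustrative.

```python
import numpy as np
import pandas as pd

# Same input data as printed above.
df = pd.DataFrame(
    [[0, 1, 0, 1, 1],
     [1, 0, 0, 1, 0],
     [0, 1, 1, 0, 0],
     [1, 1, 1, 0, 1]],
    columns=['a1', 'a2', 'a3', 'a4', 'a50'],
)

a = df.to_numpy()
cols = df.columns.to_numpy()

# a[0] has shape (5,) and a[1:] has shape (3, 5); broadcasting compares
# the first row against every remaining row, giving a (3, 5) boolean mask.
mask = a[1:] != a[0]

# Select the column names where each row differs and join them.
names = [', '.join(cols[row]) for row in mask]

# Index by the original row labels so the result can be assigned back to df.
result = pd.Series(names, index=df.index[1:])
print(result)
```

Here `result` is indexed by the original row labels 1 to 3, whereas the `stack`/`groupby` output above is numbered from 0.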

Answer 3

I assume you want the list of column names that don't match the first row:

df['a51'] = df.iloc[1:].apply(lambda row: df.columns[df.iloc[0] != row].values, axis=1)

200 rows is small enough that `apply(..., axis=1)` is not a performance concern.
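
The `apply` call above stores an array of column names in each cell. If the comma-separated string format from the question is wanted instead, one possible variation joins the names inside the lambda. This is a sketch on a small assumed sample frame, and the names `first` and `mismatch` are illustrative.

```python
import pandas as pd

# Small sample frame (assumed for illustration).
df = pd.DataFrame(
    [[0, 1, 0, 1, 1],
     [1, 0, 0, 1, 0],
     [0, 1, 1, 0, 0],
     [1, 1, 1, 0, 1]],
    columns=['a1', 'a2', 'a3', 'a4', 'a50'],
)

first = df.iloc[0]

# Boolean frame: True where a row differs from the first row.
mismatch = df.iloc[1:].ne(first)

# For each row, keep the column labels flagged True and join them with commas.
df['a51'] = mismatch.apply(
    lambda row: ','.join(row.index[row.to_numpy()]),
    axis=1,
)
print(df)
```

A row identical to the first row would get an empty string, while the first row itself gets NaN through index alignment.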

Answer 1: score 2, accepted, 2020-01-16 13:27:42
Answer 2: score 1, 2020-01-16 13:27:26
Answer 3: score 0, 2020-01-16 13:34:01
