Compare cell value of first row of a dataframe to cell value of other rows
I have a dataframe with 50 columns and more than 200 rows of binary values:
a1 a2 a3 a4 ….. a50
0  1  0  1  ….. 1
1  0  0  1  ….. 0
0  1  1  0  ….. 0
1  1  1  0  ….. 1
I would like to compare the cell values of the first row to those of every other row, one row at a time, and create a 51st column that lists the non-matching columns, as below (since the first row is not compared with any row, it gets a NaN value):
a51
NaN
a1,a2,…,a50
a3,a4,…,a50
a1,a3,a4,…
I am not sure how to do this efficiently. I have not found any answer similar to this question. Sorry if I am asking a repeated question. Thank you in advance!
Setup
import numpy as np
import pandas as pd

# 200 rows x 50 columns of random 0/1 values, columns named a1..a50
df = pd.DataFrame(np.random.randint(2, size=(200, 50)),
                  columns=[f'a{i}' for i in range(1, 51)])
Use DataFrame.dot + DataFrame.add_suffix and Series.str.rstrip:
df['a51']=df.iloc[1:].ne(df.iloc[0]).dot(df.add_suffix(', ').columns).str.rstrip(', ')
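To see why the dot product produces column names, here is a minimal sketch on a four-row frame. The values mirror the sample data shown elsewhere in this thread, and the variable names mask and result are only illustrative. Multiplying True by a string gives the string and multiplying False gives an empty string, so the dot product concatenates the suffixed names of the mismatching columns, and rstrip removes the trailing separator.

```python
import pandas as pd

# Small illustrative frame (values assumed from the thread's sample data)
df = pd.DataFrame(
    [[0, 1, 0, 1, 1],
     [1, 0, 0, 1, 0],
     [0, 1, 1, 0, 0],
     [1, 1, 1, 0, 1]],
    columns=["a1", "a2", "a3", "a4", "a50"],
)

# True where a cell in rows 1..n differs from the same column in row 0
mask = df.iloc[1:].ne(df.iloc[0])

# Boolean-by-string dot product: concatenates "name, " for every True cell
result = mask.dot(df.add_suffix(", ").columns).str.rstrip(", ")

# Row 0 has no entry in result, so it receives NaN on assignment
df["a51"] = result
print(df)
```

Computing result before assigning the new column keeps a51 itself out of the comparison.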
Time comparison for 50 columns and 200 rows
%%timeit
df['a51'] = df.iloc[1:].ne(df.iloc[0]).dot(df.add_suffix(', ').columns).str.rstrip(', ')
25.4 ms ± 681 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
a = df.to_numpy()
m = np.where(a[0,:] != a[1:,None], df.columns, np.nan)
pd.DataFrame(m.squeeze()).stack().groupby(level=0).agg(', '.join)
41.1 ms ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df.iloc[1:].apply(lambda row: df.columns[df.iloc[0] != row].values, axis=1)
147 ms ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Here's one approach:
import numpy as np
import pandas as pd
a = df.to_numpy()
m = np.where(a[0,:] != a[1:,None], df.columns, np.nan)
pd.DataFrame(m.squeeze()).stack().groupby(level=0).agg(', '.join)
0 a1, a2, a50
1 a3, a4, a50
2 a1, a3, a4
dtype: object
Input data:
print(df)
a1 a2 a3 a4 a50
0 0 1 0 1 1
1 1 0 0 1 0
2 0 1 1 0 0
3 1 1 1 0 1
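For reproducibility, the following sketch rebuilds this input and applies the same broadcasting comparison against the first row. As a simplification that is my own assumption rather than part of the answer, it joins the names with a plain list comprehension instead of stack/groupby, which avoids relying on how stack treats NaN. The names mismatch, cols and joined are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the sample input shown above
df = pd.DataFrame(
    [[0, 1, 0, 1, 1],
     [1, 0, 0, 1, 0],
     [0, 1, 1, 0, 0],
     [1, 1, 1, 0, 1]],
    columns=["a1", "a2", "a3", "a4", "a50"],
)

a = df.to_numpy()

# Broadcasting: compare every row after the first against row 0
mismatch = a[1:] != a[0]          # boolean array of shape (3, 5)

# Select the mismatching column names for each row and join them
cols = df.columns.to_numpy()
joined = pd.Series(
    [", ".join(cols[row]) for row in mismatch],
    index=df.index[1:],
)
print(joined)
```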
I assume you want the list of column names that don't match the first row:
df['a51'] = df.iloc[1:].apply(lambda row: df.columns[df.iloc[0] != row].values, axis=1)
200 rows is small enough that apply(..., axis=1) is not a performance concern.
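If a comma-separated string is preferred over an array of names, as in the expected output in the question, the lambda can join the names itself. A minimal sketch on a small assumed frame (the variable names first and cols are illustrative):

```python
import pandas as pd

# Small illustrative frame (values assumed from the thread's sample data)
df = pd.DataFrame(
    [[0, 1, 0, 1, 1],
     [1, 0, 0, 1, 0],
     [0, 1, 1, 0, 0],
     [1, 1, 1, 0, 1]],
    columns=["a1", "a2", "a3", "a4", "a50"],
)

first = df.iloc[0]    # reference row
cols = df.columns     # captured before the new column is added

# For each later row, join the names of the columns that differ from row 0
df["a51"] = df.iloc[1:].apply(
    lambda row: ",".join(cols[(first != row).to_numpy()]),
    axis=1,
)
print(df)
```

The first row is absent from the apply result, so it is filled with NaN when the column is assigned.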