Iterate through all dataframe columns
I want to compare all rows of two given dataframes.
How can I optimize the following code to dynamically iterate through all columns of the given pandas dataframes?
import pandas as pd

df1, df2 = pd.read_csv(...), pd.read_csv(...)
i = 0  # counter of matching values for the current pair of rows
for index2, row2 in df2.iterrows():
    for index1, row1 in df1.iterrows():
        if row1[0] == row2[0]: i = i + 1
        if row1[1] == row2[1]: i = i + 1
        if row1[2] == row2[2]: i = i + 1
        if row1[3] == row2[3]: i = i + 1
        print("# same values: " + str(i))
        i = 0
IIUC, you need to check whether a whole row of one dataframe is equal to the corresponding row of the other. You can compare the two dataframes for equality, use the all method with axis=1 to check each row, and then sum the result:
df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'a': [1, 5, 3, 7, 5], 'b': [2, 3, 8, 5, 6]})
In [1531]: df1 == df2
Out[1531]:
       a      b
0   True   True
1  False   True
2   True  False
3  False   True
4   True   True
In [1532]: (df1 == df2).all(axis=1)
Out[1532]:
0 True
1 False
2 False
3 False
4 True
dtype: bool
In [1533]: (df1 == df2).all(axis=1).sum()
Out[1533]: 2
result = (df1 == df2).all(axis=1).sum()
In [1535]: print("# same values: "+str(result))
# same values: 2
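As a complement, here is a minimal sketch using the same df1 and df2 as above. It assumes both frames have the same shape and the same index and column labels. Summing along axis=1 without all gives the number of matching columns per row, which is closer to the counter i in the question, and it works for any number of columns. The names equal, matches_per_row and identical_rows are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'a': [1, 5, 3, 7, 5], 'b': [2, 3, 8, 5, 6]})

# Element-wise comparison over all columns at once.
equal = df1.eq(df2)

# Number of matching columns in each row (the per-row counter).
matches_per_row = equal.sum(axis=1)

# Number of rows in which every column matches.
identical_rows = int(equal.all(axis=1).sum())

print(matches_per_row.tolist())  # [2, 1, 1, 1, 2]
print("# same values: " + str(identical_rows))  # same values: 2
```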
Your nested for loop suggests that you are comparing all rows of the first DataFrame to all rows of the second DataFrame, and counting the cases where values in corresponding columns match.
If so, you can fall back on numpy broadcasting to sum the equal cases for each row in df1 relative to all rows in df2, and then sum these over all rows in df1 to get the total, like so:
df1.apply(lambda x: np.sum(df2.values == x.values), axis=1)
To illustrate, take two randomly sampled DataFrames:
df1 = pd.DataFrame(np.random.randint(1, 5, (10, 2)))
   0  1
0  2  4
1  2  3
2  4  1
3  3  3
4  3  3
5  4  4
6  2  4
7  3  4
8  3  4
9  4  1
df2 = pd.DataFrame(np.random.randint(1, 5, (10, 2)))
   0  1
0  3  2
1  3  4
2  4  4
3  2  3
4  4  3
5  4  1
6  4  1
7  3  4
8  3  1
9  1  4
Get the sum of equal values for each df1 row after comparing it to all df2 rows:
df1.apply(lambda x: np.sum(df2.values == x.values), axis=1)
0 5
1 3
2 7
3 6
4 6
5 8
6 5
7 8
8 8
9 7
You could then sum these per-row counts, or do it all in one go:
df1.apply(lambda x: np.sum(df2.values == x.values), axis=1).sum()
63
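The same total can also be computed without apply by broadcasting the two value arrays against each other directly. This is only a sketch and not part of the answer above: it hard-codes the sample values printed earlier so the result can be checked, and it assumes both frames have the same number of columns. The names a, b, equal, per_row and total are illustrative.

```python
import numpy as np
import pandas as pd

# Sample values copied from the printed df1 and df2 above.
df1 = pd.DataFrame([[2, 4], [2, 3], [4, 1], [3, 3], [3, 3],
                    [4, 4], [2, 4], [3, 4], [3, 4], [4, 1]])
df2 = pd.DataFrame([[3, 2], [3, 4], [4, 4], [2, 3], [4, 3],
                    [4, 1], [4, 1], [3, 4], [3, 1], [1, 4]])

a = df1.to_numpy()  # shape (10, 2)
b = df2.to_numpy()  # shape (10, 2)

# Compare every df1 row with every df2 row, column by column.
equal = a[:, None, :] == b[None, :, :]  # shape (10, 10, 2)

per_row = equal.sum(axis=(1, 2))  # matches for each df1 row
total = int(equal.sum())          # matches over all row pairs

print(per_row.tolist())  # [5, 3, 7, 6, 6, 8, 5, 8, 8, 7]
print(total)             # 63
```

The intermediate boolean array has shape (len(df1), len(df2), number of columns), so memory use grows with the product of the two row counts. For large frames the apply version, which handles one df1 row at a time, may be the safer choice.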