
Iterate through all dataframe columns

I want to compare all rows of two given dataframes.

How can I optimize the following code to dynamically iterate through all columns of the given pandas dataframes?

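import pandas as pd

df1 = pd.read_csv(...)
df2 = pd.read_csv(...)

i = 0  # number of matching columns for the current pair of rows
for index2, row2 in df2.iterrows():
    for index1, row1 in df1.iterrows():
        # compare the first four columns by position
        if row1.iloc[0] == row2.iloc[0]: i = i+1
        if row1.iloc[1] == row2.iloc[1]: i = i+1
        if row1.iloc[2] == row2.iloc[2]: i = i+1
        if row1.iloc[3] == row2.iloc[3]: i = i+1
        print("# same values: "+str(i))
        i = 0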

IIUC, you need to check whether a whole row of one dataframe is equal to the corresponding row of the other. You can compare the two dataframes for equality (they must have identical index and column labels), use the all method with axis=1 to check each row, and then sum the result:

df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'a': [1, 5, 3, 7, 5], 'b': [2, 3, 8, 5, 6]})

In [1531]: df1 == df2
Out[1531]: 
       a      b
0   True   True
1  False   True
2   True  False
3  False   True
4   True   True

In [1532]: (df1 == df2).all(axis=1)
Out[1532]: 
0     True
1    False
2    False
3    False
4     True
dtype: bool

In [1533]: (df1 == df2).all(axis=1).sum()
Out[1533]: 2

result = (df1 == df2).all(axis=1).sum()

In [1535]: print("# same values: "+str(result))
# same values: 2
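
The steps above can be wrapped into a small helper. This is only a sketch: the name count_identical_rows is made up for illustration, and it assumes both dataframes have the same shape and that rows should be matched by position rather than by index label (it compares the underlying arrays, so labels are ignored).

```python
import pandas as pd


def count_identical_rows(left: pd.DataFrame, right: pd.DataFrame) -> int:
    """Count the row positions at which every column value is equal.

    Hypothetical helper, not part of pandas.
    """
    if left.shape != right.shape:
        raise ValueError("both dataframes must have the same shape")
    # Compare the underlying arrays so that rows are matched by position;
    # index and column labels are ignored.
    same = left.to_numpy() == right.to_numpy()
    # A row counts only if all of its columns match.
    return int(same.all(axis=1).sum())


df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'a': [1, 5, 3, 7, 5], 'b': [2, 3, 8, 5, 6]})

result = count_identical_rows(df1, df2)
print("# same values: " + str(result))
```

Because no column names appear in the helper, it works for any number of columns.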

Your nested for loop suggests that you are comparing all rows of the first DataFrame to all rows of the second DataFrame, and counting the cases where values in corresponding columns match.

If so, you can rely on numpy broadcasting to count the equal values for each row in df1 relative to all rows in df2, and then sum these counts over all rows in df1 to get the total. The per-row step looks like this:

df1.apply(lambda x: np.sum(df2.values == x.values), axis=1)

To illustrate, here are two randomly sampled DataFrames:

df1 = pd.DataFrame(np.random.randint(1, 5, (10, 2)))

   0  1
0  2  4
1  2  3
2  4  1
3  3  3
4  3  3
5  4  4
6  2  4
7  3  4
8  3  4
9  4  1

df2 = pd.DataFrame(np.random.randint(1, 5, (10, 2)))

   0  1
0  3  2
1  3  4
2  4  4
3  2  3
4  4  3
5  4  1
6  4  1
7  3  4
8  3  1
9  1  4

Get the sum of equal values for each df1 row after comparing it to all df2 rows:

df1.apply(lambda x: np.sum(df2.values == x.values), axis=1)

0    5
1    3
2    7
3    6
4    6
5    8
6    5
7    8
8    8
9    7

You can then sum these counts, or do it all in one go:

df1.apply(lambda x: np.sum(df2.values == x.values), axis=1).sum()

63
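
If the count for each individual pair of rows is needed (which is what the loop in the question prints), the same comparison can be done without apply by broadcasting both arrays against each other. The following is a minimal sketch under a few assumptions: the two sample frames above are typed in by hand so the numbers are reproducible, both frames have the same number of columns, and the variable names (equal, pair_counts, per_row, total) are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# The two sample frames shown above, entered by hand for reproducibility.
df1 = pd.DataFrame([[2, 4], [2, 3], [4, 1], [3, 3], [3, 3],
                    [4, 4], [2, 4], [3, 4], [3, 4], [4, 1]])
df2 = pd.DataFrame([[3, 2], [3, 4], [4, 4], [2, 3], [4, 3],
                    [4, 1], [4, 1], [3, 4], [3, 1], [1, 4]])

a = df1.to_numpy()
b = df2.to_numpy()

# Shape (len(df1), len(df2), n_columns): True where a[i, k] == b[j, k].
equal = a[:, None, :] == b[None, :, :]

# pair_counts[i, j] = number of columns in which df1 row i matches df2 row j.
pair_counts = equal.sum(axis=2)

per_row = pair_counts.sum(axis=1)  # same numbers as the apply() version
total = int(pair_counts.sum())

print(per_row.tolist())
print(total)
```

Note that the intermediate boolean array holds len(df1) * len(df2) * n_columns values, so for large frames the apply version may be easier on memory.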

