I want to compare all rows of two given DataFrames.
How can I optimize the following code so that it dynamically iterates through all columns of the given pandas DataFrames?
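import pandas as pd

df1 = pd.read_csv(...)
df2 = pd.read_csv(...)

i = 0  # matches between the current df2 row and all df1 rows
for index2, row2 in df2.iterrows():
    for index1, row1 in df1.iterrows():
        if row1.iloc[0] == row2.iloc[0]: i = i + 1
        if row1.iloc[1] == row2.iloc[1]: i = i + 1
        if row1.iloc[2] == row2.iloc[2]: i = i + 1
        if row1.iloc[3] == row2.iloc[3]: i = i + 1
    print("# same values: " + str(i))
    i = 0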
IIUC, you need to check whether a whole row of one DataFrame is equal to the corresponding row of the other. You can compare the two DataFrames for equality, then use the all method with axis=1 to check whole rows, and finally sum the result:
df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'a': [1, 5, 3, 7, 5], 'b': [2, 3, 8, 5, 6]})
In [1531]: df1 == df2
Out[1531]:
       a      b
0   True   True
1  False   True
2   True  False
3  False   True
4   True   True
In [1532]: (df1 == df2).all(axis=1)
Out[1532]:
0     True
1    False
2    False
3    False
4     True
dtype: bool
In [1533]: (df1 == df2).all(axis=1).sum()
Out[1533]: 2
In [1534]: result = (df1 == df2).all(axis=1).sum()
In [1535]: print("# same values: "+str(result))
# same values: 2
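The steps above can be wrapped in a small helper. This is only a minimal sketch: the name count_identical_rows is hypothetical, and it assumes both DataFrames have identical index and column labels, since comparing DataFrames with == raises a ValueError otherwise. It compares row i of one frame with row i of the other, and NaN values never count as equal.

```python
import pandas as pd


def count_identical_rows(left: pd.DataFrame, right: pd.DataFrame) -> int:
    """Count positions where the whole row of `left` equals the row of `right`.

    Hypothetical helper; assumes both frames share the same index and columns.
    """
    # Element-wise comparison gives a boolean DataFrame of the same shape.
    equal = left == right
    # all(axis=1) is True only for rows where every column matched.
    return int(equal.all(axis=1).sum())


df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'a': [1, 5, 3, 7, 5], 'b': [2, 3, 8, 5, 6]})

result = count_identical_rows(df1, df2)
print("# same values: " + str(result))
```

Because the comparison is made on the whole frame, the helper works for any number of columns without listing them one by one.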
Your nested for loop suggests that you are comparing all rows of the first DataFrame to all rows of the second DataFrame and counting the cases where values in corresponding columns match. If so, you can fall back on numpy broadcasting to sum the equal cases for each row in df1 relative to all rows in df2, and then sum these over all rows in df1 to get the total, like so:
df1.apply(lambda x: np.sum(df2.values == x.values), axis=1)
To illustrate, take two randomly sampled DataFrames:
df1 = pd.DataFrame(np.random.randint(1, 5, (10, 2)))
   0  1
0 2 4
1 2 3
2 4 1
3 3 3
4 3 3
5 4 4
6 2 4
7 3 4
8 3 4
9 4 1
df2 = pd.DataFrame(np.random.randint(1, 5, (10, 2)))
   0  1
0 3 2
1 3 4
2 4 4
3 2 3
4 4 3
5 4 1
6 4 1
7 3 4
8 3 1
9 1 4
Get the sum of equal values for all df1 rows after comparing each to all df2 rows:
df1.apply(lambda x: np.sum(df2.values == x.values), axis=1)
0 5
1 3
2 7
3 6
4 6
5 8
6 5
7 8
8 8
9 7
And you could then sum the cases, or do it all in one go:
df1.apply(lambda x: np.sum(df2.values == x.values), axis=1).sum()
63
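As a cross-check, the apply-based count can be compared with a plain nested loop that handles any number of columns, and with a variant that uses broadcasting alone. This is a sketch under assumptions: the function names and the sample data are made up for illustration, both DataFrames are assumed to have the same number of columns in the same order, and the broadcast version assumes the intermediate (n1, n2, k) boolean array fits in memory.

```python
import numpy as np
import pandas as pd


def count_matches_loop(df1: pd.DataFrame, df2: pd.DataFrame) -> int:
    """Reference version: nested loops over rows, any number of columns."""
    total = 0
    for _, row2 in df2.iterrows():
        for _, row1 in df1.iterrows():
            # Compare all columns of the two rows at once, by position.
            total += int((row1.values == row2.values).sum())
    return total


def count_matches_broadcast(df1: pd.DataFrame, df2: pd.DataFrame) -> int:
    """Same count using NumPy broadcasting only (no apply, no Python loop)."""
    a = df1.to_numpy()[:, None, :]  # shape (n1, 1, k)
    b = df2.to_numpy()[None, :, :]  # shape (1, n2, k)
    # Broadcasting yields an (n1, n2, k) boolean array of pairwise matches.
    return int((a == b).sum())


# Small made-up frames, chosen only for illustration.
df1 = pd.DataFrame([[1, 2], [3, 4], [1, 4]])
df2 = pd.DataFrame([[1, 4], [3, 2]])

via_apply = int(df1.apply(lambda x: np.sum(df2.values == x.values), axis=1).sum())
via_loop = count_matches_loop(df1, df2)
via_broadcast = count_matches_broadcast(df1, df2)
print(via_apply, via_loop, via_broadcast)
```

All three approaches should agree. The broadcast version trades memory for speed, so the apply version may be the safer choice for large inputs.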