Pandas how do I compare columns for equality in a dataframe that have the same name except for a suffix?

Question

I have a dataframe that has a lot of columns in the following format:

id	col1_df1	col2_df1	col3_df1	col1_df2	col2_df2	col3_df2
A1	num	num	num	num	num	num
A2	num	num	num	num	num	num
A3	num	num	num	num	num	num
A4	num	num	num	num	num	num
A5	num	num	num	num	num	num

I need to compare the columns with the same name before the prefix. That is, I need to compare col1_df1 with col1_df2 , col2_df1 with col2_df2 , and so on. I need to check the percentage of equality in each "repeated" column (in each col1, in each col2, and so on). I can't do it manually because there are a lot of columns.

Is there a way to do that using Pandas functionality?

Thanks!

Answer 1

(df.groupby(lambda col: col.rpartition("_")[0], axis="columns")
   .diff().dropna(how="all", axis="columns")
   .eq(0).sum()
   .div(len(df)).mul(100)
   .rename(lambda idx: idx.rpartition("_")[0]))

group the dataframe over columns' names before the last "_"
take the difference of each group within itself
- since there's nothing left to first columns of each group, they will be NaNs, so drop them
check where differences are equal to 0, ie, repetition: sum gives the count
then percentify it
lastly obtain the group names (before the last _) for the end result

to get something like (for some random data)

col1    40.0
col2     0.0
col3    33.3
dtype: float64

Answer 2

Since there are already some reasonable answers, I'll take a guess and give a slightly different answer just in case it is useful.

Based on your column names it seems you are getting this dataframe from two dataframes where both have columns like: col1 , col2 , etc. If that's the case, I would take a step back and do it like this:

In [33]: df1 = pd.DataFrame(np.random.randint(4, size=(5, 3)),
    ...:       columns=[f"col1", "col2", "col3"],
    ...:       index=["A1", "A2", "A3", "A4", "A5"])

In [34]: df2 = pd.DataFrame(np.random.randint(4, size=(5, 3)),
    ...:       columns=[f"col1", "col2", "col3"],
    ...:       index=["A1", "A2", "A3", "A4", "A5"])

In [35]: df1
Out[35]: 
    col1  col2  col3
A1     0     0     1
A2     3     1     0
A3     0     2     3
A4     1     1     0
A5     3     0     1

In [36]: df2 
Out[36]: 
    col1  col2  col3
A1     1     2     3
A2     3     2     2
A3     3     0     3
A4     0     3     2
A5     2     0     1

In [37]: df2 == df1
Out[37]: 
     col1   col2   col3
A1  False  False  False
A2   True  False  False
A3  False  False   True
A4  False  False  False
A5  False   True   True

In [38]: res = df2 == df1

In [39]: res.sum(axis=0)
Out[39]: 
col1    1
col2    1
col3    2
dtype: int64

To get the above as percentage, you could do this:

In [40]: res.sum(axis=0) / len(res) * 100
Out[40]: 
col1    20.0
col2    20.0
col3    40.0
dtype: float64

Answer 3

For instance let's say we have the following dataframe:

import pandas as pd

df = pd.DataFrame({'ID': ['A1', 'A2', 'A3', 'A4', 'A5'],
                   'col1_df1': [11, 11, 11, 11, 11],
                   'col2_df1': [10, 11, 12, 13, 14],
                   'col3_df1': [9, 11, 9, 11, 9],
                   'col1_df2': [8, 11, 7, 10, 6],
                   'col2_df2': [7, 8, 9, 10, 11],
                   'col3_df2': [6, 11, 6, 11, 6],
                   })

Visualization:

   ID  col1_df1  col2_df1  col3_df1  col1_df2  col2_df2  col3_df2
0  A1        11        10         9         8         7         6
1  A2        11        11        11        11         8        11
2  A3        11        12         9         7         9         6
3  A4        11        13        11        10        10        11
4  A5        11        14         9         6        11         6

Comparison with using of the following script:

r = (df.groupby(lambda x: x.split('_')[0], axis=1)
     .diff(axis=1)
     .dropna(how='all', axis=1) 
     .rename(columns = {'col1_df2':'diff1', 'col2_df2':'diff2', 'col3_df2':'diff3'}, inplace = False))

And then r is:

   diff1  diff2  diff3
0     -3     -3     -3
1      0     -3      0
2     -4     -3     -3
3     -1     -3      0
4     -5     -3     -3

In r , 0 represents equality.

In r column named diff1 (comparison between col1_df1 and col1_df2 ), we get perc of 0 as follows:

perc = float(eval('{0}/{1}'.format(r['diff1'].value_counts()[0], len(r['diff1']))))

print(per)
# 0.2 (ie 1 out of 5)

Complete proposed script:

import pandas as pd

df = pd.DataFrame({'ID': ['A1', 'A2', 'A3', 'A4', 'A5'],
                   'col1_df1': [11, 11, 11, 11, 11],
                   'col2_df1': [10, 11, 12, 13, 14],
                   'col3_df1': [9, 11, 9, 11, 9],
                   'col1_df2': [8, 11, 7, 10, 6],
                   'col2_df2': [7, 8, 9, 10, 11],
                   'col3_df2': [6, 11, 6, 11, 6],
                   })

r = (df.groupby(lambda x: x.split('_')[0], axis=1)
     .diff(axis=1)
     .dropna(how='all', axis=1) 
     .rename(columns = {'col1_df2':'diff1', 'col2_df2':'diff2', 'col3_df2':'diff3'}, inplace = False))

 
perc = float(eval('{0}/{1}'.format(r['diff1'].value_counts()[0], len(r['diff1']))))

print(per)
# 0.2 (ie 1 out of 5)

Pandas how do I compare columns for equality in a dataframe that have the same name except for a suffix?

Question

3 answers

solution1
2 2023-01-13 17:47:42

solution2
1 2023-01-14 01:53:04

solution3
0 2023-01-13 21:48:35

Pandas how do I compare columns for equality in a dataframe that have the same name except for a suffix?

Question

3 answers

solution1 2 2023-01-13 17:47:42

solution2 1 2023-01-14 01:53:04

solution3 0 2023-01-13 21:48:35

solution1
2 2023-01-13 17:47:42

solution2
1 2023-01-14 01:53:04

solution3
0 2023-01-13 21:48:35