I have a pandas DataFrame with 100 columns. I want to perform an operation that compares every possible pair of columns with each other (col1 vs. col2, col1 vs. col3, [...], col99 vs. col100).
For example:
colA colB colC colD
1 1 2 1
So, for example, an equality comparison between two columns should yield yes for colA vs. colB and no for colA vs. colC.
Ideally, I would like to make only unique comparisons: colA vs. colB is the same as colB vs. colA, so only one of the two should be retained.
Is there an efficient way to do this?
The first thing I would do is define the comparison for a single pair of columns, for example:
(df['col1'] == df['col2']).any()
What we need next is every combination of two columns:
from itertools import combinations
combs = list(combinations(df.columns, 2))
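Now we can loop through the pairs and compare them, using the single row from the example above (with more than one row, use .all() instead of .any() so that every row has to match):
for cmb in combs:
    print((df[cmb[0]] == df[cmb[1]]).any())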
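The loop above can also be written so that it keeps the results instead of printing them. The following is only a minimal sketch: it rebuilds the sample row from the question, the names `pair_equal`, `left` and `right` are made up for illustration, and it uses `.all()` on the assumption that two columns count as equal only when every row matches.

```python
from itertools import combinations

import pandas as pd

# Sample frame from the question (a single row).
df = pd.DataFrame({"colA": [1], "colB": [1], "colC": [2], "colD": [1]})

# combinations() yields each unordered pair once, so (colA, colB) is
# present but (colB, colA) is not.
pair_equal = {
    (left, right): bool((df[left] == df[right]).all())
    for left, right in combinations(df.columns, 2)
}

for pair, is_equal in pair_equal.items():
    print(pair, is_equal)
```

For 100 columns this performs 4950 column comparisons, each of which is vectorized over the rows.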
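Alternatively, for numeric data you can use SciPy's pairwise distances between columns (a distance of 0 means the two columns are identical):
import itertools
import pandas as pd
from scipy.spatial.distance import pdist
pd.Series(pdist(df.T) == 0, index=list(itertools.combinations(df.columns, 2)))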
output:
(colA, colB) True
(colA, colC) False
(colA, colD) True
(colB, colC) False
(colB, colD) True
(colC, colD) False
alternative as matrix:
import pandas as pd
from scipy.spatial.distance import pdist, squareform
pd.DataFrame(squareform(pdist(df.T)) == 0, index=df.columns, columns=df.columns)
output:
colA colB colC colD
colA True True False True
colB True True False True
colC False False True False
colD True True False True
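If SciPy is not available, or the columns are not numeric, a similar matrix can be built with plain NumPy broadcasting. This is a different technique from the pdist approach above, shown only as a sketch. The variable names (`values`, `equal`, `matrix`, `unique_pairs`) are illustrative, and the frame is the sample row from the question.

```python
import numpy as np
import pandas as pd

# Sample frame from the question (a single row).
df = pd.DataFrame({"colA": [1], "colB": [1], "colC": [2], "colD": [1]})

# One row per original column: shape (n_cols, n_rows).
values = df.to_numpy().T

# Compare every column with every other column element-wise, then require
# all rows to match. The intermediate array has shape (n_cols, n_cols, n_rows).
equal = (values[:, None, :] == values[None, :, :]).all(axis=2)
matrix = pd.DataFrame(equal, index=df.columns, columns=df.columns)

# Keep only the upper triangle (k=1 skips the diagonal), so each
# unordered pair appears exactly once.
rows, cols = np.triu_indices(len(df.columns), k=1)
unique_pairs = pd.Series(
    equal[rows, cols],
    index=pd.MultiIndex.from_arrays([df.columns[rows], df.columns[cols]]),
)

print(matrix)
print(unique_pairs)
```

Note that the broadcasted intermediate holds one boolean per column pair per row, so for 100 columns and many rows it may be worth processing the rows in chunks.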