
Finding column names shared across multiple DataFrames

I have multiple datasets that share some column names, as in the example below. Using Python and pandas, I want a list of the columns that appear in every dataset.

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                    'B': 'one one two three two two one three'.split(),
                    'C': np.arange(8),
                    'D': np.arange(8) * 2})
df2 = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                    'B': 'one one two three two two one three'.split(),
                    'C': np.arange(8)})
df3 = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                    'B': 'one one two three two two one three'.split(),
                    'D': np.arange(8) * 2})

In the example above, the three datasets df1, df2 and df3 all contain the columns 'A' and 'B', so the expected output is ['A', 'B']. How can I get this? Thanks in advance.

The columns of a DataFrame are of type pandas.core.indexes.base.Index, so you can use its intersection method to find the overlapping elements. For more than two DataFrames, chain the calls. Here is an example with two DataFrames:

import pandas as pd
import numpy as np

a = np.arange(1,4)
b = np.arange(5,8)
c = np.random.randint(0,10,size=3)
d = np.random.randint(0,10,size=3)
df_1 = pd.DataFrame({'a':a,'b':b,'c':c,'d':d})
df_1

out:

    a   b   c   d
0   1   5   5   1
1   2   6   7   5
2   3   7   6   9

a = np.arange(4,7)
b = np.arange(7,10)
e = np.random.randint(0,10,size=3)
f = np.random.randint(0,10,size=3)
df_2 = pd.DataFrame({'a':a,'b':b,'e':e,'f':f})
df_2

out:

    a   b   e   f
0   4   7   9   9
1   5   8   9   3
2   6   9   2   1

df_1.columns.intersection(df_2.columns)

out:

Index(['a', 'b'], dtype='object')

type(df_1.columns)

out:

pandas.core.indexes.base.Index

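pandas can give you the column names of each DataFrame. For example, df1.columns returns Index(['A', 'B', 'C', 'D'], dtype='object'), and list(df1.columns) gives the plain list ['A', 'B', 'C', 'D']. You can get the column names of every other DataFrame the same way.

Then take the intersection of all these lists, for example by converting them to sets.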
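A minimal sketch of that idea, assuming the three DataFrames from the question. The names frames, common and common_list are only illustrative. Sets do not keep order, so the result is rebuilt in the column order of the first DataFrame:

```python
import numpy as np
import pandas as pd

# The three DataFrames from the question.
df1 = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                    'B': 'one one two three two two one three'.split(),
                    'C': np.arange(8),
                    'D': np.arange(8) * 2})
df2 = df1[['A', 'B', 'C']].copy()
df3 = df1[['A', 'B', 'D']].copy()

frames = [df1, df2, df3]

# set.intersection accepts any number of iterables, so the remaining
# column indexes can be passed in directly.
common = set(frames[0].columns).intersection(*(f.columns for f in frames[1:]))

# Sets are unordered; restore the column order of the first DataFrame.
common_list = [col for col in frames[0].columns if col in common]
print(common_list)
```

With the question's data this should print the two shared names, 'A' and 'B'.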

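The simplest approach is to intersect all the column indexes. Older pandas versions allowed the & operator for this (df1.columns & df2.columns & df3.columns), but using & as a set operation on an Index was deprecated in pandas 1.x and no longer works that way in pandas 2.0 and later, so call intersection explicitly:

a = df1.columns.intersection(df2.columns).intersection(df3.columns)
print(a)
Index(['A', 'B'], dtype='object')

If you need a list:

a = df1.columns.intersection(df2.columns).intersection(df3.columns).tolist()
print(a)
['A', 'B']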
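If the number of DataFrames is not fixed, the same intersection can be folded over a list. This is only a sketch: common_columns is a hypothetical helper name, not part of pandas, and it assumes that an empty input should give an empty list.

```python
from functools import reduce

import numpy as np
import pandas as pd


def common_columns(frames):
    """Return the column names present in every DataFrame in frames.

    Hypothetical helper for illustration; not a pandas function.
    """
    frames = list(frames)
    if not frames:
        # Assumption: no DataFrames means no common columns.
        return []
    # Fold Index.intersection over the column indexes, left to right.
    shared = reduce(lambda acc, cols: acc.intersection(cols),
                    (f.columns for f in frames))
    return shared.tolist()


# The three DataFrames from the question.
df1 = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                    'B': 'one one two three two two one three'.split(),
                    'C': np.arange(8),
                    'D': np.arange(8) * 2})
df2 = df1[['A', 'B', 'C']].copy()
df3 = df1[['A', 'B', 'D']].copy()

result = common_columns([df1, df2, df3])
print(result)
```

Passing a list keeps the call the same whether there are two DataFrames or twenty.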
