
Pandas - Remove Columns based on values in another dataframe columns

I have a pandas dataframe called df_A which, in my real data, has more than 100 columns.

I have another dataframe, df_b, in which two columns tell me which columns I need from df_A.

A reproducible example is given below:

import pandas as pd

d = {'foo':[100, 111, 222], 
     'bar':[333, 444, 555],'foo2':[110, 101, 222], 
     'bar2':[333, 444, 555],'foo3':[100, 111, 222], 
     'bar3':[333, 444, 555]}

df_A = pd.DataFrame(d)

d = {'ReqCol_A':['foo','foo2'], 
     'bar':[333, 444],'foo2':[100, 111], 
     'bar2':[333, 444],'ReqCol_B':['bar3', ''], 
     'bar3':[333, 444]}

df_b = pd.DataFrame(d)

As can be seen in df_b in the example above, the values under ReqCol_A and ReqCol_B are the columns I am trying to get from df_A.

So my expected output will have three columns from df_A: foo, foo2 and bar3.

df_C is the expected output, and it will look like this:

df_C
foo foo2 bar3
100 110  333
111 101  444
222 222  555

Please help me with this. I am struggling to get this.

Try using filter to get only the columns whose names contain 'ReqCol', replace the empty strings with NaN, then stack to get a list of names and use it to select from df_A:

import numpy as np

df_A[df_b.filter(like='ReqCol').replace('', np.nan).stack().tolist()]

Output:

   foo  bar3  foo2
0  100   333   110
1  111   444   101
2  222   555   222
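The same selection can also be sketched without stack and without numpy. This is a minimal illustration using the frames from the question, not part of the answer above; the names requested, wanted and df_C are chosen here for the example. It keeps only values that are real df_A column names, so empty strings and NaN are skipped.

```python
import pandas as pd

df_A = pd.DataFrame({'foo': [100, 111, 222], 'bar': [333, 444, 555],
                     'foo2': [110, 101, 222], 'bar2': [333, 444, 555],
                     'foo3': [100, 111, 222], 'bar3': [333, 444, 555]})

df_b = pd.DataFrame({'ReqCol_A': ['foo', 'foo2'], 'bar': [333, 444],
                     'foo2': [100, 111], 'bar2': [333, 444],
                     'ReqCol_B': ['bar3', ''], 'bar3': [333, 444]})

# Flatten the ReqCol_* columns row by row into one array of candidate names.
requested = df_b.filter(like='ReqCol').to_numpy().ravel()

# dict.fromkeys removes duplicates while keeping first-seen order;
# the membership test discards '' and anything else that is not a df_A column.
wanted = [name for name in dict.fromkeys(requested) if name in df_A.columns]

df_C = df_A[wanted]
print(df_C)
```

The columns come out in the order the names are encountered in df_b, which for this example is foo, bar3, foo2.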

Solution:

# Collect the unique values from the df_b columns ReqCol_A and ReqCol_B (this may also include NaN, '' and other unwanted values)
features = set(df_b.ReqCol_A.unique()) | set(df_b.ReqCol_B.unique())

# Intersect with the df_A column names to keep only the names that actually exist there
target_features = set(df_A.columns) & features

# Get the output. Recent pandas versions do not accept a set as an indexer, so convert it to a list;
# sets are unordered, so the column order of the result is not guaranteed
df_A.loc[:, list(target_features)]
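If the column order matters, one option is to replace the set intersection with a boolean mask over df_A.columns. The sketch below is an assumption about what is wanted rather than part of the answer above: it returns the columns in df_A's own order, which happens to match the df_C layout shown in the question. The names mask and df_C are illustrative.

```python
import pandas as pd

df_A = pd.DataFrame({'foo': [100, 111, 222], 'bar': [333, 444, 555],
                     'foo2': [110, 101, 222], 'bar2': [333, 444, 555],
                     'foo3': [100, 111, 222], 'bar3': [333, 444, 555]})

df_b = pd.DataFrame({'ReqCol_A': ['foo', 'foo2'], 'bar': [333, 444],
                     'foo2': [100, 111], 'bar2': [333, 444],
                     'ReqCol_B': ['bar3', ''], 'bar3': [333, 444]})

# All requested names, including the unwanted '' entry.
features = set(df_b['ReqCol_A']) | set(df_b['ReqCol_B'])

# True for each df_A column whose name was requested; '' matches nothing.
mask = df_A.columns.isin(features)

# Boolean selection keeps the columns in df_A's original order.
df_C = df_A.loc[:, mask]
print(df_C)
```

Because the mask is built from df_A.columns, names in df_b that do not exist in df_A are ignored instead of raising a KeyError.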

Performance comparison

Set-based method (this answer):

%%timeit
features = set(df_b.ReqCol_A.unique()) | set(df_b.ReqCol_B.unique())
target_features = set(df_A.columns) & features
df_A.loc[:, list(target_features)]
875 µs ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Other answer (using filter):

%%timeit 
df_A[df_b.filter(like='ReqCol').replace('', np.nan).stack().tolist()]
2.14 ms ± 51.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

On this small example, the set-based method is roughly 2.4 times faster than the filter-based one (875 µs vs. 2.14 ms per loop).
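The timings above come from a three-row example inside IPython, so they may not carry over to the real 100-column frame. A rough way to repeat the comparison outside a notebook is sketched below with the standard timeit module. The function names select_by_set and select_by_filter are hypothetical wrappers written for this sketch, and both bodies are slight variants of the two answers (a boolean mask instead of a set indexer, and an explicit dropna after stack), so the numbers will not be identical to those quoted.

```python
import timeit

import numpy as np
import pandas as pd

df_A = pd.DataFrame({'foo': [100, 111, 222], 'bar': [333, 444, 555],
                     'foo2': [110, 101, 222], 'bar2': [333, 444, 555],
                     'foo3': [100, 111, 222], 'bar3': [333, 444, 555]})

df_b = pd.DataFrame({'ReqCol_A': ['foo', 'foo2'], 'bar': [333, 444],
                     'foo2': [100, 111], 'bar2': [333, 444],
                     'ReqCol_B': ['bar3', ''], 'bar3': [333, 444]})


def select_by_set(df_A, df_b):
    # Set of requested names, then a boolean mask over df_A's columns.
    features = set(df_b['ReqCol_A']) | set(df_b['ReqCol_B'])
    return df_A.loc[:, df_A.columns.isin(features)]


def select_by_filter(df_A, df_b):
    # filter + stack, with an explicit dropna so NaN never reaches the indexer.
    names = (df_b.filter(like='ReqCol')
                 .replace('', np.nan)
                 .stack()
                 .dropna()
                 .tolist())
    return df_A[names]


runs = 200
for fn in (select_by_set, select_by_filter):
    seconds = timeit.timeit(lambda: fn(df_A, df_b), number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1e6:.0f} µs per call")
```

Swapping in the real df_A and df_b gives a more representative comparison than the toy frames.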
