I have a pandas dataframe called df_A, which in practice has more than 100 columns.
I also have another dataframe, df_b, in which two columns tell me which columns I need from df_A.
A reproducible example is given below:
import pandas as pd

d = {'foo': [100, 111, 222], 'bar': [333, 444, 555],
     'foo2': [110, 101, 222], 'bar2': [333, 444, 555],
     'foo3': [100, 111, 222], 'bar3': [333, 444, 555]}
df_A = pd.DataFrame(d)

d = {'ReqCol_A': ['foo', 'foo2'], 'bar': [333, 444],
     'foo2': [100, 111], 'bar2': [333, 444],
     'ReqCol_B': ['bar3', ''], 'bar3': [333, 444]}
df_b = pd.DataFrame(d)
As can be seen in df_b in the above example, the values under ReqCol_A and ReqCol_B are the columns I am trying to get from df_A, so my expected output will have three columns from df_A: foo, foo2 and bar3.
The expected output, df_C, will look like this:
foo foo2 bar3
100 110 333
111 101 444
222 222 555
Please help me with this. I am struggling to get this.
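Try using filter to keep only the columns whose names contain 'ReqCol', replace the empty strings with NaN, then stack to drop them and get a flat list, and use that list to select the columns of df_A:
import numpy as np
df_A[df_b.filter(like='ReqCol').replace('', np.nan).stack().tolist()]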
Output:
   foo  bar3  foo2
0  100   333   110
1  111   444   101
2  222   555   222
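To make the chained one-liner easier to follow, here is a minimal step-by-step sketch of the same idea on the example data. The intermediate names req and wanted are only illustrative, and the extra dropna() is an assumption added so that the result does not depend on whether stack() drops missing values by itself in the installed pandas version.

```python
import numpy as np
import pandas as pd

df_A = pd.DataFrame({'foo': [100, 111, 222], 'bar': [333, 444, 555],
                     'foo2': [110, 101, 222], 'bar2': [333, 444, 555],
                     'foo3': [100, 111, 222], 'bar3': [333, 444, 555]})
df_b = pd.DataFrame({'ReqCol_A': ['foo', 'foo2'], 'bar': [333, 444],
                     'foo2': [100, 111], 'bar2': [333, 444],
                     'ReqCol_B': ['bar3', ''], 'bar3': [333, 444]})

# Step 1: keep only the columns whose name contains 'ReqCol'.
req = df_b.filter(like='ReqCol')

# Step 2: turn empty strings into missing values so they can be dropped.
req = req.replace('', np.nan)

# Step 3: flatten row by row into one Series, drop the missing entries,
# and convert to a plain list of column names.
wanted = req.stack().dropna().tolist()

# Step 4: select those columns from df_A.
df_C = df_A[wanted]
print(df_C)
```

Because stack() walks df_b row by row, the columns come out in the order foo, bar3, foo2 rather than foo, foo2, bar3.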
Solution:
# Collect the unique values from df_b's ReqCol_A and ReqCol_B columns (this may also include empty strings, NaN and other unwanted values)
features = set(df_b.ReqCol_A.unique()) | set(df_b.ReqCol_B.unique())
# Intersect with df_A's column names to keep only the names that actually exist in df_A
target_features = set(df_A.columns) & features
# Get the output (recent pandas versions reject a set as an indexer, so convert it to a list; the column order is not guaranteed)
df_A.loc[:, list(target_features)]
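Since a set has no defined order, the columns selected this way can come back in any order. If the order from the question (foo, foo2, bar3) matters, one possible variant is sketched below. It assumes the desired order is "all of ReqCol_A, then all of ReqCol_B"; the names requested, available and cols are illustrative only.

```python
import pandas as pd

df_A = pd.DataFrame({'foo': [100, 111, 222], 'bar': [333, 444, 555],
                     'foo2': [110, 101, 222], 'bar2': [333, 444, 555],
                     'foo3': [100, 111, 222], 'bar3': [333, 444, 555]})
df_b = pd.DataFrame({'ReqCol_A': ['foo', 'foo2'],
                     'ReqCol_B': ['bar3', '']})

# Requested names in a fixed order: ReqCol_A first, then ReqCol_B.
requested = pd.concat([df_b['ReqCol_A'], df_b['ReqCol_B']]).tolist()

# Keep only names that exist in df_A (this discards '' and NaN),
# and use dict.fromkeys to remove duplicates while preserving order.
available = set(df_A.columns)
cols = list(dict.fromkeys(c for c in requested if c in available))

df_C = df_A[cols]
print(df_C)
```

The membership test plays the same role as the set intersection above, but the final selection uses a list, so the result is deterministic.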
Performance comparison
Given method:
%%timeit
features = set(df_b.ReqCol_A.unique()) | set(df_b.ReqCol_B.unique())
target_features = set(df_A.columns) & features
df_A.loc[:,target_features]
875 µs ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Second answer (using filter):
%%timeit
df_A[df_b.filter(like='ReqCol').replace('', np.nan).stack().tolist()]
2.14 ms ± 51.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
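In this measurement on the small example, the given (set-based) method is about 2.4 times faster than the filter-based one (875 µs versus 2.14 ms per loop).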
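The timings above come from the three-row example, so they may not carry over to a frame with 100+ columns. A rough sketch for repeating the comparison outside IPython with the standard timeit module follows. The 120-column frame, the column names c0 to c119 and the function names by_sets and by_filter are all made up for illustration, and the numbers it prints will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical wider data: 1000 rows, 120 columns named c0..c119.
df_A = pd.DataFrame(np.arange(120_000).reshape(1000, 120),
                    columns=[f'c{i}' for i in range(120)])
df_b = pd.DataFrame({'ReqCol_A': ['c3', 'c50', 'c7'],
                     'ReqCol_B': ['c119', '', 'c20']})

def by_sets():
    # Set-based approach: union of requested names, intersected with df_A's columns.
    features = set(df_b.ReqCol_A.unique()) | set(df_b.ReqCol_B.unique())
    return df_A.loc[:, list(set(df_A.columns) & features)]

def by_filter():
    # filter/stack approach: blanks become NaN and are dropped.
    names = (df_b.filter(like='ReqCol')
                 .replace('', np.nan)
                 .stack()
                 .dropna()
                 .tolist())
    return df_A[names]

runs = 200
for fn in (by_sets, by_filter):
    seconds = timeit.timeit(fn, number=runs)
    print(f'{fn.__name__}: {seconds / runs * 1e6:.0f} µs per call')
```

Both functions select the same set of columns here, although possibly in a different order.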