Pandas - Remove columns based on values in another dataframe's columns
I have a pandas dataframe called df_A, which in practice has more than 100 columns.
I also have another dataframe, df_b, in which two columns tell me which columns I need from df_A.
A reproducible example is given below:
import pandas as pd
d = {'foo':[100, 111, 222],
'bar':[333, 444, 555],'foo2':[110, 101, 222],
'bar2':[333, 444, 555],'foo3':[100, 111, 222],
'bar3':[333, 444, 555]}
df_A = pd.DataFrame(d)
d = {'ReqCol_A':['foo','foo2'],
'bar':[333, 444],'foo2':[100, 111],
'bar2':[333, 444],'ReqCol_B':['bar3', ''],
'bar3':[333, 444]}
df_b = pd.DataFrame(d)
As can be seen from df_b in the example above, the values under ReqCol_A and ReqCol_B are the names of the columns I am trying to get from df_A.
So my expected output will have three columns from df_A: foo, foo2 and bar3.
df_C will be the expected output, and it will look like this:
df_C
foo foo2 bar3
100 110 333
111 101 444
222 222 555
Please help me with this. I am struggling to get it to work.
Try using filter to get only those columns of df_b with 'ReqCol' in their name, then stack to flatten them into a list, and use that list to select columns from df_A:
import numpy as np
df_A[df_b.filter(like='ReqCol').replace('', np.nan).stack().tolist()]
Output:
   foo  bar3  foo2
0  100   333   110
1  111   444   101
2  222   555   222
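For reference, the filter-and-stack idea above can be written as a self-contained script. This is only a sketch built on the question's example data; the explicit dropna() and unique() calls and the names wanted and df_C are additions that are not part of the original answer. They are meant to keep the selection working if stack() does not drop missing values on its own and if the same column name is requested twice.

```python
import numpy as np
import pandas as pd

# Example data from the question.
df_A = pd.DataFrame({
    'foo': [100, 111, 222], 'bar': [333, 444, 555],
    'foo2': [110, 101, 222], 'bar2': [333, 444, 555],
    'foo3': [100, 111, 222], 'bar3': [333, 444, 555],
})
df_b = pd.DataFrame({
    'ReqCol_A': ['foo', 'foo2'], 'bar': [333, 444],
    'foo2': [100, 111], 'bar2': [333, 444],
    'ReqCol_B': ['bar3', ''], 'bar3': [333, 444],
})

# Collect the requested column names, treating '' as missing.
wanted = (
    df_b.filter(like='ReqCol')   # keep only ReqCol_A and ReqCol_B
        .replace('', np.nan)     # empty strings become NaN
        .stack()                 # flatten both columns into one Series
        .dropna()                # assumption: drop NaN explicitly
        .unique()                # assumption: ignore repeated names
        .tolist()
)

df_C = df_A[wanted]
print(df_C)
```

Note that a name listed in df_b that is not a column of df_A would raise a KeyError here, so this variant assumes every requested name exists.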
Solution:
# Collect the unique values from the ReqCol_A and ReqCol_B columns of df_b (this may also include '', NaN or other unwanted values)
features = set(df_b.ReqCol_A.unique()) | set(df_b.ReqCol_B.unique())
# Intersect with the column names of df_A to keep only the names that actually exist there
target_features = set(df_A.columns) & features
# Get the output; the set is converted to a list because recent pandas versions reject a set as an indexer (column order is not guaranteed)
df_A.loc[:, list(target_features)]
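Because a set has no defined order, the columns selected this way may not come back in the order shown in the question's expected output. One possible variant, sketched below on the question's example data, builds a boolean mask with Index.isin so that the result keeps the column order of df_A. The names requested, mask and df_C are illustrative and not taken from the original answer.

```python
import pandas as pd

# Example data from the question.
df_A = pd.DataFrame({
    'foo': [100, 111, 222], 'bar': [333, 444, 555],
    'foo2': [110, 101, 222], 'bar2': [333, 444, 555],
    'foo3': [100, 111, 222], 'bar3': [333, 444, 555],
})
df_b = pd.DataFrame({
    'ReqCol_A': ['foo', 'foo2'], 'bar': [333, 444],
    'foo2': [100, 111], 'bar2': [333, 444],
    'ReqCol_B': ['bar3', ''], 'bar3': [333, 444],
})

# All values listed in the two ReqCol columns (may include '' or NaN).
requested = set(df_b['ReqCol_A']) | set(df_b['ReqCol_B'])

# True for every column of df_A whose name was requested.
# Values such as '' simply match nothing, so no cleanup is needed.
mask = df_A.columns.isin(requested)

# Boolean selection keeps the original column order of df_A.
df_C = df_A.loc[:, mask]
print(df_C)
```

With a mask, names in df_b that do not exist in df_A are silently ignored rather than raising an error, which may or may not be the behaviour you want.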
Performance comparison
Given method:
%%timeit
features = set(df_b.ReqCol_A.unique()) | set(df_b.ReqCol_B.unique())
target_features = set(df_A.columns) & features
df_A.loc[:, list(target_features)]
875 µs ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Second answer (using filter):
%%timeit
df_A[df_b.filter(like='ReqCol').replace('', np.nan).stack().tolist()]
2.14 ms ± 51.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
On this small example, the given method is about 2.4 times faster than the filter-based one (875 µs versus 2.14 ms per loop).
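The %%timeit cells above only run inside IPython or Jupyter. As a rough sketch, the same comparison can be reproduced in a plain script with the standard timeit module. The function names, the explicit dropna() call and the repeat count n are assumptions made for illustration, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Example data from the question.
df_A = pd.DataFrame({
    'foo': [100, 111, 222], 'bar': [333, 444, 555],
    'foo2': [110, 101, 222], 'bar2': [333, 444, 555],
    'foo3': [100, 111, 222], 'bar3': [333, 444, 555],
})
df_b = pd.DataFrame({
    'ReqCol_A': ['foo', 'foo2'], 'bar': [333, 444],
    'foo2': [100, 111], 'bar2': [333, 444],
    'ReqCol_B': ['bar3', ''], 'bar3': [333, 444],
})


def by_set_intersection():
    # Set-based approach; the set is converted to a list before indexing.
    features = set(df_b.ReqCol_A.unique()) | set(df_b.ReqCol_B.unique())
    target_features = set(df_A.columns) & features
    return df_A.loc[:, list(target_features)]


def by_filter_stack():
    # filter + stack approach, with an explicit dropna() for safety.
    names = (df_b.filter(like='ReqCol').replace('', np.nan)
                 .stack().dropna().tolist())
    return df_A[names]


n = 200  # number of calls per measurement (arbitrary choice)
t_set = timeit.timeit(by_set_intersection, number=n) / n
t_stack = timeit.timeit(by_filter_stack, number=n) / n
print(f"set intersection: {t_set * 1e6:.0f} µs per call")
print(f"filter + stack:   {t_stack * 1e6:.0f} µs per call")
```

Timings on a three-row frame mostly measure fixed pandas overhead, so it is worth repeating the measurement on data of realistic size before drawing conclusions.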