Pandas - Drop columns based on values in another dataframe's columns

Question

I have a pandas dataframe called df_A, which in practice has more than 100 columns.

I also have another dataframe, df_b, in which two columns tell me which columns I need from df_A.

A reproducible example is given below:

import pandas as pd

d = {'foo':[100, 111, 222], 
     'bar':[333, 444, 555],'foo2':[110, 101, 222], 
     'bar2':[333, 444, 555],'foo3':[100, 111, 222], 
     'bar3':[333, 444, 555]}

df_A = pd.DataFrame(d)

d = {'ReqCol_A':['foo','foo2'], 
     'bar':[333, 444],'foo2':[100, 111], 
     'bar2':[333, 444],'ReqCol_B':['bar3', ''], 
     'bar3':[333, 444]}

df_b = pd.DataFrame(d)

As can be seen from df_b in the above example, the values under ReqCol_A and ReqCol_B are the columns I am trying to get from df_A.

So my expected output will have three columns from df_A: foo, foo2 and bar3.

df_C will be the expected output, and it will look like this:

df_C
foo foo2 bar3
100 110  333
111 101  444
222 222  555

Please help me with this. I am struggling to get it working.

Answer 1

Try using filter to get only those columns whose names contain 'ReqCol', then stack to flatten their values into a list, and use that list to select columns from df_A:

import numpy as np

df_A[df_b.filter(like='ReqCol').replace('', np.nan).stack().tolist()]

Output:

   foo  bar3  foo2
0  100   333   110
1  111   444   101
2  222   555   222
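
For reference, here is a self-contained sketch of the same idea. It is not the exact one-liner above: it assumes flattening with to_numpy().ravel() in place of stack(), so that blank and missing entries are filtered out explicitly instead of relying on stack() to drop NaN. The name wanted is introduced only for illustration.

```python
import pandas as pd

df_A = pd.DataFrame({
    'foo': [100, 111, 222], 'bar': [333, 444, 555],
    'foo2': [110, 101, 222], 'bar2': [333, 444, 555],
    'foo3': [100, 111, 222], 'bar3': [333, 444, 555],
})
df_b = pd.DataFrame({
    'ReqCol_A': ['foo', 'foo2'], 'bar': [333, 444],
    'foo2': [100, 111], 'bar2': [333, 444],
    'ReqCol_B': ['bar3', ''], 'bar3': [333, 444],
})

# Keep only the columns whose name contains 'ReqCol', then flatten
# their values row by row into a 1-D array.
requested = df_b.filter(like='ReqCol').to_numpy().ravel()

# Drop empty strings and anything that is not a string (e.g. NaN)
# before using the names to index df_A.
wanted = [name for name in requested if isinstance(name, str) and name]

df_C = df_A[wanted]
print(df_C)
```

The columns come out in the order the names are encountered in df_b (row by row), not in df_A's original order.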

Answer 2

Solution:

# Collect the unique values from df_b's ReqCol_A and ReqCol_B columns; this may also pick up empty strings, NaN or other unwanted values
features = set(df_b.ReqCol_A.unique()) | set(df_b.ReqCol_B.unique())

# Intersect with df_A's column names to keep only the names that actually exist in df_A
target_features = set(df_A.columns) & features

# Select the output columns (convert the set to a list, since recent pandas versions reject a set as an indexer)
df_A.loc[:, list(target_features)]
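
A set has no guaranteed order, so the selected columns may not come back in df_A's order. One possible variant, assuming Index.isin is an acceptable substitute for the set intersection, builds a boolean mask over df_A.columns, which keeps the original column order:

```python
import pandas as pd

df_A = pd.DataFrame({
    'foo': [100, 111, 222], 'bar': [333, 444, 555],
    'foo2': [110, 101, 222], 'bar2': [333, 444, 555],
    'foo3': [100, 111, 222], 'bar3': [333, 444, 555],
})
df_b = pd.DataFrame({
    'ReqCol_A': ['foo', 'foo2'],
    'ReqCol_B': ['bar3', ''],
})

# Collect every requested name; the empty string is harmless here
# because it never matches a column of df_A.
features = set(df_b['ReqCol_A']) | set(df_b['ReqCol_B'])

# Boolean mask over df_A's columns: True where the name was requested.
mask = df_A.columns.isin(features)

# Selecting with the mask preserves df_A's column order.
df_C = df_A.loc[:, mask]
print(df_C)
```

With the example data this gives the columns foo, foo2 and bar3, in the order asked for in the question.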

Performance comparison

This answer's method:

%%timeit
features = set(df_b.ReqCol_A.unique()) | set(df_b.ReqCol_B.unique())
target_features = set(df_A.columns) & features
df_A.loc[:, list(target_features)]
875 µs ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Other answer (using filter):

%%timeit 
df_A[df_b.filter(like='ReqCol').replace('', np.nan).stack().tolist()]
2.14 ms ± 51.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

On this small example, this answer's method is about 2.4 times faster than the filter-based one (875 µs vs. 2.14 ms per loop).
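
The %%timeit cells above only run inside IPython or Jupyter. As a rough sketch, the comparison could be reproduced in a plain script with the standard-library timeit module. The helper functions select_with_sets and select_with_filter are hypothetical names wrapping variants of the two approaches, not the exact code timed above, and the numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

df_A = pd.DataFrame({
    'foo': [100, 111, 222], 'bar': [333, 444, 555],
    'foo2': [110, 101, 222], 'bar2': [333, 444, 555],
    'foo3': [100, 111, 222], 'bar3': [333, 444, 555],
})
df_b = pd.DataFrame({
    'ReqCol_A': ['foo', 'foo2'],
    'ReqCol_B': ['bar3', ''],
})


def select_with_sets():
    # Set-based variant: union of requested names, then a column mask.
    features = set(df_b['ReqCol_A']) | set(df_b['ReqCol_B'])
    return df_A.loc[:, df_A.columns.isin(features)]


def select_with_filter():
    # filter-based variant: flatten the 'ReqCol' columns, skip blanks.
    names = df_b.filter(like='ReqCol').to_numpy().ravel()
    return df_A[[n for n in names if isinstance(n, str) and n]]


runs = 200  # assumed repetition count; raise it for steadier numbers
timings = {}
for func in (select_with_sets, select_with_filter):
    total_seconds = timeit.timeit(func, number=runs)
    timings[func.__name__] = total_seconds / runs
    print(f"{func.__name__}: {timings[func.__name__] * 1e6:.1f} µs per call")
```

A three-row frame mostly measures fixed pandas overhead, so it is worth repeating the measurement on data closer to the real 100-plus-column case before drawing conclusions.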

2 answers

Answer 1
3 2019-01-17 19:53:09

Answer 2
2 accepted 2019-01-17 19:52:12