
Pandas - Remove Columns based on values in another dataframe columns

I have a pandas dataframe called df_A which, in practice, has more than 100 columns.

I also have another dataframe, df_b, in which two columns tell me which columns I need from df_A.

A reproducible example is given below:

import pandas as pd

d = {'foo':[100, 111, 222], 
     'bar':[333, 444, 555],'foo2':[110, 101, 222], 
     'bar2':[333, 444, 555],'foo3':[100, 111, 222], 
     'bar3':[333, 444, 555]}

df_A = pd.DataFrame(d)

d = {'ReqCol_A':['foo','foo2'], 
     'bar':[333, 444],'foo2':[100, 111], 
     'bar2':[333, 444],'ReqCol_B':['bar3', ''], 
     'bar3':[333, 444]}

df_b = pd.DataFrame(d)

As can be seen from df_b in the above example, the values under ReqCol_A and ReqCol_B are the columns I am trying to get from df_A.

So my expected output will have three columns from df_A: foo, foo2 and bar3.

df_C is the expected output, and it will look like this:

df_C
foo foo2 bar3
100 110  333
111 101  444
222 222  555

Please help me with this. I am struggling to get it working.

Try using filter to get only those columns with 'ReqCol' in their name, then stack to get a list of values, and use that list to select columns from the df_A dataframe:

import numpy as np

df_A[df_b.filter(like='ReqCol').replace('', np.nan).stack().tolist()]

Output:

   foo  bar3  foo2
0  100   333   110
1  111   444   101
2  222   555   222
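
To see what each step of the one-liner contributes, the chain can be unpacked as in the sketch below. It rebuilds only the relevant columns of the question's frames, and the intermediate names (`req`, `wanted`) are illustrative rather than taken from the answer. It also calls `dropna()` explicitly, on the assumption that one would rather not rely on `stack()` discarding the missing values by itself.

```python
import numpy as np
import pandas as pd

df_A = pd.DataFrame({
    'foo': [100, 111, 222], 'bar': [333, 444, 555],
    'foo2': [110, 101, 222], 'bar2': [333, 444, 555],
    'foo3': [100, 111, 222], 'bar3': [333, 444, 555],
})
# Only the two "required column" columns of df_b matter here.
df_b = pd.DataFrame({
    'ReqCol_A': ['foo', 'foo2'],
    'ReqCol_B': ['bar3', ''],
})

# Step 1: keep only the columns whose name contains 'ReqCol'.
req = df_b.filter(like='ReqCol')

# Step 2: turn empty strings into NaN so they can be dropped.
req = req.replace('', np.nan)

# Step 3: stack into one long Series (row by row), drop missing
# entries explicitly, and convert to a plain list of column names.
wanted = req.stack().dropna().tolist()

# Step 4: select those columns from df_A.
df_C = df_A[wanted]
print(df_C)
```

The column order follows the row-by-row order in which `stack()` walks df_b, which is why bar3 appears before foo2.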

Solution:

# Collect the unique values from the df_b columns ReqCol_A and ReqCol_B (this may also include NaN or other unwanted values)
features = set(df_b.ReqCol_A.unique()) | set(df_b.ReqCol_B.unique())

# Intersect with the df_A column names to keep only the names that actually exist in df_A
target_features = set(df_A.columns) & features

# Get the output (converted to a list, since recent pandas versions do not accept a set as an indexer)
df_A.loc[:, list(target_features)]
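
Because Python sets are unordered, the column order of the result above is not guaranteed to match df_A. If the order matters, one possible variant (a sketch, not part of the original answer) keeps the set for the membership test but walks `df_A.columns` to build an ordered list. The name `ordered_cols` is illustrative.

```python
import pandas as pd

df_A = pd.DataFrame({
    'foo': [100, 111, 222], 'bar': [333, 444, 555],
    'foo2': [110, 101, 222], 'bar2': [333, 444, 555],
    'foo3': [100, 111, 222], 'bar3': [333, 444, 555],
})
df_b = pd.DataFrame({
    'ReqCol_A': ['foo', 'foo2'],
    'ReqCol_B': ['bar3', ''],
})

# All requested names; the empty string is harmless because it is
# not a column of df_A and so never passes the membership test.
features = set(df_b['ReqCol_A']) | set(df_b['ReqCol_B'])

# Walk df_A's columns in their original order and keep the requested ones.
ordered_cols = [col for col in df_A.columns if col in features]

df_C = df_A.loc[:, ordered_cols]
print(df_C)
```

With the example data this yields the columns in the order foo, foo2, bar3, matching the expected df_C in the question.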

Performance comparison

This answer's method (set intersection):

%%timeit
features = set(df_b.ReqCol_A.unique()) | set(df_b.ReqCol_B.unique())
target_features = set(df_A.columns) & features
df_A.loc[:,target_features]
875 µs ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Other answer (using filter):

%%timeit 
df_A[df_b.filter(like='ReqCol').replace('', np.nan).stack().tolist()]
2.14 ms ± 51.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

On this small example, the set-based method is roughly 2.4 times faster than the filter-based one (875 µs vs. 2.14 ms per loop).
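
The `%%timeit` cells above only run inside IPython or Jupyter. A rough way to repeat the comparison in a plain Python script is sketched below with the standard-library `timeit` module. The function names `by_set` and `by_filter` and the repeat count `n` are illustrative; `list(...)` is used for the column selection because recent pandas versions reject a set as an indexer, and an explicit `dropna()` is added to the filter variant. Absolute timings will differ by machine and pandas version, so no numbers are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

df_A = pd.DataFrame({
    'foo': [100, 111, 222], 'bar': [333, 444, 555],
    'foo2': [110, 101, 222], 'bar2': [333, 444, 555],
    'foo3': [100, 111, 222], 'bar3': [333, 444, 555],
})
df_b = pd.DataFrame({
    'ReqCol_A': ['foo', 'foo2'],
    'ReqCol_B': ['bar3', ''],
})

def by_set():
    # Set-intersection approach.
    features = set(df_b.ReqCol_A.unique()) | set(df_b.ReqCol_B.unique())
    target_features = set(df_A.columns) & features
    return df_A.loc[:, list(target_features)]

def by_filter():
    # filter / replace / stack approach.
    names = (df_b.filter(like='ReqCol')
                 .replace('', np.nan)
                 .stack()
                 .dropna()
                 .tolist())
    return df_A[names]

n = 200  # number of calls per measurement (arbitrary choice)
t_set = timeit.timeit(by_set, number=n) / n
t_filter = timeit.timeit(by_filter, number=n) / n
print(f"set-based:    {t_set * 1e6:.0f} us per call")
print(f"filter-based: {t_filter * 1e6:.0f} us per call")
```

Both functions select the same three columns; only the column order may differ between them.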
