
Filtering a large Pandas DataFrame based on a list of strings in column names

Stack Overflow Family,

I have recently started learning Python and am using Pandas to handle some factory data. The csv file is essentially a large dataframe (1621 rows × 5633 columns). While I need all the rows, since they are the data for each unit, I need to filter out many unwanted columns. I have identified a list of strings in these column names that I can use to find only the wanted columns; however, I am not able to figure out what a good approach would be, or whether there are any built-in Python functions for this.

dropna is not an option for me, as some of the wanted columns have NA as values (for example, test limits). dropna for columns that are entirely NA is also not good enough, since I will still end up with a large number of columns.

Looking for some guidance here. Thank you for your time.

EDIT: Given the time complexity of my previous solution, I came up with a way to use a list comprehension:

fruits = ["apple", "banana", "cherry", "kiwi", "mango"]
app = ["app", "ban"]
new_list = [x for x in fruits if any(y in x for y in app)]

output: ['apple', 'banana']

This should only display the columns you need. In your case you just need to do:

my_strings = ["A", "B", ...]
new_list = [x for x in df.columns if any(y in x for y in my_strings)]
print(new_list)
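To make this concrete, here is a minimal sketch using a hypothetical toy DataFrame in place of the large factory dataset (the column names and substrings are made up for illustration). Once the list comprehension has produced the matching names, plain indexing keeps just those columns:

```python
import pandas as pd

# Hypothetical toy DataFrame standing in for the 1621 x 5633 factory data
df = pd.DataFrame({
    "temp_A": [1, 2],
    "pressure_B": [3, 4],
    "noise_Z": [5, 6],
})

my_strings = ["A", "B"]  # substrings identifying the wanted columns
new_list = [x for x in df.columns if any(y in x for y in my_strings)]
filtered = df[new_list]  # keep only the matching columns
print(filtered.columns.tolist())  # ['temp_A', 'pressure_B']
```

Note that this keeps every column whose name contains any of the substrings anywhere, so short substrings like "A" can match more columns than intended.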

If you know the column names exactly, you can do something like this:

unwanted_cols = ['col1', 'col4']  # list of unwanted column names

df_cleaned = current_df.drop(unwanted_cols, axis=1)

# or 

current_df.drop(unwanted_cols, inplace=True, axis=1)
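A small runnable sketch of the same idea, using a hypothetical four-column DataFrame; passing errors="ignore" to drop makes it skip any listed name that is not actually present, instead of raising a KeyError:

```python
import pandas as pd

# Hypothetical example frame; the column names are made up
df = pd.DataFrame({"col1": [1], "col2": [2], "col3": [3], "col4": [4]})
unwanted_cols = ["col1", "col4"]

# errors="ignore" skips names not present instead of raising a KeyError
df_cleaned = df.drop(columns=unwanted_cols, errors="ignore")
print(df_cleaned.columns.tolist())  # ['col2', 'col3']
```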

If you don't know the exact column names, you can first retrieve all of them:

all_cols = current_df.columns.tolist()

and then apply a regex to all of the column names to obtain those that match your list of strings, and apply the same drop code as above.
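The two steps above can be sketched as follows; the column names and the pattern are hypothetical stand-ins for your real list of strings:

```python
import re
import pandas as pd

# Hypothetical frame; in practice this is your large factory DataFrame
df = pd.DataFrame({"sensor_A1": [1], "sensor_B2": [2], "debug_log": [3]})
all_cols = df.columns.tolist()

# Assumed pattern: keep columns whose name contains "A" or "B"
pattern = re.compile("A|B")
wanted = [c for c in all_cols if pattern.search(c)]
unwanted = [c for c in all_cols if c not in wanted]

# Same drop call as above, applied to the regex-derived list
df_cleaned = df.drop(unwanted, axis=1)
print(df_cleaned.columns.tolist())  # ['sensor_A1', 'sensor_B2']
```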

You can also drop columns from a dataframe by applying str.contains with a regular expression to the column names. Below is an example:

df.drop(df.columns[df.columns.str.contains('^abc')], axis=1)
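A self-contained sketch of this one-liner, with made-up column names; df.columns.str.contains('^abc') produces a boolean mask selecting names that start with "abc", and those columns are then dropped:

```python
import pandas as pd

# Hypothetical columns; two start with "abc", one does not
df = pd.DataFrame({"abc_1": [1], "abc_2": [2], "xyz": [3]})

# Drop every column whose name matches the regex ^abc
df2 = df.drop(df.columns[df.columns.str.contains("^abc")], axis=1)
print(df2.columns.tolist())  # ['xyz']
```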

If you have a list of valid columns, you can just use df.filter(cols_subset, axis=1) to drop everything else. You can also use a regex to match substrings from your list anywhere in the column names:

df.filter(regex='|'.join(cols_subset), axis=1)

Or you could match only columns starting with a substring from your list:

df.filter(regex='^('+'|'.join(cols_subset)+')', axis=1)
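Here is a minimal sketch contrasting the two filter calls, with hypothetical column names chosen so the results differ: "avg_volt" contains "volt" but does not start with it, so only the first pattern keeps it:

```python
import pandas as pd

# Hypothetical frame; names chosen to distinguish the two patterns
df = pd.DataFrame({
    "volt_min": [1], "temp_avg": [2], "avg_volt": [3], "misc": [4],
})
cols_subset = ["volt", "temp"]

# Keep columns containing any substring from the list, anywhere
sub = df.filter(regex="|".join(cols_subset), axis=1)
print(sub.columns.tolist())  # ['volt_min', 'temp_avg', 'avg_volt']

# Keep only columns that start with one of the substrings
start = df.filter(regex="^(" + "|".join(cols_subset) + ")", axis=1)
print(start.columns.tolist())  # ['volt_min', 'temp_avg']
```

If your substrings may contain regex metacharacters, escape them first (e.g. with re.escape) before joining them into the pattern.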
