
Filter a pandas dataframe on multiple columns for partial string match, using values from a dict

I need to filter a dataframe on multiple values from a dict.

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/gapminderDataFiveYear.csv')
filters_raw = {'continent': {'filterTerm': 'Asi', 'column': {'rowType': 'filter', 'key': 'continent', 'name': 'continent', 'editable': True, 'sortable': True, 'resizable': True, 'filterable': True, 'width': 147, 'left': 60}}, 'gdpPercap': {'filterTerm': '9', 'column': {'rowType': 'filter', 'key': 'gdpPercap', 'name': 'gdpPercap', 'editable': True, 'sortable': True, 'resizable': True, 'filterable': True, 'width': 147, 'left': 354}}, 'lifeExp': {'filterTerm': '4', 'column': {'rowType': 'filter', 'key': 'lifeExp', 'name': 'lifeExp', 'editable': True, 'sortable': True, 'resizable': True, 'filterable': True, 'width': 147, 'left': 501}}, 'pop': {'filterTerm': '3', 'column': {'rowType': 'filter', 'key': 'pop', 'name': 'pop', 'editable': True, 'sortable': True, 'resizable': True, 'filterable': True, 'width': 147, 'left': 648}}, 'year': {'filterTerm': '2007', 'column': {'rowType': 'filter', 'key': 'year', 'name': 'year', 'editable': True, 'sortable': True, 'resizable': True, 'filterable': True, 'width': 147, 'left': 795}}, 'country': {'filterTerm': 'af', 'column': {'rowType': 'filter', 'key': 'country', 'name': 'country', 'editable': True, 'sortable': True, 'resizable': True, 'filterable': True, 'width': 147, 'left': 207}}}
filters = {i:filters_raw[i]['filterTerm'] for i in filters_raw.keys()}

To get exact matches from a dict I can do this, based on this answer (Filter a pandas dataframe using values from a dict):

dff = df.loc[(df[list(filters)] == pd.Series(filters)).all(axis=1)]

But what if I want to filter the same way without being limited to exact matches, so that a row also matches when the value from the dict is contained as a substring in the dataframe? How would I do that?

The desired output is a dataframe containing only the rows that satisfy all the conditions simultaneously. With the filters above:

dff
continent      country    gdpPercap  lifeExp       pop  year
     Asia  Afghanistan  974.5803384   43.828  31889923  2007

Have a look at pandas.Series.str.contains, which accepts a regular expression. There are also other string-handling functions that may be better suited to what you need.
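As a rough sketch of the str.contains route, the snippet below builds one boolean mask per filtered column and combines them. The three-row frame is only an illustrative stand-in in the gapminder layout, the shortened filters dict is chosen for the example, and filter_contains is a hypothetical helper name, not something from pandas. It passes regex=False so the terms are treated as plain text, and case=False so that 'af' matches 'Afghanistan'.

```python
import pandas as pd

# Illustrative stand-in rows in the gapminder column layout (assumption, not the real file).
df = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Japan"],
    "year": [2007, 2007, 2007],
    "pop": [31889923, 3600523, 127467972],
    "continent": ["Asia", "Europe", "Asia"],
    "lifeExp": [43.828, 76.423, 82.603],
    "gdpPercap": [974.5803384, 5937.029526, 31656.06806],
})
filters = {"continent": "Asi", "country": "af", "year": "2007"}


def filter_contains(frame, terms):
    """Keep rows where every listed column contains its term (case-insensitive)."""
    mask = pd.Series(True, index=frame.index)
    for column, term in terms.items():
        # Cast to str so numeric columns can be searched as text as well.
        mask &= frame[column].astype(str).str.contains(term, case=False, regex=False)
    return frame[mask]


dff = filter_contains(df, filters)
print(dff)
```

Because the loop only visits the keys of the dict, columns without a filter term are simply left alone.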

One solution is to use pd.Series.str.startswith, which finds the values that begin with the strings in filters (a prefix match rather than a match anywhere in the value).

You can create a mask for those rows this way:

# Lower-case string copy of df; each column is tested against its own filter term.
mask = (df.astype(str)
          .apply(lambda col: col.str.lower())
          .apply(lambda col: col.str.startswith(filters[col.name].lower()))
          .all(axis=1))

Basically, you convert the original dataframe to lower-case strings and then go column by column, checking which elements start with the filter string for that column (i.e. filters['continent']). Finally, .all(axis=1) sets to True only the rows where every cell starts with its filter term.
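One caveat with looking up filters[x.name] for every column is that it raises a KeyError as soon as the dataframe has a column with no entry in filters. A minimal sketch of one way around that, assuming a filters dict that covers only some columns, is to select those columns first. The small frame and the names cols and as_text are illustrative only.

```python
import pandas as pd

# Illustrative stand-in rows (assumption, not the real gapminder file).
df = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Japan"],
    "year": [2007, 2007, 2007],
    "continent": ["Asia", "Europe", "Asia"],
})
filters = {"continent": "Asi", "country": "af"}  # no entry for 'year'

cols = list(filters)  # only the columns that actually have a filter term
as_text = df[cols].astype(str).apply(lambda col: col.str.lower())
mask = as_text.apply(
    lambda col: col.str.startswith(filters[col.name].lower())
).all(axis=1)

result = df[mask]
print(result)
```

The mask is computed on the selected columns only, but it is applied to the full dataframe, so the unfiltered columns still appear in the result.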

The result will be:

df[mask]

        country  year         pop continent  lifeExp   gdpPercap
11  Afghanistan  2007  31889923.0      Asia   43.828  974.580338

Hope it helps.

