Filter a pandas dataframe on multiple columns using values from a dict for partial string matching

Question

I need to filter a dataframe on multiple values from a dict:

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/gapminderDataFiveYear.csv')
filters_raw = {'continent': {'filterTerm': 'Asi', 'column': {'rowType': 'filter', 'key': 'continent', 'name': 'continent', 'editable': True, 'sortable': True, 'resizable': True, 'filterable': True, 'width': 147, 'left': 60}}, 'gdpPercap': {'filterTerm': '9', 'column': {'rowType': 'filter', 'key': 'gdpPercap', 'name': 'gdpPercap', 'editable': True, 'sortable': True, 'resizable': True, 'filterable': True, 'width': 147, 'left': 354}}, 'lifeExp': {'filterTerm': '4', 'column': {'rowType': 'filter', 'key': 'lifeExp', 'name': 'lifeExp', 'editable': True, 'sortable': True, 'resizable': True, 'filterable': True, 'width': 147, 'left': 501}}, 'pop': {'filterTerm': '3', 'column': {'rowType': 'filter', 'key': 'pop', 'name': 'pop', 'editable': True, 'sortable': True, 'resizable': True, 'filterable': True, 'width': 147, 'left': 648}}, 'year': {'filterTerm': '2007', 'column': {'rowType': 'filter', 'key': 'year', 'name': 'year', 'editable': True, 'sortable': True, 'resizable': True, 'filterable': True, 'width': 147, 'left': 795}}, 'country': {'filterTerm': 'af', 'column': {'rowType': 'filter', 'key': 'country', 'name': 'country', 'editable': True, 'sortable': True, 'resizable': True, 'filterable': True, 'width': 147, 'left': 207}}}
filters = {i:filters_raw[i]['filterTerm'] for i in filters_raw.keys()}

To get exact matches with a dict I can do this, based on this answer (Filter a pandas dataframe using values from a dict):

dff = df.loc[(df[list(filters)] == pd.Series(filters)).all(axis=1)]

But I want to filter the same way without being limited to exact matches: a row should also match when the value from the dict is contained as a substring in the dataframe. How would I do that?

The desired output is a dataframe containing only the rows that satisfy all the conditions simultaneously. With the filters above:

dff
Asia Afghanistan 974.5803384 43.828 31889923 2007

Answer 1

Have a look at pandas.Series.str.contains, which accepts a regular expression. There are also other string handling functions that may be better suited to what you need.
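As a minimal sketch of that suggestion (not the answerer's own code), the snippet below builds a small stand-in frame with some of the gapminder column names, casts each filtered column to string, and combines one str.contains mask per dict entry. The rows and the names toy and mask are made up for illustration; regex=False treats each term as a literal substring and case=False ignores case.

```python
import pandas as pd

# Hypothetical stand-in rows; the question loads the gapminder CSV instead.
toy = pd.DataFrame({
    'country': ['Afghanistan', 'Albania', 'Japan'],
    'continent': ['Asia', 'Europe', 'Asia'],
    'year': [2007, 2007, 2007],
    'lifeExp': [43.828, 76.423, 82.603],
})
filters = {'continent': 'Asi', 'country': 'af', 'year': '2007', 'lifeExp': '4'}

# Start with every row selected, then AND in one substring test per filter.
mask = pd.Series(True, index=toy.index)
for column, term in filters.items():
    as_text = toy[column].astype(str)
    mask = mask & as_text.str.contains(term, case=False, regex=False)

dff = toy[mask]
print(dff)
```

Because this is a substring test, the term '4' matches anywhere in the value (76.423 as well as 43.828), so a row is dropped only when some other filter rejects it.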

Answer 2

One solution is to use pd.Series.str.startswith to find the strings that start with the terms in filters.

You can create a mask for those rows this way:

mask = df.astype(str).apply(lambda x: x.str.lower()
        ).apply(lambda x: x.str.startswith(filters[x.name].lower()),
                axis=0).all(axis=1)

Basically, you convert the original dataframe to lower-case strings and then go column by column, checking which elements start with the filter string for that column (i.e. filters['continent']). Finally, all(axis=1) marks as True only the rows where every cell starts with its filter term.

The result will be:

df[mask]

        country  year         pop continent  lifeExp   gdpPercap
11  Afghanistan  2007  31889923.0      Asia   43.828  974.580338
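One caveat with the mask above: filters[x.name] is looked up for every column of df, so it raises a KeyError as soon as the frame has a column without a filter term. A possible variant, sketched here on made-up rows, restricts the check to the columns named in the dict. The helper name prefix_mask and the toy data are assumptions for illustration.

```python
import pandas as pd

def prefix_mask(frame, filters):
    """Return a boolean Series: True where every filtered column starts with its term."""
    # Only look at the columns that have a filter term, as lower-case strings.
    subset = frame[list(filters)].astype(str).apply(lambda col: col.str.lower())
    # Test each column against its own term, then require all columns to match.
    hits = subset.apply(lambda col: col.str.startswith(filters[col.name].lower()))
    return hits.all(axis=1)

# Hypothetical rows; 'note' has no filter term and is simply ignored.
toy = pd.DataFrame({
    'country': ['Afghanistan', 'Albania'],
    'continent': ['Asia', 'Europe'],
    'note': ['x', 'y'],
})
mask = prefix_mask(toy, {'continent': 'Asi', 'country': 'af'})
print(toy[mask])
```

Swapping str.startswith for str.contains in the helper would turn the prefix test into the substring test the question asks about.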

Hope it helps.

Answer 1
0 2018-10-28 14:14:27

Answer 2
0 Accepted 2018-10-28 19:50:26
