How to select rows in Pandas dataframe based on string matching in multiple columns

I don't think this exact question has been answered yet, so here goes.

I have a Pandas data frame, and I want to select all rows that contain a string in column A OR column B.

Say the dataframe looks like this:

import pandas

d = {'id': ["1", "2", "3", "4"],
     'title': ["Horses are good", "Cats are bad", "Frogs are nice", "Turkeys are the best"],
     'description': ["Horse epitome", "Cats bad", "Frog fancier, horses good", "Turkey tome"],
     'tags': ["horse, cat", "horse, cat", "horse, frog", "turkey, horse"],
     'date': ["2019-01-01", "2019-10-01", "2018-08-14", "2016-11-29"]}

dataframe = pandas.DataFrame(d)

Which gives:

id   title                    description                   tags              date
1    "Horses are good"        "Horse epitome"               "horse, cat"      2019-01-01
2    "Cats are bad"           "Cats bad"                    "horse, cat"      2019-10-01
3    "Frogs are nice"         "Frog fancier, horses good"   "horse, frog"     2018-08-14
4    "Turkeys are the best"   "Turkey tome"                 "turkey, horse"   2016-11-29

Let's say I want to create a new dataframe containing the rows that have the string horse (ignoring capitalisation) in the column title OR the column description. A match in the column tags (or any other column) should not count.

The result should be (rows 2 and 4 are dropped):

id   title                    description                   tags              date
1    "Horses are good"        "Horse epitome"               "horse, cat"      2019-01-01
3    "Frogs are nice"         "Frog fancier, horses good"   "horse, frog"     2018-08-14

I have seen a few answers for a single column, such as:

dataframe[dataframe['title'].str.contains('horse')]

But I am not sure (1) how to add multiple columns to this statement and (2) how to modify it with something like string.lower() so that capitals in the column values are ignored for the string match.

Thanks in advance!

If you want to specify the columns to test, one possible solution is to join them and then test with Series.str.contains and case=False (the space separator keeps a match from spanning the boundary between two values):

s = dataframe['title'] + ' ' + dataframe['description']
df = dataframe[s.str.contains('horse', case=False)]
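One caveat with joining columns: if either value in a row is missing, the joined string is missing too, and str.contains returns a missing value for that row instead of True or False. The following is a minimal sketch of one way around that, assuming a small made-up frame (the names frame, joined and result are only for illustration). It fills missing values with empty strings and passes na=False:

```python
import pandas as pd

# Hypothetical sample data with one missing description
frame = pd.DataFrame({
    "title": ["Horses are good", "Cats are bad", "Frogs are nice"],
    "description": ["Horse epitome", None, "Frog fancier, horses good"],
})

# Replace missing values with empty strings and join with a space,
# so a match cannot span the boundary between the two columns
joined = frame["title"].fillna("") + " " + frame["description"].fillna("")

# na=False treats any remaining missing value as "no match"
mask = joined.str.contains("horse", case=False, na=False)
result = frame[mask]
```

Here the row with the missing description is simply treated as a non-match rather than breaking the boolean mask.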

Or create a condition for each column and chain them with bitwise OR, | :

df = dataframe[dataframe['title'].str.contains('horse', case=False) | 
               dataframe['description'].str.contains('horse', case=False)]

If you also want to exclude rows where another column matches, chain that column's condition with bitwise AND, &, and invert it with ~ for NOT MATCH:

df = dataframe[s.str.contains('horse', case=False) &
               ~dataframe['tags'].str.contains('horse', case=False)]

For the second solution, add () around the conditions chained by OR:

df = dataframe[(dataframe['title'].str.contains('horse', case=False) |
                dataframe['description'].str.contains('horse', case=False)) &
               ~dataframe['tags'].str.contains('horse', case=False)]
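The parentheses matter because & binds more tightly than | in Python, so without them the NOT condition would only apply to the last OR term. A small sketch on made-up data (the frame and the variable names here are assumptions for illustration) shows the two masks differ:

```python
import pandas as pd

# Hypothetical sample data; only the first row has "horse" in tags
frame = pd.DataFrame({
    "title": ["Horses are good", "Cats are bad", "Frogs are nice"],
    "description": ["Horse epitome", "Cats bad", "Frog fancier, horses good"],
    "tags": ["horse, cat", "cat", "frog"],
})

in_title = frame["title"].str.contains("horse", case=False)
in_desc = frame["description"].str.contains("horse", case=False)
in_tags = frame["tags"].str.contains("horse", case=False)

# Intended: (title OR description) AND NOT tags
grouped = (in_title | in_desc) & ~in_tags

# Without parentheses this is evaluated as: title OR (description AND NOT tags)
ungrouped = in_title | in_desc & ~in_tags
```

With grouping, the first row is excluded because its tags match. Without grouping, it slips through on the title match alone.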

EDIT:

As @WeNYoBen commented, you can add DataFrame.copy at the end to prevent a SettingWithCopyWarning, like:

s = dataframe['title'] + ' ' + dataframe['description']
df = dataframe[s.str.contains('horse', case=False)].copy()
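The copy becomes relevant once you modify the filtered result. As a rough sketch, assuming a small made-up frame and a hypothetical new column called matched, assigning to the copied subset changes only the subset:

```python
import pandas as pd

# Hypothetical sample data
frame = pd.DataFrame({
    "title": ["Horses are good", "Cats are bad"],
    "description": ["Horse epitome", "Cats bad"],
})

joined = frame["title"] + " " + frame["description"]

# .copy() makes the filtered result an independent dataframe
subset = frame[joined.str.contains("horse", case=False)].copy()

# Adding a column to the copy leaves the original frame untouched
subset["matched"] = True
```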

You can use a "logical or" operator | on the series corresponding to each column:

filtered = df[df['title'].str.contains('horse', case=False) | 
              df['description'].str.contains('horse', case=False)]

If you have many columns, you could use a reduce operation:

import functools
import operator

colnames = ['title', 'description']
mask = functools.reduce(operator.or_, (df[col].str.contains('horse', case=False) for col in colnames))
filtered = df[mask]    
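An alternative to reduce, sketched here under the assumption that every listed column holds strings, is to apply str.contains to each column and combine the results row-wise with any(axis=1). The sample frame below is made up for illustration:

```python
import pandas as pd

# Hypothetical sample data
frame = pd.DataFrame({
    "id": ["1", "2", "3", "4"],
    "title": ["Horses are good", "Cats are bad", "Frogs are nice", "Turkeys are the best"],
    "description": ["Horse epitome", "Cats bad", "Frog fancier, horses good", "Turkey tome"],
})

colnames = ["title", "description"]

# One boolean column per tested column, then True where any of them matched
per_column = frame[colnames].apply(lambda col: col.str.contains("horse", case=False))
mask = per_column.any(axis=1)
filtered = frame[mask]
```

Keeping the intermediate per_column frame around can also help when you want to see which column produced the match.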
