How to remove rows from a pandas DataFrame by a condition on string values?

Question

Consider a pandas DataFrame like:

>>> import pandas as pd
>>> df = pd.DataFrame(dict(url=['http://url1.com', 'http://www.url1.com', 'http://www.url2.com','http://www.url3.com','http://www.url1.com']))
>>> df

Giving:

                   url
0      http://url1.com
1  http://www.url1.com
2  http://www.url2.com
3  http://www.url3.com
4  http://www.url1.com

I want to remove all rows containing url1.com or url2.com to obtain a DataFrame like:

                   url
0  http://www.url3.com

I tried this:

domainToCheck = ('url1.com', 'url2.com')
goodUrl = df['url'].apply(lambda x : any(domain in x for domain in domainToCheck))

But this gives me no result.

Any idea how to solve the above problem?

Edit: Solution

import pandas as pd
import tldextract

df = pd.DataFrame(dict(url=['http://url1.com', 'http://www.url1.com','http://www.url2.com','http://www.url3.com','http://www.url1.com']))
domainToCheck = ['url1', 'url2']
s = df.url.map(lambda x : tldextract.extract(x).domain).isin(domainToCheck)
df = df[~s].reset_index(drop=True)

Answer 1

When checking domains, we should look for an exact match on the domain rather than use a substring test, since a subdomain may contain the same keyword as the domain.

import tldextract

s=df.url.map(lambda x : tldextract.extract(x).domain).isin(['url1','url2'])
Out[594]: 
0     True
1     True
2     True
3    False
4     True
Name: url, dtype: bool

df=df[~s]
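
To illustrate why an exact host match can differ from a substring test, here is a minimal sketch that swaps tldextract for the standard library's urllib.parse, so it is not the method used in this answer. The extra URLs, the name host_is_blocked, and the rule "equal to a blocked domain or a subdomain of it" are assumptions made for the example.

```python
from urllib.parse import urlsplit

import pandas as pd

# Hypothetical data: the last two rows only *contain* "url1.com" as a substring.
df = pd.DataFrame({"url": [
    "http://url1.com",
    "http://www.url1.com",
    "http://www.url3.com",
    "http://www.myurl1.com",
    "http://url1.com.example.org",
]})
blocked = ("url1.com", "url2.com")


def host_is_blocked(url: str) -> bool:
    """True if the host equals a blocked domain or is a subdomain of one."""
    host = urlsplit(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in blocked)


# Substring test, as in the question: also flags the two look-alike hosts.
substring_mask = df["url"].apply(lambda u: any(d in u for d in blocked))

# Host-based test: flags only url1.com and www.url1.com.
host_mask = df["url"].apply(host_is_blocked)

print(df[~substring_mask]["url"].tolist())
print(df[~host_mask]["url"].tolist())
```

The substring version keeps only the url3 row, while the host-based version also keeps the two look-alike hosts. Which behaviour is right depends on what "containing url1.com" should mean for your data.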

Answer 2

You can use pd.Series.str.contains here.

df[~df.url.str.contains('|'.join(domainToCheck))]

                   url
3  http://www.url3.com

If you want to reset the index, use this:

df[~df.url.str.contains('|'.join(domainToCheck))].reset_index(drop=True)

                   url
0  http://www.url3.com
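
One caveat worth sketching: str.contains treats the pattern as a regular expression by default, so the "." in 'url1.com' matches any character. Escaping each domain with re.escape before joining may be safer. The third URL below is a hypothetical row added only to show the difference.

```python
import re

import pandas as pd

df = pd.DataFrame({"url": [
    "http://www.url1.com",
    "http://www.url3.com",
    "http://www.url1xcom.net",  # hypothetical row: no literal "url1.com" in it
]})
domainToCheck = ("url1.com", "url2.com")

# Unescaped: "." is a regex wildcard, so "url1.com" also matches "url1xcom".
raw_pattern = "|".join(domainToCheck)

# Escaped: each domain is matched literally.
safe_pattern = "|".join(re.escape(d) for d in domainToCheck)

raw_kept = df[~df["url"].str.contains(raw_pattern)]
safe_kept = df[~df["url"].str.contains(safe_pattern)]

print(raw_kept["url"].tolist())
print(safe_kept["url"].tolist())
```

With the unescaped pattern the hypothetical url1xcom.net row is dropped as well; with the escaped pattern it is kept.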

Answer 3

Use Series.str.contains to create a boolean mask m, then filter the DataFrame df with this mask:

m = df['url'].str.contains('|'.join(domainToCheck))
df = df[~m].reset_index(drop=True)

Result:

                   url
0  http://www.url3.com
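
For comparison, the apply-based expression from the question already builds a usable mask; it appears to give "no result" only because the mask is never used to index the DataFrame. A minimal sketch, where the names bad and result are introduced just for illustration:

```python
import pandas as pd

df = pd.DataFrame(dict(url=[
    'http://url1.com',
    'http://www.url1.com',
    'http://www.url2.com',
    'http://www.url3.com',
    'http://www.url1.com',
]))
domainToCheck = ('url1.com', 'url2.com')

# True for rows whose URL contains any of the unwanted domains.
bad = df['url'].apply(lambda x: any(domain in x for domain in domainToCheck))

# The mask alone changes nothing; index with its negation to drop those rows.
result = df[~bad].reset_index(drop=True)
print(result)
```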

Answer 1: score 2, accepted, 2020-05-29 15:44:34
Answer 2: score 1, 2020-05-29 15:38:22
Answer 3: score 1, 2020-05-29 15:40:50
