How to drop rows by condition on string value in pandas dataframe?

Consider a pandas DataFrame like:

>>> import pandas as pd
>>> df = pd.DataFrame(dict(url=['http://url1.com', 'http://www.url1.com', 'http://www.url2.com','http://www.url3.com','http://www.url1.com']))
>>> df

Giving:

                   url
0      http://url1.com
1  http://www.url1.com
2  http://www.url2.com
3  http://www.url3.com
4  http://www.url1.com

I want to remove all rows containing url1.com and url2.com to obtain a dataframe like:

                   url
0  http://www.url3.com

I do this:

domainToCheck = ('url1.com', 'url2.com')
goodUrl = df['url'].apply(lambda x : any(domain in x for domain in domainToCheck))

But this gives me no result.

Any idea how to solve the above problem?

Edit: Solution

import pandas as pd
import tldextract

df = pd.DataFrame(dict(url=['http://url1.com', 'http://www.url1.com','http://www.url2.com','http://www.url3.com','http://www.url1.com']))
domainToCheck = ['url1', 'url2']
s = df.url.map(lambda x : tldextract.extract(x).domain).isin(domainToCheck)
df = df[~s].reset_index(drop=True)

If we are checking the domain, we should look for an exact match on the domain rather than use a substring test, since the subdomain may contain the same keyword as the domain.

import tldextract

s = df.url.map(lambda x: tldextract.extract(x).domain).isin(['url1', 'url2'])
s
Out[594]: 
0     True
1     True
2     True
3    False
4     True
Name: url, dtype: bool

df=df[~s]
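
To illustrate the subdomain concern without an extra dependency, here is a minimal sketch that uses the standard library's `urllib.parse.urlsplit` instead of `tldextract`. It is a substitute technique, not the method above: it compares the hostname against full domains such as `url1.com`. The helper name `is_blocked`, the tuple `blocked_domains`, and the sample URL `http://url1.com.example.org` are made up for the example.

```python
from urllib.parse import urlsplit

blocked_domains = ("url1.com", "url2.com")  # hypothetical list of full domains


def is_blocked(url, domains=blocked_domains):
    """Return True if the URL's host is one of `domains` or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in domains)


urls = [
    "http://url1.com",
    "http://www.url1.com",
    "http://www.url3.com",
    "http://url1.com.example.org",  # made-up URL: contains "url1.com" but belongs to example.org
]

# A plain substring test flags the last URL; the hostname comparison does not
substring_hits = [any(d in u for d in blocked_domains) for u in urls]
host_hits = [is_blocked(u) for u in urls]

print(substring_hits)
print(host_hits)
```

With a DataFrame, `df[~df.url.map(is_blocked)]` would apply the same check row by row.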

You can use pd.Series.str.contains here.

df[~df.url.str.contains('|'.join(domainToCheck))]

                   url
3  http://www.url3.com

If you want to reset the index, use this:

df[~df.url.str.contains('|'.join(domainToCheck))].reset_index(drop=True)

                   url
0  http://www.url3.com
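
One caveat with this approach: `str.contains` treats the pattern as a regular expression by default, so the `.` in `url1.com` matches any character. Below is a hedged sketch of escaping each domain with `re.escape` before joining. The extra URL `http://www.url1xcom.net` is invented to show the difference, and the names `loose`, `strict`, and `result` are illustrative.

```python
import re

import pandas as pd

df = pd.DataFrame({"url": [
    "http://www.url1.com",
    "http://www.url3.com",
    "http://www.url1xcom.net",  # made-up URL: matches 'url1.com' only when '.' is a wildcard
]})
domainToCheck = ("url1.com", "url2.com")

# Unescaped: '.' is a regex wildcard, so 'url1xcom' also matches
loose = df.url.str.contains("|".join(domainToCheck))

# Escaped: each domain is matched literally
pattern = "|".join(re.escape(d) for d in domainToCheck)
strict = df.url.str.contains(pattern)

# Drop only the rows that literally contain one of the domains
result = df[~strict].reset_index(drop=True)
print(result)
```

For the sample data in the question both patterns select the same rows, so escaping only matters when URLs like the invented one can occur.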

Use Series.str.contains to create a boolean mask m, then filter the dataframe df with this mask:

m = df['url'].str.contains('|'.join(domainToCheck))
df = df[~m].reset_index(drop=True)

Result:

                   url
0  http://www.url3.com
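
The `apply` expression from the question already builds this kind of boolean mask. What seems to be missing there is the step that filters the DataFrame with it. A minimal sketch under that reading, where the names `is_bad` and `result` are illustrative:

```python
import pandas as pd

df = pd.DataFrame(dict(url=[
    "http://url1.com",
    "http://www.url1.com",
    "http://www.url2.com",
    "http://www.url3.com",
    "http://www.url1.com",
]))
domainToCheck = ("url1.com", "url2.com")

# True for every row whose URL contains one of the unwanted domains
is_bad = df["url"].apply(lambda x: any(domain in x for domain in domainToCheck))

# Invert the mask to keep the remaining rows, then renumber the index
result = df[~is_bad].reset_index(drop=True)
print(result)
```

This uses plain substring tests rather than a regular expression, so the `.` in each domain is matched literally.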
