
Python 'str.contains' function not returning correct values

I am trying to subset a dataframe using pandas when a column matches a particular pattern. Below is a reproducible example for reference.

import pandas as pd

# Create Dataframe having 10 rows and 2 columns 'code' and 'URL'
df = pd.DataFrame({'code': [1,1,2,2,3,4,1,2,2,5],
                   'URL': ['www.abc.de','https://www.abc.fr/-de','www.abc.fr','www.abc.fr','www.abc.co.uk','www.abc.es','www.abc.de','www.abc.fr','www.abc.fr','www.abc.it']})

# Create a new dataframe that keeps only the rows where the column 'code' is equal to 1
new_df = df[df['code'] == 1]

# Below is what the new dataframe looks like
print(new_df)
                      URL  code
0              www.abc.de     1
1  https://www.abc.fr/-de     1
6              www.abc.de     1

Below are the dtypes for reference.

print(new_df.dtypes)
URL     object
code     int64
dtype: object

# Now I am trying to keep only those rows where the 'URL' column does not contain the pattern .de. This should retain only the 2nd row of new_df from the output above
new_df = new_df[~ new_df['URL'].str.contains(r".de", case = True)]

# Below is what the output looks like
print(new_df)
Empty DataFrame
Columns: [URL, code]
Index: []

Below are my questions. 1) Why is the 'URL' column appearing first even though I defined the 'code' column first?

2) What is wrong in my code when I try to keep only the rows where the 'URL' column does not contain the pattern .de? In R, I would simply use the code below to get the desired result.

new_df <- new_df[grep(".de",new_df$URL, fixed = TRUE, invert = TRUE), ]

The desired output is shown below.

# Desired output for new_df
                   URL  code
https://www.abc.fr/-de     1

Any guidance on this would be really appreciated.

Why is the 'URL' column appearing first even though I defined the 'code' column first?

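This is a consequence of the fact that dictionaries were not ordered in older versions of Python (insertion order is only guaranteed from Python 3.7 onward). pandas therefore could not recover the order in which you wrote the keys, and older pandas releases sorted them instead, which puts 'URL' ahead of 'code' because uppercase letters sort first. With a current Python and pandas, the columns come out in the order you defined them.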
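If the column order matters, it can be stated explicitly instead of relying on the dictionary. The snippet below is a minimal sketch of that idea; the two-row data is a shortened stand-in for the frame in the question, and the variable names are only illustrative.

```python
import pandas as pd

# Shortened stand-in for the data in the question
data = {'code': [1, 2], 'URL': ['www.abc.de', 'www.abc.fr']}

# Passing columns= fixes the order regardless of how the dict is iterated
ordered = pd.DataFrame(data, columns=['code', 'URL'])
print(list(ordered.columns))

# An existing frame can be reordered by selecting the columns in the desired order
reordered = pd.DataFrame(data)[['code', 'URL']]
print(list(reordered.columns))
```

Both approaches give the same result on old and new pandas versions, so they are a safe choice when the code has to run in more than one environment.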


What is wrong in my code when I try to keep only the rows where the 'URL' column does not contain the pattern .de?

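You need to escape the '.', because it is a special regex metacharacter that matches any character. Unescaped, the pattern .de also matches the "-de" in the second URL, so every row counts as a match and the negation leaves nothing.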

df[df.code.eq(1) & ~df.URL.str.contains(r'\.de$', case=True)]

                      URL  code
1  https://www.abc.fr/-de     1
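To see why the original call returned an empty frame, it can help to compare the masks side by side. This is a small sketch using the three URLs from new_df; the variable names are illustrative. It also shows regex=False, which makes str.contains treat the pattern as a plain substring and is probably the closest match to fixed = TRUE in the R call.

```python
import pandas as pd

urls = pd.Series(['www.abc.de', 'https://www.abc.fr/-de', 'www.abc.de'])

# Unescaped: '.' matches any character, so '-de' counts as a match too
unescaped = urls.str.contains(r'.de')

# Escaped: only a literal '.de' matches
escaped = urls.str.contains(r'\.de')

# regex=False: the pattern is treated as a plain substring, no escaping needed
literal = urls.str.contains('.de', regex=False)

print(unescaped.tolist())  # [True, True, True]
print(escaped.tolist())    # [True, False, True]
print(literal.tolist())    # [True, False, True]

# Negating the literal mask keeps only the URL without '.de'
print(urls[~literal].tolist())  # ['https://www.abc.fr/-de']
```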

This may not be sufficient if .de is not at the very end of the string, for example when the TLD is followed by something else. Here is a more general solution that addresses that limitation:

import re

# Raw string so the backslashes reach the regex engine unchanged
p = r'''.*       # match anything, greedily
        \.       # literal dot
        de       # "de"
        (?!.*    # negative lookahead
        \.       # literal dot (should not be found)
        )'''
df[df.code.eq(1) & ~df.URL.str.contains(p, case=True, flags=re.VERBOSE)]

                      URL  code
1  https://www.abc.fr/-de     1 
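The lookahead pattern can be checked on individual strings with the re module directly. The sketch below uses a shortened form of the same idea (the leading .* is dropped because a search scans the whole string anyway). The URLs 'www.abc.de/index' and 'www.detail.fr' are made up for illustration and do not come from the question.

```python
import re

pattern = re.compile(r'''
    \.de        # literal ".de"
    (?!.*\.)    # no further dot anywhere after it
''', re.VERBOSE)

samples = [
    'www.abc.de',              # ".de" at the very end
    'www.abc.de/index',        # ".de" followed by a path without dots
    'https://www.abc.fr/-de',  # "-de" but no ".de"
    'www.detail.fr',           # ".de" followed later by another dot
]

results = [bool(pattern.search(url)) for url in samples]
for url, matched in zip(samples, results):
    print(url, matched)
# www.abc.de True
# www.abc.de/index True
# https://www.abc.fr/-de False
# www.detail.fr False
```

One caveat with this approach: a URL such as 'www.abc.de/index.html' would not match, because a dot still follows ".de". Whether that matters depends on what the URLs in the real data look like.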


 