Python 'str.contains' function does not return the correct value

Question

I am trying to subset a data frame with pandas based on whether a column matches a particular pattern. Below is a reproducible example for reference.

import pandas as pd

# Create Dataframe having 10 rows and 2 columns 'code' and 'URL'
df = pd.DataFrame({'code': [1,1,2,2,3,4,1,2,2,5],
                   'URL': ['www.abc.de','https://www.abc.fr/-de','www.abc.fr','www.abc.fr','www.abc.co.uk','www.abc.es','www.abc.de','www.abc.fr','www.abc.fr','www.abc.it']})

# Create a new dataframe keeping only the rows where the column 'code' is equal to 1
new_df = df[df['code'] == 1]

# Below is what the new dataframe looks like
print(new_df)
                      URL  code
0              www.abc.de     1
1  https://www.abc.fr/-de     1
6              www.abc.de     1

Below are the dtypes for reference.

print(new_df.dtypes)
URL     object
code     int64
dtype: object

# Now I am trying to exclude all rows whose 'URL' column contains the literal pattern .de.
# This should retain only the 2nd row in new_df from the output above
new_df = new_df[~ new_df['URL'].str.contains(r".de", case = True)]

# Below is what the output looks like
print(new_df)
Empty DataFrame
Columns: [URL, code]
Index: []

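Here are my questions. 1) Why does the 'URL' column appear first even though I defined the 'code' column first?

2) What is wrong with my code when I try to drop all rows whose 'URL' column contains the pattern .de? In R, I can easily get the desired result with the following code.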

new_df <- new_df[grep(".de",new_df$URL, fixed = TRUE, invert = TRUE), ]

The desired output should look like this.

# Desired output for new_df
                   URL  code
https://www.abc.fr/-de     1

Any guidance on this would be greatly appreciated.

Answer 1

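Why does the 'URL' column appear first even though I defined the 'code' column first?

This is a consequence of dictionaries being unordered. Depending on the Python interpreter's random hash initialisation, the columns may be read and created in any order. (With Python 3.7+ and pandas 0.23 or later, dict insertion order is preserved, so 'code' would come first.)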
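If the column order matters, one option is to state it explicitly rather than rely on the dict. The following is a minimal sketch, not part of the original answer; it uses a shortened copy of the question's data, and the names data, df_ordered and df_reordered are only illustrative.

```python
import pandas as pd

# Shortened copy of the question's data (assumption: three rows are enough to illustrate)
data = {'code': [1, 1, 2],
        'URL': ['www.abc.de', 'https://www.abc.fr/-de', 'www.abc.fr']}

# Passing columns= fixes the column order at construction time
df_ordered = pd.DataFrame(data, columns=['code', 'URL'])

# An existing frame can be reordered by selecting the columns in the desired order
df_reordered = df_ordered[['URL', 'code']]

print(list(df_ordered.columns))
print(list(df_reordered.columns))
```

Both approaches work the same way on any Python or pandas version, since the order comes from the list and not from the dict.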

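What is wrong with my code when I try to drop all rows whose 'URL' column contains the pattern .de?

You need to escape the dot (.), because it is a special regex metacharacter that matches any character. Unescaped, .de also matches the -de in the second URL, so every row matched and the negation left nothing.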

df[df.code.eq(1) & ~df.URL.str.contains(r'\.de$', case=True)]

                      URL  code
1  https://www.abc.fr/-de     1
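For a literal substring test closer to R's fixed = TRUE, str.contains also accepts regex=False. The sketch below rebuilds the three code == 1 URLs from the question by hand (so the index is 0, 1, 2 rather than 0, 1, 6; the name urls is only illustrative) and compares the three variants.

```python
import pandas as pd

# The three URLs that remain after filtering on code == 1 (rebuilt by hand)
urls = pd.Series(['www.abc.de', 'https://www.abc.fr/-de', 'www.abc.de'])

# Unescaped dot: "." matches any character, so "-de" matches too
unescaped = urls.str.contains(r'.de')

# Escaped dot: only a literal ".de" matches
escaped = urls.str.contains(r'\.de')

# regex=False: the pattern is treated as a plain substring, no escaping needed
literal = urls.str.contains('.de', regex=False)

print(unescaped.tolist())       # [True, True, True]
print(escaped.tolist())         # [True, False, True]
print(literal.tolist())         # [True, False, True]
print(urls[~literal].tolist())  # ['https://www.abc.fr/-de']
```

Note that neither the escaped nor the literal variant anchors the match at the end of the string, which the answer's \.de$ pattern does.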

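This may not be so straightforward if de can be found anywhere following the TLD (and not at the very end). Here is a more general solution that addresses that limitation:

import re

# Verbose regex: a literal ".de" that is not followed by any later dot
p = r'''.*       # match anything, greedily
       \.       # literal dot
       de       # "de"
       (?!.*    # negative lookahead
       \.       # literal dot (should not be found)
       )'''
df[df.code.eq(1) & ~df.URL.str.contains(p, case=True, flags=re.VERBOSE)]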

                      URL  code
1  https://www.abc.fr/-de     1
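To see which strings the verbose pattern accepts, it can be tried with re.search directly, which is the kind of search str.contains performs for regex patterns. This is only an illustrative sketch: apart from the first two, the sample URLs are made up and do not come from the question.

```python
import re

# Same verbose pattern as above: a literal ".de" with no dot anywhere after it
pattern = re.compile(r'''.*       # match anything, greedily
                         \.       # literal dot
                         de       # "de"
                         (?!.*    # negative lookahead
                         \.       # literal dot (should not be found)
                         )''', flags=re.VERBOSE)

# Hypothetical sample URLs; only the first two appear in the question
samples = [
    'www.abc.de',
    'https://www.abc.fr/-de',
    'www.abc.de/page',
    'www.decor.fr',
    'www.abc.de/index.html',
]

# Map each URL to whether the pattern is found somewhere in it
results = {url: bool(pattern.search(url)) for url in samples}

for url, matched in results.items():
    print(url, matched)
```

The last sample shows a limit of the approach: a dot later in the path (index.html) makes the lookahead fail, so that .de URL is not detected. For URLs with paths, extracting the host name first may be more robust.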
