Python 'str.contains' function not returning correct values

Question

I am trying to subset a dataframe using 'pandas' if the column matches a particular pattern. Below is a reproducible example for reference.

import pandas as pd

# Create Dataframe having 10 rows and 2 columns 'code' and 'URL'
df = pd.DataFrame({'code': [1,1,2,2,3,4,1,2,2,5],
                   'URL': ['www.abc.de','https://www.abc.fr/-de','www.abc.fr','www.abc.fr','www.abc.co.uk','www.abc.es','www.abc.de','www.abc.fr','www.abc.fr','www.abc.it']})

# Create a new dataframe keeping only the rows where the column 'code' is equal to 1
new_df = df[df['code'] == 1]

# This is what the new dataframe looks like
print(new_df)
                      URL  code
0              www.abc.de     1
1  https://www.abc.fr/-de     1
6              www.abc.de     1

Below are the dtypes for reference.

print(new_df.dtypes)
URL     object
code     int64
dtype: object

# Now I am trying to exclude all those rows where the 'URL' column does not have .de as the pattern. This should retain only the 2nd row in new_df from above output
new_df = new_df[~ new_df['URL'].str.contains(r".de", case = True)]

# This is what the output looks like
print(new_df)
Empty DataFrame
Columns: [URL, code]
Index: []

Below are my questions.

1) Why is the 'URL' column appearing first even though I defined the 'code' column first?

2) What is wrong in my code when I am trying to remove all those rows where the 'URL' column does not have the pattern .de? In R, I would simply use the code below to get the desired result.

new_df <- new_df[grep(".de",new_df$URL, fixed = TRUE, invert = TRUE), ]

Desired output should be as below.

# Desired output for new_df
                   URL  code
https://www.abc.fr/-de     1

Any guidance on this would be really appreciated.

Answer 1

Why is the 'URL' column appearing first even though I defined the 'code' column first?

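This is a consequence of the fact that dictionaries are not ordered in older Python versions (insertion order is only guaranteed from Python 3.7). When a DataFrame is built from a plain dict there, the columns do not follow the order in which the keys were written; older pandas releases sort the keys, and 'URL' sorts before 'code' because uppercase letters come first. From pandas 0.23 on Python 3.6 or later, the insertion order of the dict is kept.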
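
If the column order matters, it can be pinned explicitly instead of being left to the dictionary. A minimal sketch, assuming a shortened version of the data from the question; the columns argument of pd.DataFrame selects and orders the dictionary keys:

```python
import pandas as pd

# Shortened copy of the question's data (first three rows only)
data = {'code': [1, 1, 2],
        'URL': ['www.abc.de', 'https://www.abc.fr/-de', 'www.abc.fr']}

# Passing columns= fixes the column order regardless of dict ordering
ordered_df = pd.DataFrame(data, columns=['code', 'URL'])

print(list(ordered_df.columns))
```

Reindexing an existing dataframe with df[['code', 'URL']] should give the same ordering after the fact.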

What is wrong in my code when I am trying to remove all those rows where the 'URL' column does not have the pattern .de?

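You'd need to escape the ., because it is a special regex metacharacter that matches any character. Unescaped, .de also matches the -de at the end of the second URL, so every row counts as a match and is removed by the negation.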

df[df.code.eq(1) & ~df.URL.str.contains(r'\.de$', case=True)]

                      URL  code
1  https://www.abc.fr/-de     1
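
To see the difference in isolation, here is a small sketch comparing three variants on the URLs from the question that have code 1. The variable names are only illustrative. Passing regex=False makes str.contains treat the pattern as a literal string, which is probably the closest analogue to fixed = TRUE in R:

```python
import pandas as pd

urls = pd.Series(['www.abc.de', 'https://www.abc.fr/-de', 'www.abc.de'])

unescaped = urls.str.contains(r'.de')             # '.' matches any character
escaped = urls.str.contains(r'\.de')              # literal dot followed by 'de'
literal = urls.str.contains('.de', regex=False)   # plain substring search

print(unescaped.tolist())  # [True, True, True]
print(escaped.tolist())    # [True, False, True]
print(literal.tolist())    # [True, False, True]
```

Negating either of the last two masks with ~ keeps only the second URL, which is the desired output.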

This may not be sufficient if the .de TLD is not at the very end of the string, for example when it is followed by a path. Here's a more general solution addressing that limitation:

import re

# Raw string so the backslashes reach the regex engine unchanged
p = r'''.*       # match anything, greedily
        \.       # literal dot
        de       # "de"
        (?!.*    # negative lookahead
        \.       # literal dot (should not be found)
        )'''
df[df.code.eq(1) & ~df.URL.str.contains(p, case=True, flags=re.VERBOSE)]

                      URL  code
1  https://www.abc.fr/-de     1
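
The two patterns differ only on URLs where something follows the TLD. A rough sketch using the re module directly, with hypothetical sample URLs that are not part of the question's data; the leading .* is left out here because re.search scans the whole string anyway:

```python
import re

# Same idea as the pattern above: ".de" with no further dot after it
tld_de = re.compile(r'''
    \.de        # literal ".de"
    (?!.*\.)    # no other dot anywhere after it
    ''', re.VERBOSE)

# Hypothetical URLs for illustration only
samples = ['www.abc.de', 'www.abc.de/path',
           'www.abc.fr/-de', 'www.abc.de.example.com']

end_anchored = [bool(re.search(r'\.de$', u)) for u in samples]
lookahead = [bool(tld_de.search(u)) for u in samples]

print(end_anchored)  # [True, False, False, False]
print(lookahead)     # [True, True, False, False]
```

Note that the lookahead version would still match a path segment such as /file.de, so it is a heuristic rather than a real URL parser.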

solution1
3 ACCEPTED 2018-01-19 07:38:14
