Python Pandas str.contains() with hyperlinks in rows

Question

I have two pandas dataframes like so:

df1

site	link
retailer_site1	https://www.retailer_site1.com
...	...
retailer_siteX	https://www.retailer_siteX.com

df2

site	link
retailer_site1	https://www.retailer_site1.com
...	...
retailer_siteY	https://www.retailer_siteY.com

So I want to go through df2 and find instances of links from df2 in df1. Here's my code:

    for row in df2['link'].astype(str):
        boolean_findings = df1['link'].str.contains(row)

When I print boolean_findings, I get all False, which I know can't be right because I can see matches in my local Excel files:

boolean_findings
False
False
...
False

What I want to know is why the hyperlink string text is not being matched with its equivalent in the first df, and what I can do to match the sites.

Answer 1

" I took a look and noticed some websites have a ( and ) included in their links, which might be throwing off the links

It seems you only need to compare the alphanumeric/underscore characters of the links. You can use:

    df2["link"].str.replace(r'\W+', '', regex=True).isin(
        df1["link"].str.replace(r'\W+', '', regex=True))

The .str.replace(r'\W+', '', regex=True) part removes every run of non-word characters from the links, that is, anything other than letters, digits and the underscore. Punctuation such as :, /, ., ( and ) therefore no longer affects the comparison.
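As a rough end-to-end illustration of this approach, here is a small sketch. The frames, the URLs and the normalize helper are hypothetical stand-ins for the question's df1 and df2.

```python
import pandas as pd

# Hypothetical frames standing in for df1 and df2 from the question.
df1 = pd.DataFrame({
    "site": ["retailer_site1", "retailer_site2"],
    "link": ["https://www.retailer_site1.com",
             "https://www.retailer_site2.com/(sale)"],
})
df2 = pd.DataFrame({
    "site": ["retailer_site1", "retailer_site2", "retailer_site3"],
    "link": ["https://www.retailer_site1.com/",
             "https://www.retailer_site2.com/(sale)",
             "https://www.retailer_site3.com"],
})

def normalize(links):
    # Hypothetical helper: keep only letters, digits and underscores.
    return links.astype(str).str.replace(r"\W+", "", regex=True)

# True for each row of df2 whose normalized link also appears in df1.
found = normalize(df2["link"]).isin(normalize(df1["link"]))

# Rows of df2 that have a counterpart in df1.
matches = df2[found]
print(matches)
```

Unlike str.contains(), isin() checks for whole-value equality after normalization, not substrings. Stripping punctuation can also make two different URLs collapse to the same string, so it is worth spot-checking the matches.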
