
Python Pandas str.contains() with hyperlinks in rows

I have two pandas dataframes like so:

df1

site link
retailer_site1 https://www.retailer_site1.com
... ...
retailer_siteX https://www.retailer_siteX.com

df2

site link
retailer_site1 https://www.retailer_site1.com
... ...
retailer_siteY https://www.retailer_siteY.com

So I want to go through df2 and find instances of links from df2 in df1. Here's my code:

    for row in df2['link'].astype(str):
        boolean_findings = df1['link'].str.contains(row)

When I print boolean_findings, I get all False, which I know can't be right because I can see matches in my local Excel files:

boolean_findings
False
False
...
False

What I want to know is why the hyperlink text is not being matched with its equivalent in the first df, and what I can do to match the sites.

I took a look and noticed that some websites have a ( and ) in their links, which might be throwing off the matching.

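To illustrate why parentheses can matter here: `str.contains` treats its argument as a regular expression by default, so `(` and `)` form a group instead of matching literally. The sketch below uses a made-up URL and made-up variable names (`links`, `needle`), not data from the question, and compares the default behaviour with a literal match.

```python
import re
import warnings

import pandas as pd

# Hypothetical sample data; these URLs are invented for illustration.
links = pd.Series([
    "https://www.example.com/item_(1)",
    "https://www.example.com/other",
])
needle = "https://www.example.com/item_(1)"

# Default (regex=True): "(1)" is read as a group matching "1", so the
# literal parentheses in the data are never matched.
# pandas warns that the pattern has match groups; the warning is silenced
# here only to keep the demo output clean.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    as_regex = links.str.contains(needle)

# regex=False does a plain substring search.
as_literal = links.str.contains(needle, regex=False)

# Escaping the metacharacters gives the same result while staying in regex mode.
as_escaped = links.str.contains(re.escape(needle))

print(as_regex.tolist())    # [False, False]
print(as_literal.tolist())  # [True, False]
print(as_escaped.tolist())  # [True, False]
```

Separately, the loop in the question reassigns boolean_findings on every iteration, so only the result for the last row of df2 is left when it is printed.
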
It seems you only need to account for alphanumeric/underscore chars when comparing the links. You can use

    df2["link"].str.replace(r'\W+', '', regex=True).isin(
        df1["link"].str.replace(r'\W+', '', regex=True))

The .str.replace(r'\W+', '', regex=True) part removes every char other than letters, diacritics, digits and connector punctuation (the underscore being the most common of these) from the links before they are compared.
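As a minimal sketch of how this could be applied end to end, the example below builds two small frames shaped like the ones in the question. The rows, the trailing slash, the `normalize` helper and the `in_df1` column are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical frames mirroring the layout in the question.
df1 = pd.DataFrame({
    "site": ["retailer_site1", "retailer_siteX"],
    "link": ["https://www.retailer_site1.com",
             "https://www.retailer_siteX.com"],
})
df2 = pd.DataFrame({
    "site": ["retailer_site1", "retailer_siteY"],
    # The trailing slash stands in for any punctuation difference.
    "link": ["https://www.retailer_site1.com/",
             "https://www.retailer_siteY.com"],
})


def normalize(links):
    """Drop every non-word character so only letters, digits and '_' remain."""
    return links.astype(str).str.replace(r"\W+", "", regex=True)


# One boolean per row of df2: is its normalized link present in df1?
df2["in_df1"] = normalize(df2["link"]).isin(normalize(df1["link"]))

# Keep only the df2 rows whose link was found in df1.
matches = df2[df2["in_df1"]]
print(matches["site"].tolist())  # ['retailer_site1']
```

Because this drops all punctuation, two different URLs can normalize to the same string (for example, paths ending in `b-c` and `bc`), so it is worth spot-checking the matches.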

