I have two pandas dataframes like so:
df1
site | link |
---|---|
retailer_site1 | https://www.retailer_site1.com |
... | ... |
retailer_siteX | https://www.retailer_siteX.com |
df2
site | link |
---|---|
retailer_site1 | https://www.retailer_site1.com |
... | ... |
retailer_siteY | https://www.retailer_siteY.com |
So I want to go through df2 and find instances of links from df2 in df1. Here's my code:
for row in df2['link'].astype(str):
    boolean_findings = df1['link'].str.contains(row)
When I print boolean_findings, I get all False, which I know can't be right because I can see matches in my local Excel files:
boolean_findings |
---|
False |
False |
... |
False |
What I want to know is why the hyperlink string text is not being matched with its equivalent in the first df, and what I can do to match the sites.
I took a look and noticed some websites have a ( and ) included in their links, which might be throwing off the matching.
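To illustrate why parentheses could matter here: `str.contains` treats its argument as a regular expression by default, so `(` and `)` are read as a group rather than as literal characters. The sketch below uses a made-up link (not one from the original data) and shows two possible ways to force a literal match, `regex=False` and `re.escape`.

```python
import re
import warnings

import pandas as pd

# Hypothetical links, for illustration only (not from the original data)
links = pd.Series([
    "https://www.example.com/item(1)",
    "https://www.example.com/other",
])
needle = "https://www.example.com/item(1)"

# Default regex=True: "(1)" is a regex group, so the pattern looks for
# "item1" and never matches the literal text "item(1)".
with warnings.catch_warnings():
    # pandas warns that the pattern contains match groups; silenced here
    warnings.simplefilter("ignore", UserWarning)
    as_regex = links.str.contains(needle)

# Literal matching: either turn regex off or escape the pattern
as_literal = links.str.contains(needle, regex=False)
as_escaped = links.str.contains(re.escape(needle))

print(as_regex.tolist())
print(as_literal.tolist())
print(as_escaped.tolist())
```

Note also that the loop in the question reassigns boolean_findings on every iteration, so printing it after the loop only shows the result for the last link in df2.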
It seems you only need to account for alphanumeric and underscore characters when comparing the links. You can use:
df2["link"].str.replace(r'\W+','', regex=True).isin(
df1["link"].str.replace(r'\W+','', regex=True))
The .str.replace(r'\W+','', regex=True) part removes every character other than letters, diacritics, digits and connector punctuation (the underscore being the most common of these) from the links.
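Here is a minimal end-to-end sketch of that approach. The two frames, the `normalize` helper and the `in_df1` column are made up for illustration; the trailing space and the parentheses only stand in for the kind of stray characters that might come from a spreadsheet export.

```python
import pandas as pd

# Hypothetical data shaped like the frames in the question
df1 = pd.DataFrame({
    "site": ["retailer_site1", "retailer_site2"],
    "link": [
        "https://www.retailer_site1.com",
        "https://www.retailer_site2.com/(sale)",
    ],
})
df2 = pd.DataFrame({
    "site": ["retailer_site1", "retailer_site2", "retailer_site3"],
    "link": [
        "https://www.retailer_site1.com ",  # trailing space
        "https://www.retailer_site2.com/(sale)",
        "https://www.retailer_site3.com",
    ],
})


def normalize(links: pd.Series) -> pd.Series:
    """Drop every non-word character so only letters, digits and _ remain."""
    return links.astype(str).str.replace(r"\W+", "", regex=True)


# One boolean per row of df2: does its normalized link appear in df1?
df2["in_df1"] = normalize(df2["link"]).isin(normalize(df1["link"]))

# Keep only the rows of df2 whose link was found in df1
matches = df2[df2["in_df1"]]
print(matches)
```

One caveat with this design: stripping all non-word characters can make two different URLs compare equal (for example, paths that differ only by a hyphen or a slash), so it trades some precision for tolerance of stray characters.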