How to isolate part of a link in BS4?

Question

As a starter project I'm using BS4 to identify sites that use WordPress.

I'm having trouble getting the identifier right. I know that WordPress sites have /wp-content/ links in the html. But I'm not isolating them correctly.

Here are some link examples:

img src="https://variety.com/wp-content/...
href="https://variety.com/wp-content/...

I've been playing with lots of variations on:

find_wordpress = soup.find('a', href = "wp-content")

But I'm not getting it right. The domain will change so I just need to isolate the /wp-content/ part.

Any suggestions? Thanks!

Answer 1

If you only want to identify WordPress, a plain substring check on the raw HTML may be enough

"/wp-content/" in html

but it can be misleading when /wp-content/ appears somewhere other than a link, for example in visible text.

html = '''<a href="http://one.com/wp-content/1"></a>'''

result = ('/wp-content/' in html)
print('result 1:', result)

html = '''<a href="http://one.com/2"></a>'''

result = ('/wp-content/' in html)
print('result 2:', result)
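To illustrate the "misleading" case, here is a small sketch. The sample HTML is made up, and it assumes the built-in html.parser. The raw substring check reports a hit because the text mentions /wp-content/, while a check limited to href attributes does not.

```python
from bs4 import BeautifulSoup
import re

# Made-up sample: /wp-content/ appears in visible text, not in any link
html = ('<p>Our old images lived under /wp-content/ before the migration.</p>'
        '<a href="http://one.com/2"></a>')

# Substring check on the raw HTML: matches the paragraph text
substring_hit = '/wp-content/' in html

# Attribute check: only looks at the href of <a> tags
soup = BeautifulSoup(html, 'html.parser')
link_hit = soup.find('a', href=re.compile(r'/wp-content/')) is not None

print('substring:', substring_hit)
print('href only:', link_hit)
```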

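If you need to check the href attribute specifically, you can pass a compiled regex

soup.find('a', href=re.compile(r'.*/wp-content/.*'))

or simply the following, because BS4 matches a regex with search(), so the leading and trailing .* are not needed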

soup.find('a', href=re.compile(r'/wp-content/'))
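As far as I can tell, the attempt in the question fails because a plain string passed as href is compared against the whole attribute value, while a compiled regex only has to match somewhere inside it. A minimal sketch of the difference, reusing the sample link from this answer and assuming the built-in html.parser:

```python
from bs4 import BeautifulSoup
import re

html = '<a href="http://one.com/wp-content/1"></a>'
soup = BeautifulSoup(html, 'html.parser')

# Plain string: must equal the entire href value, so nothing is found
exact = soup.find('a', href='wp-content')

# Compiled regex: matched with search(), so a partial match is enough
partial = soup.find('a', href=re.compile(r'/wp-content/'))

print('exact  :', exact)
print('partial:', partial)
```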

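Or you can use a function. BS4 calls it with the attribute value, which is None for tags that have no href, so guard against that

def test_link(link):
    # link is None when the tag has no href attribute
    return link is not None and '/wp-content/' in link

result = soup.find('a', href=test_link)

or the same with a lambda

soup.find('a', href=lambda link: link is not None and '/wp-content/' in link)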

Full working example

from bs4 import BeautifulSoup
import re

html1 = '''<a href="http://one.com/wp-content/1"></a>'''
html2 = '''<a href="http://one.com/2"></a>'''

# 1. plain substring check on the raw HTML
result = ('/wp-content/' in html1)
print('1:', result)

result = ('/wp-content/' in html2)
print('2:', result)

# 'lxml' needs the lxml package installed; the built-in 'html.parser' also works here
soup1 = BeautifulSoup(html1, 'lxml')
soup2 = BeautifulSoup(html2, 'lxml')

# 2. regex on the href attribute
result = soup1.find('a', href=re.compile(r'/wp-content/'))
print('1:', result, '-->', (result is not None))
result = soup2.find('a', href=re.compile(r'/wp-content/'))
print('2:', result, '-->', (result is not None))

# 3. function on the href attribute (link is None for tags without href)
def test_link(link):
    return link is not None and '/wp-content/' in link

result = soup1.find('a', href=test_link)
print('1:', result, '-->', (result is not None))
result = soup2.find('a', href=test_link)
print('2:', result, '-->', (result is not None))
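The question also shows /wp-content/ inside an img src, which the 'a' + href searches above would miss. One possible way to cover both attributes is a CSS attribute-substring selector via select_one(). This is only a sketch: the helper name looks_like_wordpress and the sample HTML are my own, and it assumes the built-in html.parser.

```python
from bs4 import BeautifulSoup

def looks_like_wordpress(html):
    """Hypothetical helper: True if any href or src contains /wp-content/."""
    soup = BeautifulSoup(html, 'html.parser')
    # [attr*="value"] matches when the attribute value contains the substring
    selector = '[href*="/wp-content/"], [src*="/wp-content/"]'
    return soup.select_one(selector) is not None

# Made-up samples: an image under /wp-content/ and an unrelated link
with_img = '<img src="http://one.com/wp-content/1.png">'
without = '<a href="http://one.com/2"></a>'

print('with img:', looks_like_wordpress(with_img))
print('without :', looks_like_wordpress(without))
```

Because the selector does not name a tag, it also picks up link and script tags that point into /wp-content/.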

solution1
1 ACCEPTED 2020-09-11 05:41:59
