As a starter project I'm using BS4 to identify sites that use WordPress.
I'm having trouble getting the identifier right. I know that WordPress sites have /wp-content/ links in the HTML, but I'm not isolating them correctly.
Here are some link examples:
img src="https://variety.com/wp-content/...
href="https://variety.com/wp-content/...
I've been playing with lots of variations on:
find_wordpress = soup.find('a', href = "wp-content")
But I'm not getting it right. The domain will change so I just need to isolate the /wp-content/ part.
Any suggestions? Thanks!
If you only want to identify WordPress sites, you could simply test whether the string appears anywhere in the HTML:
"/wp-content/" in html
This can be misleading, though, if /wp-content/ appears in some other text rather than in a link.
html = '''<a href="http://one.com/wp-content/1"></a>'''
result = ('/wp-content/' in html)
print('result 1:', result)
html = '''<a href="http://one.com/2"></a>'''
result = ('/wp-content/' in html)
print('result 2:', result)
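To see how the substring test can mislead, here is a small sketch. The sample HTML is made up for illustration: it mentions /wp-content/ only in visible text, so the plain string check reports a hit, while a search restricted to href attributes (here with a regular expression) finds nothing.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page: '/wp-content/' appears only in visible text, not in any link.
html_text_only = (
    '<p>Uploads live under /wp-content/ on WordPress.</p>'
    '<a href="http://one.com/2"></a>'
)

# Plain substring test on the raw HTML.
substring_hit = '/wp-content/' in html_text_only

# Search only the href attribute of <a> tags.
soup = BeautifulSoup(html_text_only, 'html.parser')
link_hit = soup.find('a', href=re.compile(r'/wp-content/')) is not None

print('substring check:', substring_hit)
print('href check:', link_hit)
```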
If you need to check the href attribute specifically, you can pass a compiled regular expression (this needs import re):
soup.find('a', href=re.compile(r'.*/wp-content/.*'))
or even
soup.find('a', href=re.compile(r'/wp-content/'))
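Or you can use a function. It can be called with None for a tag that has no href, so guard against that:
def test_link(link):
    return link is not None and '/wp-content/' in link
result = soup.find('a', href=test_link)
or the same with a lambda:
soup.find('a', href=lambda link: link is not None and '/wp-content/' in link)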
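The examples in the question include an img src as well as an href, so checking only a tags may miss some pages. A minimal sketch of one way to cover both, assuming it is enough to look at the href and src attributes of any tag; the helper name has_wp_content and the example.com URLs are made up for illustration:

```python
from bs4 import BeautifulSoup

def has_wp_content(tag):
    # Hypothetical helper: True if the tag's href or src contains '/wp-content/'.
    for attr in ('href', 'src'):
        value = tag.get(attr)  # None when the tag lacks this attribute
        if value and '/wp-content/' in value:
            return True
    return False

# Made-up sample: only the <img> points into /wp-content/.
html = '''
<img src="https://example.com/wp-content/uploads/logo.png">
<a href="https://example.com/about"></a>
'''

soup = BeautifulSoup(html, 'html.parser')

# When find() is given a function as its first argument,
# the function receives each whole tag, not an attribute value.
first_match = soup.find(has_wp_content)
print(first_match)
```

Here the function is passed as the tag filter, so it receives each whole tag and can inspect several attributes at once.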
Full working example:
import re
from bs4 import BeautifulSoup

html1 = '''<a href="http://one.com/wp-content/1"></a>'''
html2 = '''<a href="http://one.com/2"></a>'''

# 1. plain substring test on the raw HTML
result = ('/wp-content/' in html1)
print('1:', result)
result = ('/wp-content/' in html2)
print('2:', result)

soup1 = BeautifulSoup(html1, 'lxml')
soup2 = BeautifulSoup(html2, 'lxml')

# 2. regex matched against the href attribute
result = soup1.find('a', href=re.compile(r'/wp-content/'))
print('1:', result, '-->', (result is not None))
result = soup2.find('a', href=re.compile(r'/wp-content/'))
print('2:', result, '-->', (result is not None))

# 3. function matched against the href attribute
# (link can be None when a tag has no href, hence the guard)
result = soup1.find('a', href=lambda link: link is not None and '/wp-content/' in link)
print('1:', result, '-->', (result is not None))
result = soup2.find('a', href=lambda link: link is not None and '/wp-content/' in link)
print('2:', result, '-->', (result is not None))
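Another option, not part of the answer above, is a CSS attribute selector. The sketch below assumes a Beautiful Soup version whose select_one() supports the *= (substring) attribute selector; the function name looks_like_wordpress is made up for illustration.

```python
from bs4 import BeautifulSoup

def looks_like_wordpress(html):
    # Hypothetical helper: True if any tag has an href or src
    # attribute containing '/wp-content/'.
    soup = BeautifulSoup(html, 'html.parser')
    selector = '[href*="/wp-content/"], [src*="/wp-content/"]'
    return soup.select_one(selector) is not None

html1 = '''<a href="http://one.com/wp-content/1"></a>'''
html2 = '''<a href="http://one.com/2"></a>'''

print('1:', looks_like_wordpress(html1))
print('2:', looks_like_wordpress(html2))
```

Keep in mind that this is only a heuristic: a site can reference /wp-content/ on another domain, or run WordPress without exposing that path.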