I'm trying to use BeautifulSoup to process HTML pulled from websites. I've created a class, 'Websites', with a couple of methods that parse the HTML based on instance variables such as header and class, which identify my target bit of text, e.g.:
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs

class Websites:
    def __init__(self, url, header, class_):
        self.url = url
        self.header = header
        self.class_ = class_

    def html(self):
        # Download the page and parse it into a BeautifulSoup object
        url = self.url
        webpage = urlopen(url)
        page_html = webpage.read()
        webpage.close()
        page_soup = bs(page_html, 'html.parser')
        return page_soup
It's been simple to turn those variables (header, class) into instance variables, but there is one that I'm struggling to convert. I believe in BeautifulSoup lingo it's referred to as the 'tag'. If I call the html method shown above on an instance of the class, I get a parsed block of HTML that I can save as a variable (page_soup), to which I can then add a tag, e.g.:
page_soup.div.h1.p
This specifies the exact part of the HTML that I want to access. Is there any way I could modify the class's __init__ method shown above so that it could take an input, e.g.:
amazon = Websites(url = 'Amazon.co.uk', tag = '.div.h1.p')
and use it as an instance variable in a class method, as self.tag?
Accessing a tag as an attribute in that way (for example page_soup.div) is the same as calling BeautifulSoup's find() function with that tag name, which returns the first matching tag. So you could write your own function to emulate this approach as follows:
from bs4 import BeautifulSoup

def get_tag(tag, text_attr):
    # Walk the dotted path, calling find() once per tag name
    for attr in text_attr.split('.'):
        if attr:  # skip the empty string produced by the leading '.'
            tag = tag.find(attr)
    return tag

html = """<html><h2>test1</h2><div><h1>test2<p>display this</p></h1></div></html>"""
soup = BeautifulSoup(html, "html.parser")

print(soup.div.h1.p)
print(get_tag(soup, '.div.h1.p'))
This would display:
<p>display this</p>
<p>display this</p>
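To tie this back to the class in the question, here is a minimal sketch of how the dotted path could be stored as self.tag and used from a method. The method name target(), the example.com URL, and the trimmed-down __init__ (header and class_ are left out for brevity) are assumptions made for illustration, not part of the original code. The method takes an already parsed soup so the example runs without a network request; in the real class it could just as well call self.html() itself. This version of get_tag also returns None when part of the path is missing, rather than raising an AttributeError.

```python
from bs4 import BeautifulSoup


def get_tag(tag, tag_path):
    """Follow a dotted path such as '.div.h1.p', calling find() at each step."""
    for name in tag_path.split('.'):
        if not name:
            continue  # skip the empty piece produced by a leading dot
        if tag is None:
            return None  # an earlier step found nothing, so stop here
        tag = tag.find(name)
    return tag


class Websites:
    def __init__(self, url, tag):
        self.url = url
        self.tag = tag  # dotted tag path, e.g. '.div.h1.p'

    def target(self, page_soup):
        # Hypothetical helper: resolve self.tag against a parsed page
        return get_tag(page_soup, self.tag)


html = """<html><h2>test1</h2><div><h1>test2<p>display this</p></h1></div></html>"""
soup = BeautifulSoup(html, "html.parser")

site = Websites(url='https://example.com', tag='.div.h1.p')
print(site.target(soup))
```

With the sample HTML above, site.target(soup) resolves to the same element as soup.div.h1.p.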
An alternative approach would be to use the .select() function, which returns a list of matching tags:
print(soup.select('div > h1 > p')[0])
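If you go the selector route, the CSS selector string itself can be the instance variable. The following is a rough sketch under that assumption; the class name SelectorSite, the method name target(), and the example.com URL are made up for illustration. It uses select_one(), which returns the first match or None, so there is no need to index into a list.

```python
from bs4 import BeautifulSoup


class SelectorSite:
    def __init__(self, url, selector):
        self.url = url
        self.selector = selector  # CSS selector, e.g. 'div > h1 > p'

    def target(self, page_soup):
        # select_one() returns the first matching tag, or None if nothing matches
        return page_soup.select_one(self.selector)


html = """<html><h2>test1</h2><div><h1>test2<p>display this</p></h1></div></html>"""
soup = BeautifulSoup(html, "html.parser")

site = SelectorSite(url='https://example.com', selector='div > h1 > p')
print(site.target(soup))
```

Note that 'div > h1 > p' only matches direct children, whereas find() searches all descendants, so a descendant selector such as 'div h1 p' may be the closer fit if the real page has extra wrapper elements in between.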