
How can I store an HTML 'path' (tag) from BeautifulSoup as a class instance variable in Python?

I'm trying to use BeautifulSoup to process HTML pulled from websites. I've created a class 'Websites' with a couple of methods that parse the HTML based on instance variables such as header, class, etc. to find my target bit of text, e.g.:

from urllib.request import urlopen

from bs4 import BeautifulSoup as bs


class Websites:

    def __init__(self, url, header, class_):
        self.url = url
        self.header = header
        self.class_ = class_

    def html(self):
        # Download the page and parse it into a BeautifulSoup object.
        with urlopen(self.url) as webpage:
            page_html = webpage.read()
        return bs(page_html, 'html.parser')

It's been simple to turn those values (header, class) into instance variables, but there is one that I'm struggling with. I believe in BeautifulSoup lingo it's referred to as the 'tag'. If I call the html method shown above on an instance of the class, I get the parsed HTML, which I can save as a variable (page_soup) and then navigate by tag name, e.g. like this:

page_soup.div.h1.p

This specifies the exact part of the HTML that I want to access. Is there any way I could modify the class's __init__ method shown above so that it could take an input, e.g.:

amazon = Websites(url = 'Amazon.co.uk', tag = '.div.h1.p')

and use it as an instance variable in a class method, as self.tag?

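Accessing a tag by name in that way (e.g. soup.div) is the same as calling BeautifulSoup's find() method (soup.find('div')), which returns the first matching tag. A dotted path is therefore just a chain of find() calls, so you could write your own function to emulate this approach as follows: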

from bs4 import BeautifulSoup

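def get_tag(tag, text_attr):
    # Walk the dotted path one name at a time, e.g. '.div.h1.p'.
    for attr in text_attr.split('.'):
        if attr:  # skip the empty string produced by a leading '.'
            tag = tag.find(attr)
            if tag is None:  # no match at this step, so stop early
                return None

    return tag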


html = """<html><h2>test1</h2><div><h1>test2<p>display this</p></h1></div></html>"""
soup = BeautifulSoup(html, "html.parser")

print(soup.div.h1.p)
print(get_tag(soup, '.div.h1.p'))

This would display:

<p>display this</p>
<p>display this</p>
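
To tie this back to the class in the question, here is a minimal sketch of how the dotted path might be stored on the instance and applied in a method. The names tag_path and target() are made up for illustration and are not part of BeautifulSoup, and target() accepts an already parsed soup so the example can run without fetching anything:

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


class Websites:
    def __init__(self, url, tag_path):
        self.url = url
        self.tag_path = tag_path  # hypothetical name, e.g. '.div.h1.p'

    def html(self):
        # Download and parse the page (needs a full URL including the scheme).
        with urlopen(self.url) as webpage:
            return BeautifulSoup(webpage.read(), "html.parser")

    def target(self, soup=None):
        # Use the supplied soup if there is one, otherwise fetch the page.
        node = soup if soup is not None else self.html()
        for name in self.tag_path.split("."):
            if not name:  # empty string from a leading '.'
                continue
            node = node.find(name)
            if node is None:  # path does not exist in this document
                return None
        return node


sample = BeautifulSoup(
    "<html><div><h1>test2<p>display this</p></h1></div></html>", "html.parser"
)
site = Websites(url="https://example.com", tag_path=".div.h1.p")
result = site.target(sample)
print(result)
```

Calling site.target() with no argument would download self.url first; passing a soup in keeps the lookup logic separate from the network call, which also makes it easier to test.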

An alternative approach would be to use the .select() method, which takes a CSS selector and returns a list of all matching tags:

print(soup.select('div > h1 > p')[0])
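
If only the first match is needed, select_one() may be more convenient, since it returns a single tag or None instead of a list. The sketch below also shows one possible way to turn the dotted path from the question into a selector. The helper path_to_selector is hypothetical, and the descendant selector it builds is only an approximation of chained find() calls: the two can pick different elements when the first div on a page does not contain the h1.

```python
from bs4 import BeautifulSoup

html = "<html><h2>test1</h2><div><h1>test2<p>display this</p></h1></div></html>"
soup = BeautifulSoup(html, "html.parser")


def path_to_selector(path):
    """Turn '.div.h1.p' into the descendant selector 'div h1 p'."""
    return " ".join(part for part in path.split(".") if part)


selector = path_to_selector(".div.h1.p")
first = soup.select_one(selector)      # first match, or None
missing = soup.select_one("div > h2")  # h2 is not a child of div here

print(selector)
print(first)
print(missing)
```

Storing a CSS selector string on the instance instead of a dotted path would give more flexibility, for example matching on classes or ids as well as tag names.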
