
BeautifulSoup and Python Remove HTML Tags

I need help stripping HTML tags from the results of my script. I want to put the results in an object that I can convert to JSON. When I print the object, everything works except that I can't extract just the text without the HTML tags. I've searched this site for answers and tried various ways to remove the tags, but I'm not sure what I'm doing wrong. I appreciate any help.

Based on some things I've read here, I tried printing teamObject.text but that doesn't work.

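import requests
from bs4 import BeautifulSoup

def make_soup(url):
    # Download the page and parse it with the built-in HTML parser
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    return soup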

soup = make_soup("team.html")
for record in soup.findAll('tr'):
    teamObject = {"name": record.find('a'),"description": record.find('p')}
    print (teamObject)

I expect to see the results in object form, without HTML tags.

Updating per comments:

The result I currently get from running the code above is:

{'name': <a href="/team/001"> Team 1 </a>, 'description': <p><a href="/team/001">Team 1</a> is a team does cool things.</p>}

Updating the code to include .text:

def make_soup(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    return soup


soup = make_soup("team.html")
for record in soup.findAll('tr'):
    teamObject = {
        "name": record.find('a').text,
        "description": record.find('p').text
        }
    print (teamObject)

I get this result:

"name": record.find('a').text,
AttributeError: 'NoneType' object has no attribute 'text'

I expect to see just the text without html tags.

Try using .text on the find results of each record in your loop.

for record in soup.findAll('tr'):
    teamObject = {
        "name": record.find('a').text,
        "description": record.find('p').text
        }

.text calls .get_text() under the hood, so this is similar to the suggestion in the comments, but I think you want the text of each find() result in your loop.

You could call get_text() directly if you need to pass in arguments for formatting. See the docs.
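
As a minimal sketch of what those formatting arguments do, the snippet below runs .text and get_text() on the two fragments shown in the question's output. The variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Markup copied from the output printed in the question.
a = BeautifulSoup('<a href="/team/001"> Team 1 </a>', "html.parser").a
p = BeautifulSoup(
    '<p><a href="/team/001">Team 1</a> is a team does cool things.</p>',
    "html.parser",
).p

raw = a.text                      # keeps the surrounding spaces
clean = a.get_text(strip=True)    # strips whitespace from each text piece
piped = p.get_text(" | ", strip=True)  # joins the pieces with a separator

print(repr(raw))    # ' Team 1 '
print(repr(clean))  # 'Team 1'
print(piped)        # Team 1 | is a team does cool things.
```

Note that strip=True without a separator glues nested pieces together, so passing a separator such as " " is usually safer for text that contains inner tags.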

Edit:

Receiving a NoneType error tells me you have some <tr> tags that don't contain an <a> or <p> tag. If record.find can't find a match in a given row, it returns None, and None has no .text attribute.

You could solve this with extra logic or by reevaluating how you approach the search. The hacky way is to check that you have the tags you need before reading the text.
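
To see which rows trigger the error, something like the following can help. The table here is an assumption, since the real team.html isn't shown: it pairs a hypothetical header row with the one data row from the question's output.

```python
from bs4 import BeautifulSoup

# Hypothetical table: a header row without links, plus the row from the question.
html = """
<table>
  <tr><th>Team</th><th>Description</th></tr>
  <tr>
    <td><a href="/team/001"> Team 1 </a></td>
    <td><p><a href="/team/001">Team 1</a> is a team does cool things.</p></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# find() returns None when a row has no matching tag.
missing = [
    index for index, row in enumerate(rows)
    if row.find("a") is None or row.find("p") is None
]
print(missing)  # [0] -> only the header row lacks <a>/<p>
```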

for record in soup.findAll('tr'):
    if record.a and record.p:
        teamObject = {
            "name": record.find('a').text,
            "description": record.find('p').text
            }

This ensures you won't hit the NoneType error, but you'll now entirely skip any row that's missing either an <a> or a <p> tag, so beware.

If you're confident that relevant rows will always have <a> and <p> tags, you could narrow your search by only selecting rows that contain the text "Team", which excludes any bad <tr> entries.
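
If skipping rows isn't acceptable, one alternative is to keep every row and store None for whatever is missing. This is only a sketch: text_or_none is a hypothetical helper, and the sample table (with a header row) is an assumption about what team.html might look like. It also shows the conversion to JSON that the question is ultimately after.

```python
import json
from bs4 import BeautifulSoup

def text_or_none(tag):
    """Return the tag's whitespace-normalised text, or None if find() found nothing."""
    return tag.get_text(" ", strip=True) if tag is not None else None

# Hypothetical table: a header row without links, plus the row from the question.
html = """
<table>
  <tr><th>Team</th><th>Description</th></tr>
  <tr>
    <td><a href="/team/001"> Team 1 </a></td>
    <td><p><a href="/team/001">Team 1</a> is a team does cool things.</p></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
teams = [
    {
        "name": text_or_none(row.find("a")),
        "description": text_or_none(row.find("p")),
    }
    for row in soup.find_all("tr")
]

# None values are serialised as JSON null.
print(json.dumps(teams, indent=2))
```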

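# ":contains" is the older, deprecated spelling of this selector;
# soupsieve 2.1 and later call it ":-soup-contains".
for record in soup.select('tr:-soup-contains("Team")'):
    teamObject = {
        "name": record.find('a').text,
        "description": record.find('p').text
        }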
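
A text filter can still match a row that mentions "Team" but has no <a> or <p>, such as a header cell. As a hedged alternative, the sketch below selects rows by structure with the CSS :has() pseudo-class, which the select() method supports. The sample table is again an assumption, not the real team.html.

```python
from bs4 import BeautifulSoup

# Hypothetical table: the header row contains the word "Team" but no <a>/<p>.
html = """
<table>
  <tr><th>Team</th><th>Description</th></tr>
  <tr>
    <td><a href="/team/001"> Team 1 </a></td>
    <td><p><a href="/team/001">Team 1</a> is a team does cool things.</p></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only rows that contain both an <a> and a <p> somewhere inside them.
team_rows = soup.select("tr:has(a):has(p)")
names = [row.find("a").get_text(strip=True) for row in team_rows]
print(names)  # ['Team 1']
```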


 