
How to remove HTML tags from text using Python?

I am new to Python and I am trying to create a simple script that prints out the word of the day from Urban Dictionary.

    import requests
    from bs4 import BeautifulSoup

    # requests urban dictionary home page 
    r = requests.get('https://www.urbandictionary.com')

    soup = BeautifulSoup(r.text, 'html.parser')

    # finds the title
    title = soup.find('title').text

    print(title)

    # finds the definition
    definition = soup.find('meta', attrs={'property': 'og:description'})

    print(definition)

I use ".text" on the title to get rid of the HTML tags and it works, but when I try to use it on the definition, all of the text disappears. So, at the moment, the definition prints out with the HTML tags. What are some other ways besides ".text" to remove the HTML tags? When I try to paste the output here, part of it doesn't show up, so here is a picture of the output.

This is my first time posting on here so I'm sorry if I didn't format my question correctly but any help would be greatly appreciated.

... when I try to use [the text property] on the definition all of the text disappears...

This is because the tag you're targeting looks like this:

    <meta content="foo bar baz..." name="Description" property="og:description">

When you access the text property on this object in Beautiful Soup, you get an empty string, because the element has no child text. The definition is stored in the "content" attribute instead, which you can read with square-bracket, dictionary-style notation:

    definition['content']

This feature is documented in the Attributes section of the Beautiful Soup documentation.
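
To see the difference between the text property and attribute access without hitting the network, here is a minimal sketch. The HTML string is a made-up stand-in for the real page (only the meta tag shape comes from the example above), and the variable names are just for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the example runs offline.
html = """
<html><head>
<title>Urban Dictionary: example</title>
<meta content="foo bar baz" name="Description" property="og:description">
</head><body></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

meta = soup.find("meta", attrs={"property": "og:description"})

# A <meta> tag has no child text, so .text is an empty string.
empty_text = meta.text

# The definition lives in the "content" attribute.
definition = meta["content"]

# .get() returns None instead of raising KeyError when an attribute is absent.
missing = meta.get("nonexistent")

print(repr(empty_text))
print(definition)
```

If the tag itself might not be on the page, soup.find returns None, so it is probably worth checking for that before indexing into the result.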

