简体   繁体   中英

Beautiful Soup HTML parsing anamoly

I am trying to get the text out of certain class in a HTML using beautiful soup. I have successfully got the texts but, there are some anomalies(unrecognisable characters) in it like shown in the image below. How can I solve it with a python code instead of manually deleting these anomalies. 在此处输入图片说明

Code:

    try:
        html =requests.get(url)
    except:
        print("no conection")
    try:
        soup = BS(html.text,'html.parser')
    except:
        print("pasre error")
    print(soup.find('div',{'class':'_3WlLe clearfix'}).get_text())

When you access html.text , Requests tries to determine the character encoding so it can properly decode the raw bytes it received from the server. The content-type header that timesofindia sent is text/html; charset=iso-8859-1 text/html; charset=iso-8859-1 , which is what Requests went with. The character encoding is almost certainly utf-8 .

You can fix this by either setting the encoding of html to utf-8 before accessing html.text :

    try:
        html =requests.get(url)
        html.encoding = 'utf-8'
    except:
        print("no conection")
    try:
        soup = BS(html.text,'html.parser')
    except:
        print("pasre error")
    print(soup.find('div',{'class':'_3WlLe clearfix'}).get_text())

or decode html.content as utf-8 , and pass that into BS instead of html.text :

    try:
        html =requests.get(url)
    except:
        print("no conection")
    try:
        soup = BS(html.content.decode('utf-8'),'html.parser')
    except:
        print("pasre error")
    print(soup.find('div',{'class':'_3WlLe clearfix'}).get_text())

I would highly recommend you learn about character encoding and Unicode. It is very easy to get tripped up by it. We've all been there.

Characters, Symbols and the Unicode Miracle - Computerphile by Tom Scott and Sean Riley

What every programmer absolutely, positively needs to know about encodings and character sets to work with text by David C. Zentgraf

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM