
Wrong accented characters using Beautiful Soup in Python on a local HTML file

I'm quite familiar with Beautiful Soup in Python, but I have always used it to scrape live sites.

Now I'm scraping a local HTML file (link, in case you want to test the code). The only problem is that accented characters are not displayed correctly (this never happened to me when scraping live sites).

This is a simplified version of the code:

import requests, urllib.request, time, unicodedata, csv
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('AH.html'), "html.parser")
tables = soup.find_all('table')
titles = tables[0].find_all('tr')
print(titles[55].text)

which prints the following output:

2:22 - Il Destino È Già Scritto (2017 ITA/ENG) [1080p] [BLUWORLD]

while the correct output should be:

2:22 - Il Destino È Già Scritto (2017 ITA/ENG) [1080p] [BLUWORLD]


I looked for a solution, read many questions and answers, and found this answer, which I implemented in the following way:

import requests, urllib.request, time, unicodedata, csv
from bs4 import BeautifulSoup
import codecs

response = open('AH.html')
content = response.read()
html = codecs.decode(content, 'utf-8')
soup = BeautifulSoup(html, "html.parser")

However, it raises the following error:

Traceback (most recent call last):
  File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
TypeError: a bytes-like object is required, not 'str'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\user\Desktop\score.py", line 8, in <module>
    html = codecs.decode(content, 'utf-8')
TypeError: decoding with 'utf-8' codec failed (TypeError: a bytes-like object is required, not 'str')

I guess the problem is easy to solve, but how do I do it?

from bs4 import BeautifulSoup


with open("AH.html") as f:
    soup = BeautifulSoup(f, 'html.parser')
    tb = soup.find("table")
    for item in tb.find_all("tr")[55]:
        print(item.text)


I have to say that your first code is actually fine and should work.

As for the second code, you are trying to decode a str, which cannot work: decoding only applies to bytes objects. Opening the file in text mode already returns str, so there is nothing left to decode.

I believe that you are using Windows, where the default encoding is cp1252, not UTF-8.
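The difference between str and bytes can be sketched as follows. This is only an illustration with a made-up string, not the contents of AH.html:

```python
import codecs

text = "Già"                  # str: text that is already decoded
raw = text.encode("utf-8")    # bytes: b'Gi\xc3\xa0'

# bytes -> str is what decoding is for
decoded = raw.decode("utf-8")

# A str has nothing left to decode, so codecs.decode raises TypeError,
# which is the error shown in the traceback above.
try:
    codecs.decode(text, "utf-8")
    error = None
except TypeError as exc:
    error = exc

print(decoded)
print(type(error).__name__)
```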
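To see why a cp1252 default would produce exactly the output in the question, here is a minimal sketch that encodes part of the title as UTF-8 and then decodes those bytes as cp1252. The variable names are illustrative only:

```python
title = "Il Destino È Già Scritto"

# What is stored on disk when the file is saved as UTF-8
utf8_bytes = title.encode("utf-8")

# What you get when those bytes are read as cp1252 instead
garbled = utf8_bytes.decode("cp1252")

# The damage is reversible as long as every byte maps to a cp1252 character
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(repaired)
```

Each accented character becomes two characters because UTF-8 stores it as two bytes, and cp1252 treats every byte as a separate character.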

Could you please run the following code:

import sys

print(sys.getdefaultencoding())
print(sys.stdin.encoding)
print(sys.stdout.encoding)
print(sys.stderr.encoding)

Then check whether the output is UTF-8 or cp1252.

Note that if you are using VSCode with Code Runner, run your code in the terminal instead, as py code.py
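One more value is worth checking alongside those above. As far as I know, sys.getdefaultencoding() reports 'utf-8' on Python 3 regardless of platform, while the encoding that open() falls back to in text mode comes from the locale. A small sketch (the variable names are only illustrative):

```python
import locale
import sys

# Encoding used by open() in text mode when no encoding= argument is given
# (unless Python's UTF-8 mode is enabled).
preferred = locale.getpreferredencoding(False)

# Default string encoding of the Unicode implementation; 'utf-8' on Python 3.
default = sys.getdefaultencoding()

print("open() default:", preferred)
print("sys default:   ", default)
```

If the first line prints cp1252 on your machine, that would explain the garbled characters.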

SOLUTIONS (from the chat)

(1) If you are on Windows 10

  • Open Control Panel and set "View by" to Small icons
  • Click Region
  • Click the Administrative tab
  • Click on Change system locale...
  • Tick the box "Beta: Use Unicode UTF-8..."
  • Click OK and restart your pc

(2) If you are not on Windows 10, or just don't want to change that setting, then in the first code change open("AH.html") to open("AH.html", encoding="UTF-8"), that is, write:

from bs4 import BeautifulSoup

with open("AH.html", encoding="UTF-8") as f:
    soup = BeautifulSoup(f, 'html.parser')
    tb = soup.find("table")
    for item in tb.find_all("tr")[55]:
        print(item.text)
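The effect of the encoding argument can be checked without Beautiful Soup at all. The sketch below writes a small UTF-8 file to a temporary path (a stand-in for AH.html, with made-up content) and reads it back twice, once as cp1252 and once as UTF-8:

```python
import os
import tempfile

html = "<p>Il Destino È Già Scritto</p>"

# Create a small UTF-8 file standing in for AH.html (hypothetical content).
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".html", delete=False
) as tmp:
    tmp.write(html)
    path = tmp.name

try:
    # Simulates a Windows default of cp1252
    with open(path, encoding="cp1252") as f:
        wrong = f.read()
    # Explicit encoding that matches the file
    with open(path, encoding="utf-8") as f:
        right = f.read()
finally:
    os.remove(path)

print(wrong)
print(right)
```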

Using open('AH.html') decodes the file using a default encoding that may not be the encoding of the file. BeautifulSoup understands the HTML headers; specifically, the following content indicates that the file is UTF-8-encoded:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

Open the file in binary mode and let BeautifulSoup figure it out:

with open("AH.html","rb") as f:
    soup = BeautifulSoup(f, 'html.parser')
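A minimal sketch of that detection, assuming bs4 is installed and using made-up markup passed as bytes rather than the real AH.html. The original_encoding attribute reports which encoding Beautiful Soup settled on:

```python
from bs4 import BeautifulSoup

# Hypothetical document: UTF-8 bytes with a meta tag declaring the charset
raw = (
    "<html><head>"
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
    "</head><body><p>Il Destino È Già Scritto</p></body></html>"
).encode("utf-8")

# Given bytes, Beautiful Soup has to pick an encoding itself
soup = BeautifulSoup(raw, "html.parser")

detected = soup.original_encoding
text = soup.p.get_text()

print(detected)
print(text)
```

When the input is already a str, as with a file opened in text mode, there is nothing left to detect and original_encoding is None.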

Sometimes, websites set the encoding incorrectly. In that case you can specify the encoding yourself if you know what it should be.

with open("AH.html",encoding='utf8') as f:
    soup = BeautifulSoup(f, 'html.parser')
