I'm trying to scrape the following two pages using BeautifulSoup4.
Both have the same HTML structure. When I load the first webpage, it's all fine and I get this:
<!DOCTYPE html>
<html dir="rtl" lang="fa-IR">
<head>
<style id="litespeed-optm-css-rules">
...
But for the second webpage, the output is this:
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
�[s�Ƶ(��_���!��3�+ E:�|���lmI����.UИ��& ���!���p���'ە�����~��?1��̩� f0�\ q�
<u*q�"�f��v�[�^}��~|�����e����4� 94�,4�pf�cӗ��̣[="%��[iv*#��0�T:P�kŃ��rӴ�" c��gm_vv۾l�gz���_���yˏ�����8�qw��ȳԕ�:h����="" �@��;��tʳ�="" �h�:a�="" ��@fy="">
=���
Here is my Python code:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
url = 'https://30nama.kim/top/30nama-movie.html'
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")
print(page_soup.prettify())
I don't know what is going wrong with the second page or what these characters mean. I tried decoding it as UTF-8, but that didn't work. Any ideas?
BeautifulSoup uses its Unicode, Dammit module to guess the document's encoding, and that guess is not always correct.
I set the encoding manually and it worked:
page_soup = soup(webpage, "html.parser", from_encoding="ISO-8859-7")
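To see what from_encoding changes without hitting the network, here is a minimal sketch. The sample bytes are a made-up Greek snippet encoded as ISO-8859-7 (an assumption for illustration, not the actual page), and original_encoding is the attribute BeautifulSoup sets to the encoding it ended up using.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a tiny page encoded as ISO-8859-7 with no charset declared.
sample_text = "Καλημέρα κόσμε"
html_bytes = f"<html><body><p>{sample_text}</p></body></html>".encode("iso-8859-7")

# Let BeautifulSoup guess the encoding. The result depends on which
# detection libraries are installed, so the guess may be wrong.
guessed = BeautifulSoup(html_bytes, "html.parser")
print("guessed encoding:", guessed.original_encoding)

# Tell BeautifulSoup which encoding to try first instead of guessing.
forced = BeautifulSoup(html_bytes, "html.parser", from_encoding="iso-8859-7")
print("forced encoding:", forced.original_encoding)
print(forced.p.get_text())
```

Comparing the two original_encoding values is a quick way to check whether the automatic guess was the problem for a given page.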