Python BeautifulSoup web page decoding problem

Question

I'm trying to scrape the following two pages using BeautifulSoup 4.

Both have the same HTML structure. When I load the first page, everything is fine and I get this:

<!DOCTYPE html>
<html dir="rtl" lang="fa-IR">
 <head>
  <style id="litespeed-optm-css-rules">
   ...

But the output for the second page is this:

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
      �[s�Ƶ(��_���!��3�+ E:�|���lmI����.UИ��&amp; ���!���p���'ە�����~��?1��̩� f0�\ q�
<u*q�"�f��v�[�^}��~|�����e����4� 94�,4�pf�cӗ��̣[="%��[iv*#��0�T:P�kŃ��rӴ�" c��gm_vv۾l�gz���_���yˏ�����8�qw��ȳԕ�:h����="" �@��;��tʳ�="" �h�:a�="" ��@fy="">
 =��È�

Here is my Python code:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup

url = 'https://30nama.kim/top/30nama-movie.html'
# Send a browser-like User-Agent with the request
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()  # raw bytes, not yet decoded
page_soup = soup(webpage, "html.parser")
print(page_soup.prettify())

I don't know what is happening with the second page or what these characters mean. I thought I should try decoding it as UTF-8, but that didn't work. Any ideas?

Answer 1

BeautifulSoup uses Unicode, Dammit to detect the encoding, and that detection is not always correct.
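To see what a wrong guess looks like, here is a minimal standard-library sketch. The sample string and the choice of ISO-8859-7 are assumptions made for illustration only; they are not taken from the page in the question. Decoding bytes with a codec that does not match them, using errors="replace", yields the same U+FFFD replacement characters mentioned in the warning.

```python
# Illustrative sample text (an assumption, not content from the scraped page)
text = "Καλημέρα κόσμε"

# Pretend this is what the server sent: the text encoded as ISO-8859-7
raw = text.encode("iso-8859-7")

# Decoding with the wrong codec: invalid byte sequences become U+FFFD
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the codec the bytes were actually written in
right = raw.decode("iso-8859-7")

print(repr(wrong))
print(right)
```

Only the second decode round-trips back to the original string.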

I set the encoding manually and it worked:

page_soup = soup(webpage, "html.parser", from_encoding="ISO-8859-7")
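As a rough way to check which encoding BeautifulSoup ended up using, the sketch below parses a small in-memory document instead of the real page. The sample markup and the variable names are hypothetical. from_encoding and original_encoding are part of the bs4 API; whether ISO-8859-7 is the right choice for any particular site is something to verify rather than assume.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, encoded as ISO-8859-7 for the demonstration
text = "Καλημέρα"
raw = f"<html><body><p>{text}</p></body></html>".encode("iso-8859-7")

# Tell BeautifulSoup which encoding to use instead of letting it guess
soup = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-7")

# original_encoding reports the encoding that was used to decode the bytes
print(soup.original_encoding)

paragraph = soup.p.get_text()
print(paragraph)
```

If original_encoding shows something unexpected when from_encoding is omitted, that is a hint the automatic detection guessed wrong for that page.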

1 2020-04-08 19:04:54
