Beautiful Soup HTML parsing anomalies

Question

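I'm trying to use Beautiful Soup to get the text from a certain class in an HTML page. I've successfully retrieved the text, but it contains some anomalies (unrecognized characters). How can I fix this in Python code instead of removing the anomalies by hand?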

Code:

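    import requests
    from bs4 import BeautifulSoup as BS

    # url is assumed to hold the address of the page being scraped
    try:
        html = requests.get(url)
    except requests.RequestException:
        print("no connection")
        raise
    try:
        soup = BS(html.text, 'html.parser')
    except Exception:
        print("parse error")
        raise
    print(soup.find('div', {'class': '_3WlLe clearfix'}).get_text())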

Answer 1

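When you access html.text, Requests tries to determine the character encoding so that it can correctly decode the raw bytes received from the server. The content-type header sent by timesofindia is text/html; charset=iso-8859-1, and that is what Requests uses. The actual character encoding is almost certainly utf-8.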
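
To see why the wrong charset produces stray characters, here is a minimal sketch that needs no network access. The sample string is made up for illustration. It is encoded as UTF-8 and then decoded as ISO-8859-1, which is effectively what happens when the header is trusted.

```python
# Hypothetical sample text; any non-ASCII characters would do.
original = "“Smart quotes” and café"

raw_bytes = original.encode("utf-8")      # the bytes the server actually sends
garbled = raw_bytes.decode("iso-8859-1")  # decoding them with the declared charset

# ascii() escapes non-ASCII characters so the junk is visible on any console.
print(ascii(garbled))

# Undoing the wrong decode and applying the right one recovers the text.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == original)
```

Every byte is valid in ISO-8859-1, so the wrong decode never raises an error. It silently turns each multi-byte UTF-8 character into two or three unrelated characters.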

You can fix this by setting the encoding of html to utf-8 before accessing html.text:

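    import requests
    from bs4 import BeautifulSoup as BS

    try:
        html = requests.get(url)
    except requests.RequestException:
        print("no connection")
        raise
    # Override the charset from the header before .text decodes the body
    html.encoding = 'utf-8'
    try:
        soup = BS(html.text, 'html.parser')
    except Exception:
        print("parse error")
        raise
    print(soup.find('div', {'class': '_3WlLe clearfix'}).get_text())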

Or decode html.content as utf-8 and pass the result to BS instead of html.text:

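    import requests
    from bs4 import BeautifulSoup as BS

    try:
        html = requests.get(url)
    except requests.RequestException:
        print("no connection")
        raise
    # .content is the raw bytes, so the header's charset is never applied
    try:
        soup = BS(html.content.decode('utf-8'), 'html.parser')
    except Exception:
        print("parse error")
        raise
    print(soup.find('div', {'class': '_3WlLe clearfix'}).get_text())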
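
If you are not certain that every page is UTF-8, one option is to try UTF-8 first and fall back to the declared charset only when that fails. This is a sketch and not part of the original answer: `decode_body` and its `declared` parameter are hypothetical names, and in the code above you would pass it html.content.

```python
def decode_body(raw: bytes, declared: str = "iso-8859-1") -> str:
    """Decode raw response bytes, preferring UTF-8 over the declared charset."""
    try:
        # Strict UTF-8 decoding rejects most text that is really in another encoding.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to what the server declared; replace anything undecodable.
        return raw.decode(declared, errors="replace")


utf8_page = "café".encode("utf-8")
latin1_page = "café".encode("iso-8859-1")
print(decode_body(utf8_page) == decode_body(latin1_page))
```

Trying UTF-8 first works because text in a single-byte encoding that contains non-ASCII characters is rarely valid UTF-8, so a mislabelled page is usually caught by the exception.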

I strongly recommend learning about character encodings and Unicode. It is very easy to get tripped up by them. We have all been there.

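Characters, Symbols and the Unicode Miracle, by Tom Scott and Sean Riley of Computerphile

What Every Programmer Absolutely, Positively Needs to Know About Encodings and Character Sets to Work With Text, by David C. Zentgraf

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), by Joel Spolsky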

Answer 1
0 · accepted · 2020-03-13 09:44:47
