Python and BeautifulSoup encoding issues

Question

I'm writing a crawler in Python using BeautifulSoup, and everything was going swimmingly until I ran into this site:

http://www.elnorte.ec/

I'm fetching the content with the requests library:

r = requests.get('http://www.elnorte.ec/')
content = r.content

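If I print the content variable at that point, all the Spanish special characters seem to be fine. However, as soon as I feed the content variable to BeautifulSoup, it all gets messed up: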

soup = BeautifulSoup(content)
print(soup)
...
<a class="blogCalendarToday" href="/component/blog_calendar/?year=2011&amp;month=08&amp;day=27&amp;modid=203" title="1009 artÃculos en este dÃa">
...

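It's apparently garbling all the Spanish special characters (accents and the like). I've tried content.decode('utf-8') and content.decode('latin-1'), and I've also tried passing the fromEncoding parameter to BeautifulSoup, setting it to fromEncoding='utf-8' and fromEncoding='latin-1', but still no dice.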

Any pointers would be much appreciated.

Answer 1

In your case this page contains malformed utf-8 data, which confuses BeautifulSoup and makes it think the page uses windows-1252. You can do this:

soup = BeautifulSoup.BeautifulSoup(content.decode('utf-8','ignore'))

By doing this you discard any invalid bytes from the page source, and BeautifulSoup will then guess the encoding correctly.

You can replace 'ignore' with 'replace' and check the text for '?' symbols to see what has been discarded.
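To make the difference between the two error handlers concrete, here is a minimal Python 3 sketch. The byte string is a made-up sample, not data taken from the page: it is valid UTF-8 with one stray 0xE9 byte in the middle. The last line also reproduces the kind of garbling shown in the question by reading UTF-8 bytes as windows-1252.

```python
# Hypothetical sample: valid UTF-8 text with one stray Latin-1 byte (0xE9) in it.
raw = "artículos".encode("utf-8") + b"\xe9" + " día".encode("utf-8")

# 'ignore' silently drops the undecodable byte.
ignored = raw.decode("utf-8", "ignore")

# 'replace' substitutes U+FFFD (often rendered as a question-mark glyph),
# so the damaged spots stay visible in the text.
replaced = raw.decode("utf-8", "replace")

# Reading correct UTF-8 bytes as windows-1252 turns each accented
# letter into two characters, the first of which is 'Ã'.
garbled = "artículos".encode("utf-8").decode("cp1252")

print(repr(ignored))
print(repr(replaced))
print(repr(garbled))
```

Searching the decoded text for "\ufffd" after using 'replace' is a quick way to find where bytes were lost.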

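Actually, it is very hard to write a crawler that guesses the page encoding correctly 100% of the time (browsers are very good at this nowadays). You can use a module like 'chardet', but in your case, for example, it guesses the encoding as ISO-8859-2, which is also incorrect.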

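If you really need to be able to get the encoding of any page a user might supply, you should either build a multi-level detection function (try utf-8, try latin1, and so on), as we did in our project, or use some of the detection code from firefox or chromium as a C module.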
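A multi-level function of that kind could look roughly like the Python 3 sketch below. This is not the project code mentioned above; the name decode_with_fallbacks and the default candidate list are assumptions chosen for illustration.

```python
def decode_with_fallbacks(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding_name) for the first candidate that decodes cleanly.

    Hypothetical helper: the candidate order is an assumption, not a standard.
    """
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # Last resort: keep what is decodable and mark the rest with U+FFFD.
    return raw.decode("utf-8", "replace"), "utf-8 (lossy)"


text, used = decode_with_fallbacks("día".encode("utf-8"))
print(used, text)

text, used = decode_with_fallbacks("día".encode("latin-1"))
print(used, text)
```

The order matters. Strict encodings such as utf-8 should come first, because single-byte encodings like latin-1 accept any byte sequence and would otherwise hide the real encoding.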

Answer 2

Could you try:

import urllib
import BeautifulSoup

r = urllib.urlopen('http://www.elnorte.ec/')
x = BeautifulSoup.BeautifulSoup(r.read())  # read() must be called to get the page bytes
r.close()

print x.prettify('latin-1')

I get the correct output. Oh, and in this particular case you could also use x.__str__(encoding='latin1').

I guess this is because the content is in ISO-8859-1(5) while the meta http-equiv content-type incorrectly says "UTF-8".

Could you confirm?
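One way to check a hypothesis like this is to see whether the raw bytes decode strictly as UTF-8 at all. The Python 3 sketch below uses made-up sample bytes rather than the actual page; is_valid_utf8 is a hypothetical helper name. With requests, the bytes to test would be r.content.

```python
def is_valid_utf8(raw):
    """Return True if the bytes decode as UTF-8 without any error."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical samples: the same text encoded two different ways.
sample_latin1 = "artículos en este día".encode("latin-1")
sample_utf8 = "artículos en este día".encode("utf-8")

print(is_valid_utf8(sample_latin1))
print(is_valid_utf8(sample_utf8))
```

If the page declares UTF-8 but this check fails on its bytes, the declaration is at least partly wrong, which fits both this answer and the malformed-data explanation in Answer 1.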

Answer 3

You can try this. It works with any encoding that the page or the server declares:

import requests
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector

# USERAGENT and url are assumed to be defined elsewhere
headers = {"User-Agent": USERAGENT}
resp = requests.get(url, headers=headers)
# trust the HTTP encoding only if the server sent an explicit charset
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
# encoding declared inside the document itself (meta tag)
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, 'lxml', from_encoding=encoding)
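The snippet above prefers the encoding declared inside the HTML and falls back to the one in the HTTP header. To show that priority without any third-party library, here is a rough standard-library sketch in Python 3. The function names and the regular expressions are assumptions for illustration; they are far less thorough than what bs4 actually does.

```python
import re


def charset_from_content_type(header_value):
    """Extract the charset parameter from a Content-Type header, if present."""
    match = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)", header_value or "", re.I)
    return match.group(1).lower() if match else None


def charset_from_meta(raw, scan_bytes=2048):
    """Look for a charset declaration in a <meta> tag near the start of the page."""
    head = raw[:scan_bytes]
    match = re.search(rb"<meta[^>]*charset\s*=\s*[\"']?([\w.:-]+)", head, re.I)
    return match.group(1).decode("ascii").lower() if match else None


def pick_declared_encoding(raw, content_type):
    # Same priority as the snippet above: HTML declaration first, then HTTP header.
    return charset_from_meta(raw) or charset_from_content_type(content_type)


page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head></html>'
print(pick_declared_encoding(page, "text/html; charset=ISO-8859-1"))
```

A declared encoding is only as reliable as the site that wrote it. Answer 2 suspects that this particular page declares UTF-8 incorrectly, so it may be worth validating the bytes against the declared encoding before trusting it.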

Answer 4

The first answer is right. These functions are also useful sometimes.

    def __if_number_get_string(number):
        converted_str = number
        if isinstance(number, int) or \
            isinstance(number, float):
                converted_str = str(number)
        return converted_str


    def get_unicode(strOrUnicode, encoding='utf-8'):
        strOrUnicode = __if_number_get_string(strOrUnicode)
        if isinstance(strOrUnicode, unicode):
            return strOrUnicode
        return unicode(strOrUnicode, encoding, errors='ignore')

    def get_string(strOrUnicode, encoding='utf-8'):
        strOrUnicode = __if_number_get_string(strOrUnicode)
        if isinstance(strOrUnicode, unicode):
            return strOrUnicode.encode(encoding)
        return strOrUnicode

Answer 5

I would suggest taking a more methodical, foolproof approach.

# 1. get the raw data 
raw = urllib.urlopen('http://www.elnorte.ec/').read()

# 2. detect the encoding and convert to unicode 
content = toUnicode(raw)    # see my caricature for toUnicode below

# 3. pass unicode to beautiful soup. 
soup = BeautifulSoup(content)


def toUnicode(s):
    if type(s) is unicode:
        return s
    elif type(s) is str:
        d = chardet.detect(s)
        (cs, conf) = (d['encoding'], d['confidence'])
        if conf > 0.80:
            try:
                return s.decode( cs, errors = 'replace' )
            except Exception as ex:
                pass 
    # force and return only ascii subset
    return unicode(''.join( [ i if ord(i) < 128 else ' ' for i in s ]))

Whatever you throw at it, you can reason that it will always send valid unicode to bs.

As a result, your parsed tree will behave much better and will not fail in newer, more interesting ways every time you have new data.

Trial and error doesn't work in code. There are just too many combinations :-)
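For readers on Python 3, where unicode and str no longer exist as separate types in the same way, the same idea might be sketched as follows. This is an adaptation under assumptions, not the answerer's code: to_text is a hypothetical name, and the detector is passed in as a parameter (for example chardet.detect, which returns a dict with 'encoding' and 'confidence' keys) so that the sketch runs without chardet installed.

```python
def to_text(data, detect=None, min_confidence=0.80):
    """Always return str, whatever comes in (hypothetical Python 3 port)."""
    if isinstance(data, str):
        return data
    if detect is not None:
        guess = detect(data)  # expected shape: {'encoding': ..., 'confidence': ...}
        encoding = guess.get("encoding")
        confidence = guess.get("confidence") or 0.0
        if encoding and confidence > min_confidence:
            try:
                return data.decode(encoding, errors="replace")
            except LookupError:
                pass  # the detector named a codec Python does not know
    # Fallback: keep only the ASCII subset and blank out everything else.
    return "".join(chr(b) if b < 128 else " " for b in data)


def fake_detector(_data):
    # Stand-in for a real detector, used only to keep this example self-contained.
    return {"encoding": "latin-1", "confidence": 0.9}


print(to_text("día".encode("latin-1"), detect=fake_detector))
print(to_text("día".encode("latin-1")))
```

The result can then be passed to BeautifulSoup as a str, so the parser never has to guess an encoding.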


Answer 1
23 2011-08-28 18:18:23

Answer 2
18 accepted 2011-08-28 17:38:45

Answer 3
5 2017-08-11 20:50:00

Answer 4
2 2012-07-03 15:51:11

Answer 5
2 2016-12-03 12:47:02
