Unable to decode BeautifulSoup output in Python

Question

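I have been trying to write a scraper in Python using BeautifulSoup. Everything went smoothly until I tried to print (or write to a file) the strings contained in various HTML elements. The site I am scraping is http://www.yellowpages.ca/search/si/1/Boots/Montreal+QC, which contains various French characters. For some reason, when I print the content to the terminal or write it to a file, I get the raw escaped Unicode output instead of the decoded string I expected. Here is the script: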

from BeautifulSoup import BeautifulSoup as bs
import urllib as ul
##import re

base_url = 'http://www.yellowpages.ca'
data_file = open('yellow_file.txt', 'a')

data = ul.urlopen(base_url + '/locations/Quebec/Montreal/90014002.html').readlines()

bt = bs(str(data))

result = bt.findAll('div', 'ypgCategory')

bt = bs(str(result))

result = bt.findAll('a')

for tag in result:
    link = base_url + tag['href']
    ##print str(link)
    data = ul.urlopen(link).readlines()

    #data = str(data).decode('latin-1')
    bt = bs(str(data), convertEntities=bs.HTML_ENTITIES, fromEncoding='latin-1')
    titles = bt.findAll('span', 'listingTitle')
    phones = bt.findAll('a', 'phoneNumber')

    entries = zip(titles, phones)

    for title, phone in entries:
        #print title.prettify(encoding='latin-1')
        #data_file.write(title.text.decode('utf-8') + "   " + phone.text.decode('utf-8') + "\n")
        print title.text

data_file.close()

The output is: Projets Autochtones Du Qu\xc3\xa9bec

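As you can see, the accented e that should appear in Québec is not displayed. I have tried everything mentioned on SO: calling unicode(), passing fromEncoding to the soup, .decode('latin-1'), but nothing has worked.

Any ideas?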

Answer 1

This should be something like what you want:

from BeautifulSoup import BeautifulSoup as bs
import urllib as ul

base_url = 'http://www.yellowpages.ca'
data_file = open('yellow_file.txt', 'a')

# Pass the response object straight to BeautifulSoup instead of str(readlines()),
# so the parser receives the raw bytes and detects the encoding itself.
bt = bs(ul.urlopen(base_url + '/locations/Quebec/Montreal/90014002.html'))

for div in bt.findAll('div', 'ypgCategory'):
    for a in div.findAll('a'):
        link = base_url + a['href']

        bt = bs(ul.urlopen(link), convertEntities=bs.HTML_ENTITIES)

        titles = bt.findAll('span', 'listingTitle')
        phones = bt.findAll('a', 'phoneNumber')

        for title, phone in zip(titles, phones):
            line = '%s   %s\n' % (title.text, phone.text)
            # title.text is unicode; encode explicitly when writing to the file.
            data_file.write(line.encode('utf-8'))
            print line.rstrip()

data_file.close()
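
A likely reason the script in the question printed escape sequences is that it called str() on the list returned by readlines(). That produces the repr of the list, in which non-ASCII bytes appear as literal backslash escapes. The sketch below illustrates this with a hard-coded byte string standing in for the downloaded page; the names raw_lines, as_text and decoded are made up for the illustration, and the snippet is written to run on both Python 2 and Python 3.

```python
# A list of byte strings, standing in for what urlopen(...).readlines() returns.
raw_lines = [b"Projets Autochtones Du Qu\xc3\xa9bec\n"]

# str() of a list gives the repr of its items: the UTF-8 bytes for the
# accented e become the six-character escapes \xc3 and \xa9.
as_text = str(raw_lines)
print(as_text)

# Joining the lines and decoding as UTF-8 recovers the real character.
decoded = b"".join(raw_lines).decode("utf-8")
print(decoded == u"Projets Autochtones Du Qu\u00e9bec\n")
```

Handing the response object to BeautifulSoup directly, as in the answer above, avoids the str() step altogether.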

Answer 2

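Who told you to use latin-1 to decode UTF-8? (The encoding is clearly specified in the page's meta tag.)

If you are on Windows, you may have trouble printing Unicode to the console, so it is better to test by writing to a text file first.
If you open a file as text, do not write binary data to it. Match the mode to what you write: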
- codecs.open(...,"w","utf-8").write(unicode_str)
- open(...,"wb").write(unicode_str.encode("utf_8"))
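
The two options above can be sketched as follows. This is only an illustration under a few assumptions: it uses io.open in place of codecs.open (it accepts an encoding argument on both Python 2 and Python 3), writes into a temporary directory, and the names text, text_path and binary_path are invented for the example.

```python
import io
import os
import tempfile

# A Unicode string, standing in for something like title.text.
text = u"Projets Autochtones Du Qu\u00e9bec\n"

out_dir = tempfile.mkdtemp()  # scratch directory, for illustration only
text_path = os.path.join(out_dir, "text_mode.txt")
binary_path = os.path.join(out_dir, "binary_mode.txt")

# Option 1: text mode with an explicit encoding. Write Unicode and let
# the file object encode it. newline="\n" disables newline translation.
with io.open(text_path, "w", encoding="utf-8", newline="\n") as f:
    f.write(text)

# Option 2: binary mode. Encode to bytes yourself before writing.
with open(binary_path, "wb") as f:
    f.write(text.encode("utf-8"))

# Both files should now hold the same UTF-8 bytes.
with open(text_path, "rb") as f:
    text_bytes = f.read()
with open(binary_path, "rb") as f:
    binary_bytes = f.read()

same_bytes = text_bytes == binary_bytes
print(same_bytes)
```

Either way, the point is to decide in one place where Unicode becomes bytes, rather than mixing encoded and decoded strings.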
