UnicodeEncodeError in Python 3 and BeautifulSoup4

When I run my code, I get this error:

UnicodeEncodeError: 'ascii' codec can't encode character '\u0303' in position 71: ordinal not in range(128)

This is my complete code:

from urllib.request import urlopen as uReq
from urllib.request import urlretrieve as uRet
from bs4 import BeautifulSoup as soup
import urllib

for x in range(143, 608):
    myUrl = "example.com/" + str(x)
    try:
        uClient = uReq(myUrl)
        page_html = uClient.read()
        uClient.close()
        page_soup = soup(page_html, "html.parser")

        container = page_soup.findAll("div", {"id": "videoPostContent"})

        img_container = container[0].findAll("img")
        images = img_container[0].findAll("img")

        imgCounter = 0

        if len(images) == "":
            for image in images:
                print('Downloading image from ' + image['src'] + '...')
                imgCounter += 1
                uRet(image['src'], 'pictures/' + str(x) + '.jpg')
        else:
            for image in img_container:
                print('Downloading image from ' + image['src'] + '...')
                imgCounter += 1
                uRet(image['src'], 'pictures/' + str(x) + '_' + str(imgCounter) + '.jpg')
    except urllib.error.HTTPError:
        continue

Tried solutions:

I tried adding .encode/decode('utf-8') and .text.encode/decode('utf-8') to page_soup, but that gives this error:

AttributeError: 'str' / 'bytes' object has no attribute 'findAll'

At least one of the image src URLs contains non-ASCII characters, and urlretrieve is unable to process them.

>>> url = 'http://example.com/' + '\u0303'
>>> urlretrieve(url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
UnicodeEncodeError: 'ascii' codec can't encode character '\u0303' in position 5: ordinal not in range(128)
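
The same message can be reproduced without touching the network. The sketch below assumes that the request line is encoded as ASCII before it is sent, which is an inference from the traceback rather than something stated above; `ascii_encode_error` and the sample request line are made up for illustration.

```python
def ascii_encode_error(text):
    """Return the UnicodeEncodeError message for text, or None if it is pure ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        return str(exc)
    return None


# Hypothetical request line for 'http://example.com/' + '\u0303'.
# The combining tilde sits at index 5, right after "GET /".
message = ascii_encode_error("GET /\u0303 HTTP/1.1")
print(message)
```

Under that assumption, the reported position counts characters in the request line, not in the full URL, which would explain why the position in the question (71) differs from the one in this short example.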

You could try one of these approaches to get around this problem.

  1. Assume that these URLs are valid, and retrieve them using a library that has better Unicode handling, like requests.

  2. Assume that the URLs are valid but contain Unicode characters that must be escaped before being passed to urlretrieve. This would entail splitting the URL into scheme, domain, path, etc., quoting the path and any query parameters, and then unsplitting; all the tools for this are in the urllib.parse module (but this is probably what requests does anyway, so just use requests).

  3. Assume that these URLs are broken, and skip them by wrapping your urlretrieve calls with try/except UnicodeEncodeError.
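
Approaches 2 and 3 could be sketched roughly as follows. This is a minimal illustration, not a tested drop-in fix for the scraper above: `quote_url` and `retrieve_or_skip` are hypothetical helper names, the choice of "safe" characters is one reasonable option among several, and the host name is assumed to be plain ASCII.

```python
from urllib.parse import quote, urlsplit, urlunsplit
from urllib.request import urlretrieve


def quote_url(url):
    """Percent-encode non-ASCII characters in the path, query and fragment."""
    parts = urlsplit(url)
    # '%' and the usual delimiters are listed as safe, so a URL that is
    # already percent-encoded is not encoded a second time.
    path = quote(parts.path, safe="/%:@!$&'()*+,;=~")
    query = quote(parts.query, safe="/%:@!$&'()*+,;=?~")
    fragment = quote(parts.fragment, safe="/%:@!$&'()*+,;=?~")
    # Assumption: the host name is ASCII; an internationalised domain
    # name would need separate IDNA handling.
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


def retrieve_or_skip(url, filename, retrieve=urlretrieve):
    """Download url to filename; return False if the URL cannot be encoded."""
    try:
        retrieve(url, filename)
    except UnicodeEncodeError:
        # Approach 3: treat the URL as broken and move on.
        return False
    return True
```

In the loop from the question, either `uRet(quote_url(image['src']), ...)` or `retrieve_or_skip(image['src'], ...)` would take the place of the direct `uRet` call, depending on whether the odd URLs should be fetched or skipped. The `retrieve` parameter exists only so the wrapper can be exercised without a real download.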
