UnicodeEncodeError in Python 3 and BeautifulSoup4

Question

When I run my code, I get this error:

UnicodeEncodeError: 'ascii' codec can't encode character '\u0303' in position 71: ordinal not in range(128)

This is my whole code:

from urllib.request import urlopen as uReq
from urllib.request import urlretrieve as uRet
from bs4 import BeautifulSoup as soup
import urllib

for x in range(143, 608):
    myUrl = "example.com/" + str(x)
    try:
        uClient = uReq(myUrl)
        page_html = uClient.read()
        uClient.close()
        page_soup = soup(page_html, "html.parser")

        container = page_soup.findAll("div", {"id": "videoPostContent"})

        img_container = container[0].findAll("img")
        images = img_container[0].findAll("img")

        imgCounter = 0

        if len(images) == "":
            for image in images:
                print('Downloading image from ' + image['src'] + '...')
                imgCounter += 1
                uRet(image['src'], 'pictures/' + str(x) + '.jpg')
        else:
            for image in img_container:
                print('Downloading image from ' + image['src'] + '...')
                imgCounter += 1
                uRet(image['src'], 'pictures/' + str(x) + '_' + str(imgCounter) + '.jpg')
    except urllib.error.HTTPError:
        continue

Tried solutions:

I tried adding .encode('utf-8') / .decode('utf-8') and .text.encode('utf-8') / .text.decode('utf-8') to page_soup, but that gives these errors:

AttributeError: 'str' / 'bytes' object has no attribute 'findAll'

Answer 1

At least one of the image src URLs contains non-ASCII characters, and urlretrieve is unable to process them.

>>> from urllib.request import urlretrieve
>>> url = 'http://example.com/' + '\u0303'
>>> urlretrieve(url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
UnicodeEncodeError: 'ascii' codec can't encode character '\u0303' in position 5: ordinal not in range(128)

You could try one of these approaches to get around this problem:

Assume that these URLs are valid, and retrieve them using a library that has better Unicode handling, like requests.
Assume that the URLs are valid, but contain Unicode characters that must be escaped before being passed to urlretrieve. This would entail splitting the URL into scheme, domain, path etc., quoting the path and any query parameters, and then unsplitting; all the tools for this are in the urllib.parse package (but this is probably what requests does anyway, so just use requests).
Assume that these URLs are broken, and skip them by wrapping your urlretrieve calls with try/except UnicodeEncodeError.
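
The second approach can be sketched with urllib.parse. This is only a minimal sketch under a few assumptions, and it is not a claim about what requests does internally: the helper name quote_url and the choice of "safe" characters are made up for illustration, and the host name is left untouched (a non-ASCII host would need IDNA encoding, which is not handled here).

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_url(url):
    """Percent-encode non-ASCII characters in the path, query and fragment.

    Hypothetical helper for illustration; the host part is not modified.
    """
    parts = urlsplit(url)
    # '%' stays in the safe set so that sequences which are already
    # escaped (e.g. '%20') are not escaped a second time.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


if __name__ == "__main__":
    print(quote_url("http://example.com/" + "\u0303"))
```

With a helper like this, the download call in the question could be written as uRet(quote_url(image['src']), ...), so that urlretrieve only ever sees an ASCII URL. Whether the server then serves the escaped URL is a separate matter, so combining this with the try/except approach may still be worthwhile.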
