
UnicodeEncodeError in Python 3 and BeautifulSoup4

When running my code, I get this error:

UnicodeEncodeError: 'ascii' codec can't encode character '\u0303' in position 71: ordinal not in range(128)

This is my whole code:

from urllib.request import urlopen as uReq
from urllib.request import urlretrieve as uRet
from bs4 import BeautifulSoup as soup
import urllib

for x in range(143, 608):
    myUrl = "example.com/" + str(x)
    try:
        uClient = uReq(myUrl)
        page_html = uClient.read()
        uClient.close()
        page_soup = soup(page_html, "html.parser")

        container = page_soup.findAll("div", {"id": "videoPostContent"})

        img_container = container[0].findAll("img")

        imgCounter = 0

        if len(img_container) == 1:
            for image in img_container:
                print('Downloading image from ' + image['src'] + '...')
                imgCounter += 1
                uRet(image['src'], 'pictures/' + str(x) + '.jpg')
        else:
            for image in img_container:
                print('Downloading image from ' + image['src'] + '...')
                imgCounter += 1
                uRet(image['src'], 'pictures/' + str(x) + '_' + str(imgCounter) + '.jpg')
    except urllib.error.HTTPError:
        continue

Tried Solutions:

I tried adding .encode('utf-8')/.decode('utf-8') and .text.encode('utf-8')/.decode('utf-8') to page_soup, but it gives these errors:

AttributeError: 'str' object has no attribute 'findAll' or AttributeError: 'bytes' object has no attribute 'findAll'

At least one of the image src urls contains non-ascii characters, and urlretrieve is unable to process them.

>>> url = 'http://example.com/' + '\u0303'
>>> urlretrieve(url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
UnicodeEncodeError: 'ascii' codec can't encode character '\u0303' in position 5: ordinal not in range(128)

You could try one of these approaches to get around this problem.

  1. Assume that these urls are valid, and retrieve them using a library that has better unicode handling, like requests.

  2. Assume that the urls are valid, but contain unicode characters that must be escaped before passing to urlretrieve. This would entail splitting the url into scheme, domain, path etc., quoting the path and any query parameters, and then unsplitting; all the tools for this are in the urllib.parse package (but this is probably what requests does anyway, so just use requests).

  3. Assume that these urls are broken and skip them by wrapping your urlretrieve calls with try/except UnicodeEncodeError.
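A minimal sketch of approach 2, assuming the src values are ordinary UTF-8 text; quote_url is a hypothetical helper name, not part of the question's code:

```python
from urllib.parse import urlsplit, urlunsplit, quote

def quote_url(url):
    """Percent-encode the non-ascii parts of a url so urlretrieve can use it."""
    # Split into scheme, netloc, path, query and fragment, quote the parts
    # that may contain raw unicode, then reassemble. Including '%' in the
    # safe set avoids double-encoding anything already percent-escaped.
    # (The netloc is left alone; internationalized domains would need
    # separate IDNA handling, which is out of scope here.)
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme,
        parts.netloc,
        quote(parts.path, safe="/%"),
        quote(parts.query, safe="=&%"),
        quote(parts.fragment, safe="%"),
    ))
```

For example, a combining-tilde character (the '\u0303' from the traceback) becomes the two-byte UTF-8 sequence %CC%83, giving an all-ascii url you can pass to urlretrieve.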
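Approach 3 keeps the original download loop but skips any url that cannot be encoded; safe_retrieve is a hypothetical wrapper, not part of the question's code:

```python
from urllib.request import urlretrieve

def safe_retrieve(url, filename):
    """Attempt the download; return False instead of crashing on bad urls."""
    try:
        urlretrieve(url, filename)
        return True
    except UnicodeEncodeError:
        # The url contains raw non-ascii characters that cannot be encoded
        # into an ascii HTTP request line, so treat it as broken and skip it.
        return False
```

In the question's loop you would replace the bare uRet(...) calls with safe_retrieve(...) and, if you like, log the skipped urls.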

