
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' when using urllib.request in Python 3


I'm writing a script that visits a list of links and parses the information on each page.

It works for most sites, but it chokes on some with "UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 13: ordinal not in range(128)".

It stops in client.py (http.client), which urllib uses on Python 3.

The exact link is: http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html

There are quite a few similar posts here, but none of the answers seem to work for me.

My code is:

import socket
from urllib import request
from urllib.error import HTTPError, URLError

def __request(link, debug=0):
    try:
        # Long timeout because many requests were timing out.
        html = request.urlopen(link, timeout=35).read()
        unicode_html = html.decode('utf-8', 'ignore')
    # NOTE: except HTTPError must come first, otherwise except URLError will also catch an HTTPError.
    except HTTPError as e:
        if debug:
            print("The server couldn't fulfill the request for " + link)
            print('Error code: ', e.code)
        return ''
    except URLError as e:
        if isinstance(e.reason, socket.timeout):
            print('timeout')
        # Return an empty page for any URLError, not only for timeouts.
        return ''
    else:
        return unicode_html

This calls the request function:

link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'
page = __request(link)

And the traceback is:

Traceback (most recent call last):
  File "<string>", line 250, in run_nodebug
  File "C:\reader\get_news.py", line 276, in <module>
    main()
  File "C:\reader\get_news.py", line 255, in main
    body = get_article_body(item['link'],debug=0)
  File "C:\reader\get_news.py", line 155, in get_article_body
    page = __request('na',url)
  File "C:\reader\get_news.py", line 50, in __request
    html = request.urlopen(link, timeout=35).read()
  File "C:\Python33\Lib\urllib\request.py", line 156, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python33\Lib\urllib\request.py", line 469, in open
    response = self._open(req, data)
  File "C:\Python33\Lib\urllib\request.py", line 487, in _open
    '_open', req)
  File "C:\Python33\Lib\urllib\request.py", line 447, in _call_chain
    result = func(*args)
  File "C:\Python33\Lib\urllib\request.py", line 1268, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "C:\Python33\Lib\urllib\request.py", line 1248, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "C:\Python33\Lib\http\client.py", line 1061, in request
    self._send_request(method, url, body, headers)
  File "C:\Python33\Lib\http\client.py", line 1089, in _send_request
    self.putrequest(method, url, **skips)
  File "C:\Python33\Lib\http\client.py", line 953, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 13: ordinal not in range(128)

Any help appreciated. It's driving me crazy; I think I've tried all combinations of x.decode and similar.

(I could ignore the offending characters if that is possible.)

Use a percent-encoded URL:

link = 'http://finance.yahoo.com/news/caf%C3%A9s-growing-faster-than-fast-food-peers-144512056.html'

I found the above percent-encoded URL by pointing the browser at

http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html

going to the page, then copying and pasting the encoded URL supplied by the browser back into the text editor. However, you can generate a percent-encoded URL programmatically using:

from urllib import parse

link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'

scheme, netloc, path, query, fragment = parse.urlsplit(link)
path = parse.quote(path)
link = parse.urlunsplit((scheme, netloc, path, query, fragment))

which yields

http://finance.yahoo.com/news/caf%C3%A9s-growing-faster-than-fast-food-peers-144512056.html
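If the links come from scraped pages, some may already be percent-encoded and some may carry stray whitespace. The steps above could be wrapped in a small helper along these lines. This is only a sketch: `to_ascii_url` is a made-up name, the choice of `safe` characters is an assumption about what your URLs look like, and non-ASCII host names are not handled.

```python
from urllib import parse


def to_ascii_url(url):
    """Percent-encode the non-ASCII parts of a URL (hypothetical helper)."""
    scheme, netloc, path, query, fragment = parse.urlsplit(url.strip())
    # Keep '/' and '%' untouched so an already-encoded path is not encoded twice.
    path = parse.quote(path, safe="/%")
    # Keep common query delimiters; everything else unsafe is escaped as UTF-8.
    query = parse.quote(query, safe="=&%+")
    return parse.urlunsplit((scheme, netloc, path, query, fragment))


raw = "http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html"
encoded = to_ascii_url(raw)
print(encoded)
# Running the helper on its own output leaves the URL unchanged.
print(to_ascii_url(encoded) == encoded)
```

The result can then be passed to `request.urlopen` in place of the raw link.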

Your URL contains characters that cannot be represented as ASCII characters.
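To see where "position 13" in the error message comes from, here is a minimal reproduction without any network access. It assumes the request line has the usual "METHOD path HTTP/1.1" shape, which matches the `request.encode('ascii')` call at the bottom of the traceback; the `request_line` string itself is reconstructed for illustration.

```python
# Reconstructed request line (an assumption based on the traceback).
request_line = (
    "GET /news/cafés-growing-faster-than-fast-food-peers-144512056.html HTTP/1.1"
)

error_position = None
try:
    request_line.encode("ascii")
except UnicodeEncodeError as exc:
    # exc.start is the index of the first character that could not be encoded.
    error_position = exc.start
    print(exc)

print(error_position, repr(request_line[error_position]))
```

The index points at the 'é' in "cafés", so the problem is in the URL path itself, not in the downloaded HTML, which is why calling decode on the response does not help.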

You'll have to ensure that all characters have been properly URL-encoded; use urllib.parse.quote_plus, for example. It'll use UTF-8 URL-encoded escaping to represent any non-ASCII characters. Apply it to an individual component (such as a query value) rather than to the whole URL, since quote_plus also escapes ':' and '/'.
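As a rough comparison of the two quoting functions in urllib.parse, the snippet below encodes the same made-up text with both. The sample string is an assumption chosen to show the differences; it is not taken from the question.

```python
from urllib.parse import quote, quote_plus

segment = "cafés & bars/menu"  # made-up sample text

# quote() keeps '/' by default and writes spaces as %20 (suited to paths).
as_path = quote(segment)
# quote_plus() escapes '/' too and writes spaces as '+' (suited to query values).
as_query_value = quote_plus(segment)

print(as_path)         # caf%C3%A9s%20%26%20bars/menu
print(as_query_value)  # caf%C3%A9s+%26+bars%2Fmenu
```

For the link in the question, either function turns the 'é' into %C3%A9; the difference only matters for spaces and slashes.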


