
Problems with encoding Website in Python. Getting 'charmap' codec can't encode character '\x9f' in position

I want to build an RSS feed reader myself, so I got started.

My test page, from which I get my feed, is 'http://heise.de.feedsportal.com/c/35207/f/653902/index.rss'.

It is a German page, so I chose "iso-8859-1" as the decoding. Here is the code:

import re
import time

# opener and tagFilter are defined elsewhere in the script (not shown)
def main():
    counter = 0
    try:
        page = 'http://heise.de.feedsportal.com/c/35207/f/653902/index.rss'
        sourceCode = opener.open(page).read().decode('iso-8859-1')
    except Exception as e:
        print(str(e))
        #print sourceCode
    try:
        titles = re.findall(r'<title>(.*?)</title>',sourceCode)
        links = re.findall(r'<link>(.*?)</link>',sourceCode)
    except Exception as e:
        print(str(e))
    rssFeeds = []
    for link in links:
        if "rss." in link:
            rssFeeds.append(link)
    for feed in rssFeeds:
        if ('html' in feed) or ('htm' in feed):
            try:
                print("Besuche " + feed+ ":")
                feedSource = opener.open(feed).read().decode("iso-8859-1","replace")
            except Exception as e:
                print(str(e))
            content = re.findall(r'<p>(.*?)</p>', feedSource)
            try:
                tempTxt = open("feed" + str(counter)+".txt", "w")
                for line in content:
                    tempTxt.write(tagFilter(line))
            except Exception as e:
                print(str(e))
            finally:
                tempTxt.close()
                counter += 1
                time.sleep(10)

  1. First, I open the website mentioned above. So far there seems to be no problem with opening it.
  2. After decoding the page, I search it for all expressions that are inside link tags.
  3. Then I select the links that contain "rss" and store them in a new list.
  4. With the new list, I open each link and search it for its content.

And now the problems start. I decode those pages, which are still German pages, and I get errors like:

  • 'charmap' codec can't encode character '\x9f' in position 339: character maps to <undefined>
  • 'charmap' codec can't encode character '\x9c' in position 43: character maps to <undefined>
  • 'charmap' codec can't encode character '\x80' in position 131: character maps to <undefined>

I really have no idea why it doesn't work. The data collected before the error appears gets written to a text file.

Example of collected data:

Einloggen auf heise onlineTopthemen:Nachdem Google Anfang des Monats eine 64-Bit-Beta seines hauseigenen Browsers Chrome für Windows 7 und Windows 8 vorgestellt hatte, kümmert sich der Internetriese nun auch um OS X. Wie Tester melden, verbreitet Google über seine Canary-/Dev-Kanäle für Entwickler und Early Adopter nun automatisch 64-Bit-Builds, wenn der User über einen kompatiblen Rechner verfügt.

I hope someone can help me. Other hints or information that would help me build my own RSS feed reader are also welcome.

Greetings, Templum

Per miko and Wooble's comments:

iso-8859-1 should be utf-8, since the returned XML declares its encoding as UTF-8:

In [71]: sourceCode = opener.open(page).read()

In [72]: sourceCode[:100]
Out[72]: "<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet type='text/xsl' href='http://heise.de.feedspo"
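
Rather than hard-coding the codec, the declaration can be read from the raw bytes before decoding. The following is only a sketch: `declared_encoding` is a hypothetical helper, the sample bytes are made up, and falling back to UTF-8 when no declaration is found is an assumption.

```python
import re

def declared_encoding(raw, default="utf-8"):
    """Return the encoding named in an XML declaration, or `default` if none is found."""
    match = re.match(rb"""\s*<\?xml[^>]*encoding=['"]([A-Za-z0-9._-]+)['"]""", raw)
    return match.group(1).decode("ascii") if match else default

# Made-up sample standing in for opener.open(page).read()
raw = "<?xml version='1.0' encoding='UTF-8'?><title>Kanäle für Entwickler</title>".encode("utf-8")

encoding = declared_encoding(raw)
text = raw.decode(encoding, "replace")
print(encoding, text)
```

An XML parser performs this detection itself when it is given the undecoded bytes, which is one more reason to prefer a parser over manual decoding.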

You really ought to be using an XML parser such as lxml or BeautifulSoup to parse XML. Relying on the re module alone is more error-prone.
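
As a rough illustration of the parser approach, here is a sketch that uses the standard library's `xml.etree.ElementTree` instead of lxml or BeautifulSoup (a substitution made only to keep the example dependency-free). The feed content and the example.com URLs are invented, and the parser is handed the raw bytes so that it applies the declared encoding itself.

```python
import xml.etree.ElementTree as ET

# Invented RSS document; the first item title is "Grüße" encoded as UTF-8 bytes
SAMPLE_RSS = b"""<?xml version='1.0' encoding='UTF-8'?>
<rss version="2.0"><channel>
<title>Example feed</title>
<link>http://example.com/</link>
<item><title>Gr\xc3\xbc\xc3\x9fe</title><link>http://example.com/rss.1.html</link></item>
<item><title>Second</title><link>http://example.com/other.html</link></item>
</channel></rss>"""

root = ET.fromstring(SAMPLE_RSS)

# Collect (title, link) pairs from <item> elements only, skipping the channel header
items = [(item.findtext("title"), item.findtext("link")) for item in root.iter("item")]

# Same filter as in the question: keep links containing "rss."
rss_links = [link for _title, link in items if "rss." in link]
print(rss_links)
```

Unlike the regular expressions in the question, this does not pick up the channel-level title and link, and it is not thrown off by attributes or line breaks inside the tags.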


feedSource is unicode, since it is the result of a decoding:

        feedSource = opener.open(feed).read().decode("utf-8","replace")

Therefore, line is also unicode:

    content = re.findall(r'<p>(.*?)</p>', feedSource)
    for line in content:
        ...

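tempTxt is a plain file handle opened without an explicit encoding (as opposed to one opened with io.open, which takes an encoding parameter). On Python 2 such a handle expects bytes (e.g. a str), not unicode. On Python 3 it is a text handle that encodes with the platform's default encoding, which on Windows is often cp1252, a 'charmap' codec that cannot represent characters such as '\x9f'.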
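
A minimal sketch of how the reported error can arise, assuming Python 3 and a default file encoding of cp1252 (an assumption about the asker's system; the sample word is invented). UTF-8 bytes decoded as iso-8859-1 turn into control characters that cp1252 cannot encode:

```python
raw = "Straße".encode("utf-8")      # 'ß' becomes the two bytes C3 9F

wrong = raw.decode("iso-8859-1")    # 'StraÃ\x9fe': each byte becomes one character
right = raw.decode("utf-8")         # 'Straße'

try:
    # Roughly what a text file opened with a cp1252 default does on write
    wrong.encode("cp1252")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

print(repr(wrong))
print(error_message)
```

Decoding with the correct codec removes the stray '\x9f', and writing with an explicit UTF-8 encoding removes the dependency on the platform default.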

So encode the line before writing to the file (on Python 3 this also requires opening the file in binary mode, "wb"):

        for line in content:
            tempTxt.write(line.encode('utf-8'))

or define tempTxt using io.open and specify an encoding:

import io
with io.open(filename, "w", encoding='utf-8') as tempTxt:
    for line in content:
        tempTxt.write(line)
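
A self-contained variant of the io.open approach, as a sketch: the sample lines and the temporary directory are assumptions made so the example can run anywhere, and a newline is added after each line for readability.

```python
import io
import os
import tempfile

# Invented stand-in for the list returned by re.findall
content = ["Kanäle für Entwickler", "Straße"]

path = os.path.join(tempfile.mkdtemp(), "feed0.txt")

# Write with an explicit encoding so the platform default is never consulted
with io.open(path, "w", encoding="utf-8") as temp_txt:
    for line in content:
        temp_txt.write(line + "\n")

# Read back with the same encoding to confirm nothing was lost
with io.open(path, "r", encoding="utf-8") as temp_txt:
    read_back = temp_txt.read().splitlines()

print(read_back)
```

The with statement also closes the file when the block exits, so the finally clause from the question is no longer needed.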

By the way, it's not good to catch all exceptions unless you are prepared to handle all of them:

    except Exception as e:
        print(str(e))   

Furthermore, if you only print the error message, Python goes on to execute the subsequent code even though variables assigned in the try block may be undefined. For example,

    try:
        print("Besuche " + feed+ ":")
        feedSource = opener.open(feed).read().decode("iso-8859-1","replace")
    except Exception as e:
        print(str(e))   
    content = re.findall(r'<p>(.*?)</p>', feedSource)

using feedSource in the call to re.findall may raise a NameError if an exception was raised before feedSource was defined.

You might want to add a continue statement in the except-suite if you want Python to pass over this feed and move on to the next:

    except Exception as e:
        print(str(e))   
        continue
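
Putting both pieces of advice together, the loop might look like the sketch below. `fetch` is a hypothetical stand-in for `opener.open(feed).read()`, the URLs are invented, and catching only IOError and UnicodeDecodeError is an assumption about which failures are worth skipping.

```python
def fetch(url):
    """Hypothetical stand-in for opener.open(url).read(); fails for one URL."""
    if "broken" in url:
        raise IOError("could not open " + url)
    return "<p>Inhalt von {}</p>".format(url).encode("utf-8")

feeds = [
    "http://example.com/a.html",
    "http://example.com/broken.html",
    "http://example.com/b.html",
]

processed, failed = [], []
for feed in feeds:
    try:
        feed_source = fetch(feed).decode("utf-8")
    except (IOError, UnicodeDecodeError) as e:
        # Only expected failures are caught; anything else still surfaces
        print("Skipping {}: {}".format(feed, e))
        failed.append(feed)
        continue
    # Reached only when feed_source was actually assigned
    processed.append(feed_source)

print(len(processed), failed)
```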
