Problems with encoding Website in Python. Getting 'charmap' codec can't encode character '\x9f' in position
I want to build an RSS feed reader by myself, so I got started.
My test page, from which I get my feed, is 'http://heise.de.feedsportal.com/c/35207/f/653902/index.rss'.
It is a German page, so I chose "iso-8859-1" for decoding. Here is the code:
def main():
    counter = 0
    try:
        page = 'http://heise.de.feedsportal.com/c/35207/f/653902/index.rss'
        sourceCode = opener.open(page).read().decode('iso-8859-1')
    except Exception as e:
        print(str(e))
    #print sourceCode
    try:
        titles = re.findall(r'<title>(.*?)</title>',sourceCode)
        links = re.findall(r'<link>(.*?)</link>',sourceCode)
    except Exception as e:
        print(str(e))
    rssFeeds = []
    for link in links:
        if "rss." in link:
            rssFeeds.append(link)
    for feed in rssFeeds:
        if ('html' in feed) or ('htm' in feed):
            try:
                print("Besuche " + feed+ ":")
                feedSource = opener.open(feed).read().decode("iso-8859-1","replace")
            except Exception as e:
                print(str(e))
            content = re.findall(r'<p>(.*?)</p>', feedSource)
            try:
                tempTxt = open("feed" + str(counter)+".txt", "w")
                for line in content:
                    tempTxt.write(tagFilter(line))
            except Exception as e:
                print(str(e))
            finally:
                tempTxt.close()
                counter += 1
                time.sleep(10)
Now the problems start. I decode those pages, which are still German pages, and I get errors like the 'charmap' codec error quoted in the title.
I really have no idea why it won't work. The data collected before the error appears gets written into a text file.
Example of collected data:
Einloggen auf heise onlineTopthemen:Nachdem Google Anfang des Monats eine 64-Bit-Beta seines hauseigenen Browsers Chrome für Windows 7 und Windows 8 vorgestellt hatte, kümmert sich der Internetriese nun auch um OS X. Wie Tester melden, verbreitet Google über seine Canary-/Dev-Kanäle für Entwickler und Early Adopter nun automatisch 64-Bit-Builds, wenn der User über einen kompatiblen Rechner verfügt.
I hope someone can help me. Other clues or information that will help me build my own RSS feed reader are also welcome.
Greetings, Templum
Per miko and Wooble's comment: iso-8859-1 should be utf-8, since the XML returned says the encoding is utf-8:
In [71]: sourceCode = opener.open(page).read()
In [72]: sourceCode[:100]
Out[72]: "<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet type='text/xsl' href='http://heise.de.feedspo"
And you really ought to be using an XML parser like lxml or BeautifulSoup to parse XML. Using only the re module is more error prone.
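As a rough illustration of the parser approach, here is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in for lxml or BeautifulSoup (a substitution made only so the example has no dependencies). The sample feed and its example.com links are hypothetical and merely imitate the shape of an RSS document. Passing the raw bytes to the parser lets it honor the encoding named in the XML declaration, so no manual decode call is needed.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature feed standing in for the bytes returned by
# opener.open(page).read(). Note that it is bytes, not decoded text.
sample = (
    u"<?xml version='1.0' encoding='UTF-8'?>"
    u"<rss version='2.0'><channel>"
    u"<title>Beispiel-Feed</title>"
    u"<item><title>Gr\u00fc\u00dfe</title>"
    u"<link>http://example.com/rss.1.html</link></item>"
    u"<item><title>Zweiter Artikel</title>"
    u"<link>http://example.com/rss.2.html</link></item>"
    u"</channel></rss>"
).encode("utf-8")

# The parser reads the declared encoding from the XML declaration itself.
root = ET.fromstring(sample)

# Collect (title, link) pairs for every <item>, skipping the channel title.
items = [(item.findtext("title"), item.findtext("link"))
         for item in root.iter("item")]

for title, link in items:
    print(link)
```

Unlike the regular expressions in the question, this only picks up title and link elements that belong to an item, and it returns already-decoded text.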
feedSource is a unicode string, since it is the result of a decoding:
feedSource = opener.open(feed).read().decode("utf-8","replace")
Therefore, line is also unicode:
content = re.findall(r'<p>(.*?)</p>', feedSource)
for line in content:
    ...
tempTxt is a plain file handle (as opposed to one opened with io.open, which takes an encoding parameter). In Python 2, such a handle expects bytes (e.g. a str), not unicode. In Python 3, a file opened in text mode without an encoding argument encodes with the platform default (often cp1252 on Windows), which is reported as the 'charmap' codec in the error message.
So encode the line before writing to the file (in Python 3, the file must then be opened in binary mode, "wb"):
for line in content:
    tempTxt.write(line.encode('utf-8'))
or define tempTxt using io.open and specify an encoding:
import io
with io.open(filename, "w", encoding='utf-8') as tempTxt:
    for line in content:
        tempTxt.write(line)
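To see where the '\x9f' in the error message can come from, here is a small self-contained sketch. It assumes that the platform default file encoding is cp1252 (common on Windows, but an assumption here), and the sample word and the temporary file path are made up for illustration. The German letter ß is the two bytes C3 9F in UTF-8; decoding those bytes as iso-8859-1 never fails, but it turns byte 0x9F into the code point U+009F, which cp1252 cannot encode.

```python
import io
import os
import tempfile

# UTF-8 bytes of a German word containing the letter sharp s (C3 9F).
raw = u"Stra\u00dfe".encode("utf-8")

# iso-8859-1 maps every byte to the code point with the same value,
# so the wrong decoding silently produces U+00C3 followed by U+009F.
wrong = raw.decode("iso-8859-1")
right = raw.decode("utf-8")

# cp1252 (assumed default encoding) has no byte for U+009F.
try:
    wrong.encode("cp1252")
    error = None
except UnicodeEncodeError as e:
    error = str(e)
print(error)

# Writing with an explicit encoding does not depend on the platform default.
path = os.path.join(tempfile.mkdtemp(), "feed0.txt")
with io.open(path, "w", encoding="utf-8") as tempTxt:
    tempTxt.write(right)
with io.open(path, encoding="utf-8") as f:
    roundtrip = f.read()
```

Decoding with the encoding the feed actually declares and writing with an explicit encoding addresses both halves of the problem.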
By the way, it's not good to catch all Exceptions unless you are ready to handle all Exceptions:
except Exception as e:
    print(str(e))
Furthermore, if you only print the error message, Python goes on to execute the subsequent code even though variables assigned in the try section may be undefined. For example,
try:
    print("Besuche " + feed+ ":")
    feedSource = opener.open(feed).read().decode("iso-8859-1","replace")
except Exception as e:
    print(str(e))
content = re.findall(r'<p>(.*?)</p>', feedSource)
using feedSource in the call to re.findall may raise a NameError if an exception was raised before feedSource was defined.
You might want to add a continue statement in the except-suite if you want Python to pass over this feed and move on to the next:
except Exception as e:
    print(str(e))
    continue
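Putting the last two points together, a loop along these lines catches only the errors it expects and skips a failing feed. This is a sketch under assumptions: fetch is a hypothetical stand-in for opener.open(feed).read() that fails for one made-up URL, and the choice of IOError and UnicodeDecodeError as the expected exceptions is illustrative only.

```python
import re

def fetch(url):
    # Hypothetical stand-in for opener.open(url).read(); fails for one URL.
    if "broken" in url:
        raise IOError("could not open " + url)
    return u"<p>Inhalt von {}</p>".format(url).encode("utf-8")

def collect(feeds):
    results = {}
    for feed in feeds:
        try:
            # strict decoding, so a wrong encoding surfaces as an error
            feedSource = fetch(feed).decode("utf-8")
        except (IOError, UnicodeDecodeError) as e:
            print("Skipping {}: {}".format(feed, e))
            continue  # feedSource is not defined here, so move on
        # Only reached when feedSource was assigned successfully.
        results[feed] = re.findall(r"<p>(.*?)</p>", feedSource)
    return results

collected = collect(["a.html", "broken.html", "b.html"])
```

Any exception outside the listed types still propagates, which makes unexpected bugs visible instead of hiding them behind a printed message.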