Problems with encoding Website in Python. Getting 'charmap' codec can't encode character '\x9f' in position

Question

I want to build an RSS feed reader myself, so I got started.

My test page, from which I get my feed, is 'http://heise.de.feedsportal.com/c/35207/f/653902/index.rss'.

It is a German page, so I chose "iso-8859-1" as the encoding for decoding. Here is the code:

def main():
    counter = 0
    try:
        page = 'http://heise.de.feedsportal.com/c/35207/f/653902/index.rss'
        sourceCode = opener.open(page).read().decode('iso-8859-1')
    except Exception as e:
        print(str(e))
        #print sourceCode
    try:
        titles = re.findall(r'<title>(.*?)</title>',sourceCode)
        links = re.findall(r'<link>(.*?)</link>',sourceCode)
    except Exception as e:
        print(str(e))
    rssFeeds = []
    for link in links:
        if "rss." in link:
            rssFeeds.append(link)
    for feed in rssFeeds:
        if ('html' in feed) or ('htm' in feed):
            try:
                print("Besuche " + feed+ ":")
                feedSource = opener.open(feed).read().decode("iso-8859-1","replace")
            except Exception as e:
                print(str(e))
            content = re.findall(r'<p>(.*?)</p>', feedSource)
            try:
                tempTxt = open("feed" + str(counter)+".txt", "w")
                for line in content:
                    tempTxt.write(tagFilter(line))
            except Exception as e:
                print(str(e))
            finally:
                tempTxt.close()
                counter += 1
                time.sleep(10)

First, I open the website mentioned above. So far there seems to be no problem with opening it.
After decoding the page, I search it for all expressions inside link tags.
Then I select the links that contain "rss." and store them in a new list.
With the new list, I open each link and search the page for its content.

And this is where the problems start. I decode those pages, which are also German, and I get errors like:

'charmap' codec can't encode character '\x9f' in position 339: character maps to <undefined>
'charmap' codec can't encode character '\x9c' in position 43: character maps to <undefined>
'charmap' codec can't encode character '\x80' in position 131: character maps to <undefined>

I really have no idea why it doesn't work. The data collected before the error appears gets written to a text file.

Example for collected data:

Einloggen auf heise onlineTopthemen:Nachdem Google Anfang des Monats eine 64-Bit-Beta seines hauseigenen Browsers Chrome für Windows 7 und Windows 8 vorgestellt hatte, kümmert sich der Internetriese nun auch um OS X. Wie Tester melden, verbreitet Google über seine Canary-/Dev-Kanäle für Entwickler und Early Adopter nun automatisch 64-Bit-Builds, wenn der User über einen kompatiblen Rechner verfügt.

I hope someone can help me. Other hints or information that would help me build my own RSS feed reader are also welcome.

Greetings Templum

Answer 1

Per miko and Wooble's comment:

iso-8859-1 should be utf-8, since the XML returned says the encoding is UTF-8:

In [71]: sourceCode = opener.open(page).read()

In [72]: sourceCode[:100]
Out[72]: "<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet type='text/xsl' href='http://heise.de.feedspo"

and you really ought to use an XML parser like lxml or BeautifulSoup to parse XML. Relying only on the re module is more error-prone.
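As a rough illustration of the parser-based approach, here is a minimal sketch. It swaps in the standard library's xml.etree.ElementTree instead of lxml or BeautifulSoup so that it runs without extra installs. SAMPLE_FEED, the example.invalid URLs and extract_items are made up for this sketch and stand in for the bytes returned by opener.open(page).read().

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the raw bytes of the feed (not the real heise feed).
SAMPLE_FEED = (
    "<?xml version='1.0' encoding='UTF-8'?>"
    "<rss version='2.0'><channel>"
    "<title>Sample channel</title>"
    "<link>http://example.invalid/</link>"
    "<item><title>Chrome für OS X</title>"
    "<link>http://example.invalid/rss.story1.html</link></item>"
    "<item><title>Größere Änderungen</title>"
    "<link>http://example.invalid/rss.story2.html</link></item>"
    "</channel></rss>"
).encode("utf-8")


def extract_items(feed_bytes):
    """Return (title, link) pairs for every <item> element in an RSS feed."""
    # Passing undecoded bytes lets the parser use the encoding named in the
    # XML declaration, so no manual .decode() call is needed.
    root = ET.fromstring(feed_bytes)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]


items = extract_items(SAMPLE_FEED)
for title, link in items:
    print(title, link)
```

Because only item elements are walked, the channel's own title and link are not mixed in with the entries, which the two separate re.findall calls cannot guarantee.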

feedSource is a unicode string (unicode in Python 2, str in Python 3), since it is the result of decoding:

        feedSource = opener.open(feed).read().decode("utf-8","replace")

Therefore, line is also a unicode string:

    content = re.findall(r'<p>(.*?)</p>', feedSource)
    for line in content:
        ...

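tempTxt is a plain file handle (as opposed to one opened with io.open, which takes an encoding parameter). In Python 2 such a handle expects bytes (i.e. a str), not unicode. In Python 3 it is a text file that encodes with the platform's default codec, and a 'charmap' codec such as cp1252 cannot represent characters like '\x9f'.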
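The characters in the error messages can be reproduced in isolation. The sketch below assumes the pages are UTF-8 and that the file was written on a machine whose default encoding is cp1252. The question does not state the platform, so that part is a guess. The word "Straße" and the helper can_encode are invented for the example.

```python
# "ß" is stored as the two bytes C3 9F in UTF-8.
utf8_bytes = "Straße".encode("utf-8")

# Decoding with the wrong codec never fails, because ISO-8859-1 maps every
# byte to a code point. The second byte of "ß" becomes the control
# character U+009F instead of being merged into one letter.
wrong = utf8_bytes.decode("iso-8859-1")
right = utf8_bytes.decode("utf-8")


def can_encode(text, codec):
    """Return True if text can be encoded with codec, False otherwise."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


print(repr(wrong))
print(can_encode(wrong, "cp1252"))   # the mis-decoded text cannot be written
print(can_encode(right, "cp1252"))   # the correctly decoded text can
```

Under these assumptions the failure happens at write time rather than at decode time, which would explain why opening and decoding the pages appeared to work.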

So encode the line before writing to the file (in Python 3 the file must then be opened in binary mode, "wb"):

        for line in content:
            tempTxt.write(line.encode('utf-8'))

or define tempTxt using io.open and specify an encoding:

import io
with io.open(filename, "w", encoding='utf-8') as tempTxt:
    for line in content:
        tempTxt.write(line)
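Both variants can be compared side by side. This is only a sketch: the sample lines and file names are made up, a temporary directory is used so nothing is left behind, and the manual variant opens the file with "wb" because a Python 3 text-mode file rejects bytes.

```python
import io
import os
import tempfile

# Made-up sample content standing in for the extracted <p> texts.
lines = ["Kanäle für Entwickler", "verfügt"]

with tempfile.TemporaryDirectory() as tmp_dir:
    bytes_path = os.path.join(tmp_dir, "feed_bytes.txt")
    text_path = os.path.join(tmp_dir, "feed_text.txt")

    # Variant 1: encode each line by hand and write the bytes.
    with open(bytes_path, "wb") as out:
        for line in lines:
            out.write(line.encode("utf-8"))

    # Variant 2: let the file object do the encoding.
    with io.open(text_path, "w", encoding="utf-8") as out:
        for line in lines:
            out.write(line)

    # Read both files back as raw bytes to compare what landed on disk.
    with open(bytes_path, "rb") as f:
        manual = f.read()
    with open(text_path, "rb") as f:
        automatic = f.read()

print(manual == automatic)
```

The with blocks also close the files automatically, so no finally clause with tempTxt.close() is needed.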

By the way, it's not a good idea to catch every Exception unless you are ready to handle every one of them:

    except Exception as e:
        print(str(e))

Furthermore, if you only print the error message, Python goes on to execute the subsequent code even though variables assigned in the try block may be undefined. For example,

    try:
        print("Besuche " + feed+ ":")
        feedSource = opener.open(feed).read().decode("iso-8859-1","replace")
    except Exception as e:
        print(str(e))   
    content = re.findall(r'<p>(.*?)</p>', feedSource)

using feedSource in the call to re.findall may raise a NameError if an exception was raised before feedSource was defined.

You might want to add a continue statement in the except-suite if you want Python to pass over this feed and move on to the next:

    except Exception as e:
        print(str(e))   
        continue
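Put together, the loop could look roughly like the sketch below. collect_sources and fake_fetch are hypothetical names. fake_fetch stands in for opener.open(feed).read() so the example runs without network access, and the narrower except clause is one possible choice, not the only one.

```python
def collect_sources(feeds, fetch):
    """Return {feed: decoded text}, skipping feeds that fail to load."""
    sources = {}
    for feed in feeds:
        try:
            raw = fetch(feed)
            text = raw.decode("utf-8")
        except (OSError, UnicodeDecodeError) as e:
            # urllib's URLError is a subclass of OSError in Python 3.
            print("Skipping {}: {}".format(feed, e))
            continue
        # Only reached when both the download and the decoding worked.
        sources[feed] = text
    return sources


def fake_fetch(url):
    """Hypothetical replacement for opener.open(url).read()."""
    if "broken" in url:
        raise OSError("connection refused")
    if "latin" in url:
        return "für".encode("iso-8859-1")  # not valid UTF-8
    return "<p>für</p>".encode("utf-8")


result = collect_sources(
    [
        "http://example.invalid/ok.html",
        "http://example.invalid/broken.html",
        "http://example.invalid/latin.html",
    ],
    fake_fetch,
)
```

Decoding strictly, without "replace", makes a wrong encoding show up as an exception instead of silently producing replacement characters.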

1 answers

solution1
2 ACCPTED 2014-08-06 13:07:04
