Reading/writing files with umlauts in Python (HTML to txt)

Question

I know this has been asked several times, but I think I'm doing everything right and it still doesn't work, so before I go clinically insane I'll make a post. This is the code (it's supposed to convert HTML files to txt files and leave out certain lines):

fid = codecs.open(htmlFile, "r", encoding = "utf-8")
if not fid:
    return
htmlText = fid.read()
fid.close()

stripped = strip_tags(unicode(htmlText))   ### strip html tags (this is not the prob)
lines = stripped.split('\n')
out = []

for line in lines: # just some stuff i want to leave out of the output
    if len(line) < 6:
        continue
    if '*' in line or '(' in line or '@' in line or ':' in line:
        continue
    out.append(line)

result = '\n'.join(out)
base, ext = os.path.splitext(htmlFile)
outfile = base + '.txt'

fid = codecs.open(outfile, "w", encoding = 'utf-8')
fid.write(result)
fid.close()

Thanks!

Answer 1

I'm not sure, but by doing

'\n'.join(out)

with a plain byte string as the separator instead of a unicode string, you may be triggering Python 2's implicit conversion between byte strings and unicode, which uses the ASCII codec rather than UTF-8. Try:

u'\n'.join(out)

to make sure you're using unicode objects everywhere.
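As a rough illustration of "unicode objects everywhere", the sketch below decodes any byte strings before joining with a unicode separator. The helper name `join_lines` and the sample strings are made up for this example, and it assumes that any byte strings are UTF-8 encoded. It is written to run on Python 2.6+ as well as Python 3.

```python
def join_lines(lines, encoding='utf-8'):
    """Join lines with a unicode newline after decoding byte strings (hypothetical helper)."""
    decoded = []
    for line in lines:
        # On Python 2, `bytes` is an alias for `str`, so byte strings are caught there too.
        if isinstance(line, bytes):
            line = line.decode(encoding)
        decoded.append(line)
    return u'\n'.join(decoded)


# One unicode line and one UTF-8 encoded byte string (sample data).
mixed = [u'Gr\xfc\xdfe', u'M\xfcnchen'.encode('utf-8')]
result = join_lines(mixed)
print(repr(result))
```

If every element of `out` is already unicode, the loop changes nothing and only the `u'\n'` separator matters.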

Answer 2

You haven't specified the problem, so this is a complete guess.

What is being returned by your strip_tags() function? Is it returning a unicode object, or is it a byte string? If the latter, writing it to a file opened with codecs.open() forces Python 2 to decode it with the ASCII codec first, which fails on any non-ASCII byte. For example, if strip_tags() is returning a UTF-8 encoded byte string:

>>> s = u'This is \xe4 test\nHere is \xe4nother line.'
>>> print s
This is ä test
Here is änother line.

>>> s_utf8 = s.encode('utf-8')
>>> f=codecs.open('test', 'w', encoding='utf8')
>>> f.write(s)    # no problem with this... s is unicode, but
>>> f.write(s_utf8)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib64/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128)

If this is what you are seeing, you need to make sure that you pass unicode to fid.write(result), which probably means ensuring that strip_tags() returns unicode.
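One way to guarantee that only unicode reaches the file is to decode at the boundary. The following is a minimal sketch, not the asker's actual code: `ensure_unicode` and `write_text` are hypothetical helpers, the UTF-8 default is an assumption about what strip_tags() might return, and `io.open` stands in for `codecs.open` (both take an `encoding` argument).

```python
import io
import os
import tempfile


def ensure_unicode(value, encoding='utf-8'):
    """Return value as unicode text, decoding byte strings (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def write_text(path, value, encoding='utf-8'):
    """Write value to path as encoded text, whatever type it arrived as."""
    with io.open(path, 'w', encoding=encoding) as f:
        f.write(ensure_unicode(value, encoding))


# Simulate a strip_tags() that returned a UTF-8 byte string (assumed case).
s_utf8 = u'This is \xe4 test\nHere is \xe4nother line.'.encode('utf-8')
path = os.path.join(tempfile.mkdtemp(), 'test.txt')
write_text(path, s_utf8)

# Read the file back to confirm the umlauts survived the round trip.
with io.open(path, 'r', encoding='utf-8') as f:
    round_trip = f.read()
```

Fixing strip_tags() so it returns unicode in the first place is still the cleaner option; a guard like this only helps when you cannot change the function.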

Also, a couple of other things I noticed in passing:

codecs.open() will raise an IOError exception if it cannot open the file. It will not return None, so the if not fid: test does not help. You need to use try/except, ideally together with a with statement:

try:
    with codecs.open(htmlFile, "r", encoding="utf-8") as fid:
        htmlText = fid.read()
except IOError as e:
    # handle error
    print e

Also, data that you read from a file opened via codecs.open() is automatically decoded to unicode, so calling unicode(htmlText) achieves nothing.
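To see both points together, here is a small sketch that returns None itself when the file cannot be opened and then checks the type of what was read. `read_text` and the file names are hypothetical, and `io.open` is used as a stand-in for `codecs.open`. On Python 3, IOError is an alias of OSError, so the same except clause works there.

```python
import io
import os
import tempfile


def read_text(path, encoding='utf-8'):
    """Return the decoded file contents, or None if the file cannot be opened."""
    try:
        with io.open(path, 'r', encoding=encoding) as f:
            return f.read()
    except IOError:
        return None


tmp_dir = tempfile.mkdtemp()
html_file = os.path.join(tmp_dir, 'page.html')

# Write raw UTF-8 bytes so that decoding happens only on read.
with open(html_file, 'wb') as f:
    f.write(u'<p>\xe4\xf6\xfc</p>'.encode('utf-8'))

html_text = read_text(html_file)
missing = read_text(os.path.join(tmp_dir, 'does_not_exist.html'))

# The text comes back already decoded, so no extra unicode() call is needed.
already_unicode = isinstance(html_text, type(u''))
```

Note that a file containing invalid UTF-8 would raise UnicodeDecodeError, which this sketch deliberately does not catch.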

solution1
0 2012-07-19 23:23:06

solution2
0 2012-07-20 02:37:32
