Redirect stdout to a UTF-8 encoded file while keeping Windows line endings in Python 2

Question

I hit a wall here. I need to redirect all output to a file, and that file must be encoded in UTF-8. The problem is that when using codecs.open:

# errLog = io.open(os.path.join(os.getcwdu(),u'BashBugDump.log'), 'w',
#                  encoding='utf-8')
errLog = codecs.open(os.path.join(os.getcwdu(), u'BashBugDump.log'),
                     'w', encoding='utf-8')
sys.stdout = errLog
sys.stderr = errLog

codecs opens the file in binary mode, so lines end with a bare \n instead of the Windows \r\n. I tried using io.open, but this does not play well with the print statement used all over the codebase (see "Python 2.7: print doesn't speak unicode to the io module?" or "python: TypeError: can't write str to text stream").

I am not the only one having this issue (for instance, see here), but the solution they adopted is specific to the logging module, which we do not use.

See also this "won't fix" bug in Python: https://bugs.python.org/issue2131

So what's the one right way to do this in Python 2?

Answer 1

Option 1

Redirection is a shell operation. You don't have to change the Python code at all, but you do have to tell Python what encoding to use if redirected. That is done with an environment variable. The following code redirects both stdout and stderr to a UTF-8-encoded file:

test.bat

set PYTHONIOENCODING=utf8
python test.py >out.txt 2>&1

test.py

#coding:utf8
import sys
print u"我不喜欢你女朋友！"
print >>sys.stderr, u"你需要一个新的。"

out.txt (encoded in UTF-8)

我不喜欢你女朋友！
你需要一个新的。

Hex dump of out.txt

0000: E6 88 91 E4 B8 8D E5 96 9C E6 AC A2 E4 BD A0 E5
0010: A5 B3 E6 9C 8B E5 8F 8B EF BC 81 0D 0A E4 BD A0 
0020: E9 9C 80 E8 A6 81 E4 B8 80 E4 B8 AA E6 96 B0 E7
0030: 9A 84 E3 80 82 0D 0A

Note: You do need to print Unicode strings for this to work. Print byte strings and you get the bytes you print.
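The batch file above sets PYTHONIOENCODING in the shell. As a rough sketch, the same variable can also be set when one Python process launches another and captures its output. The names CHILD, env, raw and text are made up for this illustration, and the child script is a stand-in for test.py:

```python
from __future__ import print_function
import os
import subprocess
import sys

# Hypothetical child script: prints one non-ASCII Unicode string to stdout.
CHILD = "from __future__ import print_function\nprint(u'\\u2122 test')\n"

# Copy the current environment and add the same setting used in test.bat.
env = dict(os.environ)
env['PYTHONIOENCODING'] = 'utf8'

# stdout is a pipe here, so the child is in the "redirected" situation.
raw = subprocess.check_output([sys.executable, '-c', CHILD], env=env)
text = raw.decode('utf-8')
print(repr(raw))
```

Without the variable, a Python 2 child whose stdout is redirected typically has no output encoding to fall back on and raises UnicodeEncodeError on the non-ASCII character.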

Option 2

codecs.open may force binary mode, but codecs.getwriter doesn't. Give it a file opened in text mode:

#coding:utf8
import sys
import codecs
sys.stdout = sys.stderr = codecs.getwriter('utf8')(open('out.txt','w'))
print u"我不喜欢你女朋友！"
print >>sys.stderr, u"你需要一个新的。"

(same output and hexdump as above)

Answer 2

It appears that the Python 2 version of io doesn't play well with the print statement, but it will work if you use the print function.

Demo:

from __future__ import print_function
import sys
import io

errLog = io.open('test.log', mode='wt', buffering=1, encoding='utf-8', newline='\r\n')
sys.stdout = errLog

print(u'This is a ™ test')
print(u'Another © line')

contents of 'test.log'

This is a ™ test
Another © line

hexdump of 'test.log'

00000000  54 68 69 73 20 69 73 20  61 20 e2 84 a2 20 74 65  |This is a ... te|
00000010  73 74 0d 0a 41 6e 6f 74  68 65 72 20 c2 a9 20 6c  |st..Another .. l|
00000020  69 6e 65 0d 0a                                    |ine..|
00000025

I ran this code on Python 2.6 on Linux, YMMV.
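The demo leaves sys.stdout pointing at the log file for the rest of the program. If the redirection should only be temporary, one possible sketch is to wrap it in a helper that restores the original stream and closes the file. The helper name write_log and the temporary path are assumptions for the example, not part of the original code:

```python
from __future__ import print_function
import io
import os
import sys
import tempfile

def write_log(path, lines):
    """Print each line to a UTF-8 file with CRLF endings, then restore stdout."""
    saved = sys.stdout
    log = io.open(path, mode='wt', encoding='utf-8', newline='\r\n')
    try:
        sys.stdout = log
        for line in lines:
            print(line)
    finally:
        # Always put the original stream back, even if print() raised.
        sys.stdout = saved
        log.close()

path = os.path.join(tempfile.mkdtemp(), 'test.log')
write_log(path, [u'This is a \u2122 test', u'Another \u00a9 line'])

# Read the file back in binary mode to inspect the exact bytes written.
with open(path, 'rb') as f:
    raw = f.read()
print(repr(raw))
```

Because newline='\r\n' is passed explicitly, the line endings do not depend on the platform the script runs on.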

If you don't want to use the print function, you can implement your own file-like encoding class.

import sys

class Encoder(object):
    def __init__(self, fname):
        self.file = open(fname, 'wb')

    def write(self, s):
        self.file.write(s.replace('\n', '\r\n').encode('utf-8'))

errlog = Encoder('test.log')
sys.stdout = errlog
sys.stderr = errlog

print 'hello\nthere'
print >>sys.stderr, u'This is a ™ test'
print u'Another © line'
print >>sys.stderr, 1, 2, 3, 4
print 5, 6, 7, 8

contents of 'test.log'

hello
there
This is a ™ test
Another © line
1 2 3 4
5 6 7 8

hexdump of 'test.log'

00000000  68 65 6c 6c 6f 0d 0a 74  68 65 72 65 0d 0a 54 68  |hello..there..Th|
00000010  69 73 20 69 73 20 61 20  e2 84 a2 20 74 65 73 74  |is is a ... test|
00000020  0d 0a 41 6e 6f 74 68 65  72 20 c2 a9 20 6c 69 6e  |..Another .. lin|
00000030  65 0d 0a 31 20 32 20 33  20 34 0d 0a 35 20 36 20  |e..1 2 3 4..5 6 |
00000040  37 20 38 0d 0a                                    |7 8..|
00000045

Please bear in mind that this is just a quick demo. You may want a more sophisticated way to handle newlines; e.g., you probably don't want to replace \n if it's already preceded by \r. OTOH, with normal Python text handling that shouldn't be an issue...

Here's yet another version, which combines the two previous strategies. I don't know if it's any faster than the second version.
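One way to leave existing \r\n pairs alone is a regular expression with a negative lookbehind, so that only a \n not preceded by \r is replaced. This is a minimal sketch; the function name to_crlf is made up for the example:

```python
import re

# Matches a line feed that is NOT immediately preceded by a carriage return.
_BARE_LF = re.compile(u'(?<!\r)\n')

def to_crlf(text):
    """Convert bare \\n to \\r\\n, leaving existing \\r\\n pairs untouched."""
    return _BARE_LF.sub(u'\r\n', text)

sample = u'a\nb\r\nc'
converted = to_crlf(sample)
print(repr(converted))
```

Note that this only looks within a single string. If a \r ends one write() call and the matching \n starts the next, the pair would still be turned into \r\r\n, so a real stream wrapper would need to remember the last character written.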

import sys
import io

class Encoder(object):
    def __init__(self, fname):
        self.file = io.open(fname, mode='wt', encoding='utf-8', newline='\r\n')

    def write(self, s):
        self.file.write(unicode(s))

errlog = Encoder('test.log')
sys.stdout = errlog
sys.stderr = errlog

print 'hello\nthere'
print >>sys.stderr, u'This is a ™ test'
print u'Another © line'
print >>sys.stderr, 1, 2, 3, 4
print 5, 6, 7, 8

This produces the same output as the previous version.
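One caveat with unicode(s): in Python 2 it decodes byte strings with the default ASCII codec, so a byte string containing non-ASCII characters raises UnicodeDecodeError. A possible variation, sketched below, decodes byte strings with an explicit encoding and adds flush and close methods. The class name Utf8CrlfLog and the input_encoding parameter are assumptions for the example, and the sketch assumes byte strings reaching write() are UTF-8:

```python
import io
import os
import tempfile

class Utf8CrlfLog(object):
    """File-like object that writes UTF-8 text with CRLF line endings."""

    def __init__(self, fname, input_encoding='utf-8'):
        self.file = io.open(fname, mode='wt', encoding='utf-8', newline='\r\n')
        # Assumed encoding of any byte strings passed to write().
        self.input_encoding = input_encoding

    def write(self, s):
        # The Python 2 print statement passes str (bytes) or unicode objects.
        if isinstance(s, bytes):
            s = s.decode(self.input_encoding)
        self.file.write(s)

    def flush(self):
        self.file.flush()

    def close(self):
        self.file.close()

path = os.path.join(tempfile.mkdtemp(), 'test.log')
log = Utf8CrlfLog(path)
log.write(b'hello\n')            # byte string, decoded before writing
log.write(u'caf\u00e9\n')        # Unicode string, written as-is
log.close()

with open(path, 'rb') as f:
    raw = f.read()
print(repr(raw))
```

An instance can be assigned to sys.stdout and sys.stderr in the same way as the Encoder objects above.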

Answer 1 (score 4, accepted): 2016-12-06 05:01:39
Answer 2 (score 1): 2016-12-05 08:20:01
