u'Too' u'much' u'unicode' u'returned'

Question

I have an API that I send data to and retrieve data from in a natural language processing context, using JSON.

Everything is coming out as unicode. For example, if I retrieve a list of words from my API, every single word is u''. This is what the JSON output looks like after printing to a file:

{u'words': [u'every', u'single', u'word']}

I must clarify that in the terminal everything looks good, just not when I print the output to a file.

I haven't figured out yet whether this is the preferable default behavior or whether I need to do something along the way to make the output plain. The outputs are going to be used with languages other than Python, in other contexts where they need to be readable and/or parseable.

So clearly I don't have a grasp on Python and unicode, or on how and where this is happening.

Is this preferable when dealing with JSON? Should I not worry about it?
How do I turn this off, or what extra step do I take to make this less of a nuisance? (I've already tried, but I can't figure out exactly where this is happening.)

I have much to learn, so any input is appreciated.

EDIT: all the input has been useful, thank you.

I was under the mistaken notion that jsonify did more than it actually does, I guess. If I call json.dumps earlier in my task chain, I get actual JSON on the other end.

Answer 1

There is nothing wrong with this, and you don't need to do anything about it.

In Python 2, a str is similar to a C string: it's just a sequence of bytes, sometimes incorrectly assumed to be ASCII text. It can contain encoded text, e.g. as UTF-8 or ASCII.

The unicode type represents an actual string of text, similar to a Java String. It is text in the abstract sense, not tied to a particular encoding. You can decode a str into unicode, or encode a unicode into a str.
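A minimal sketch of that round trip, assuming UTF-8 as the encoding (the variable names are illustrative, not from the question). The unicode_literals import makes the literal a unicode object on Python 2; on Python 3 it has no effect, since str is already text there:

```python
from __future__ import unicode_literals

# Abstract text: five characters, no encoding implied.
text = "na\u00efve"

# Encoding turns text into bytes; the accented character takes two bytes in UTF-8.
data = text.encode("utf-8")

# Decoding with the same encoding recovers the original text.
roundtrip = data.decode("utf-8")

assert len(text) == 5
assert len(data) == 6
assert roundtrip == text
```

The length difference is the point: the text has five characters, but its UTF-8 encoding has six bytes, so the two types cannot be treated as interchangeable.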

JSON keys and values are strings - they are not byte arrays, but text - so they are represented by unicode objects in Python.

If you need JSON output for use in another language, use the json module to produce it from your dictionary:

>>> import json
>>> print json.dumps({u'words': [u'every', u'single', u'word']})
{"words": ["every", "single", "word"]}

Answer 2

It is preferable, yes, since JSON text is defined in terms of Unicode characters.

If you have more specific things that are causing you trouble you should share them, otherwise I'd recommend watching Ned Batchelder's Intro if you're just generally uncomfortable with Unicode (in Python in particular). I don't know what is causing this to be a nuisance to you, since I don't know what you're doing with this dict.

Answer 3

You should keep everything internal to Python in unicode if there's any chance you will need it. Where Python speaks to other programs, use s.encode('UTF-8') to make a regular byte string (str) that you can write to a file or socket or whatever. Use s.decode('UTF-8') to convert a byte string from a file/socket back to unicode. (UTF-8 seems like a reasonable default, but use whatever your protocol specifies.)
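As a rough sketch of that boundary rule, assuming UTF-8 and two hypothetical helpers (write_text and read_text are illustrative names, not part of any library), the encode step happens only on the way out and the decode step only on the way in:

```python
import os
import tempfile


def write_text(path, text):
    """Encode text to UTF-8 bytes at the boundary and write them out."""
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))


def read_text(path):
    """Read raw bytes and decode them back to text at the boundary."""
    with open(path, "rb") as f:
        return f.read().decode("utf-8")


# Round trip through a temporary file; the path is created just for this demo.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
write_text(demo_path, u"caf\u00e9")
assert read_text(demo_path) == u"caf\u00e9"
```

Opening the file in binary mode keeps the encoding explicit, so the rest of the program only ever handles unicode text.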

3 answers

solution1
4 ACCEPTED 2012-08-30 23:49:08

solution2
3 2012-08-30 23:44:32

solution3
1 2012-08-30 23:43:18