JSON contains incorrect UTF-8 \u00ce\u00b2 instead of Unicode \u03b2, how to fix in Python?

Question

First, note that the symbol β (Greek beta) has the hex representation CE B2 in UTF-8.
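The byte values can be checked directly. This is a minimal sketch assuming Python 3 (the question itself concerns Python 2.7); the variable names are illustrative only.

```python
# Python 3 sketch: confirm the UTF-8 bytes of Greek beta (U+03B2).
beta = "\u03b2"
utf8_bytes = beta.encode("utf-8")
print(utf8_bytes.hex())  # hex digits of the two UTF-8 bytes

# Reading each byte as its own code point (Latin-1) yields the
# two-character sequence U+00CE U+00B2 that shows up in the bad JSON.
mojibake = utf8_bytes.decode("latin-1")
print([hex(ord(ch)) for ch in mojibake])
```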

I have legacy source code in Python 2.7 that uses json strings:

u'{"something":"text \\u00ce\\u00b2 text..."}'

It then calls json.loads(string) or json.loads(string, 'utf-8'), but the result is a Unicode string whose characters are the individual UTF-8 bytes:

u'text \xce\xb2 text'

What I want is a normal Python Unicode (UTF-16?) string:

u'text β text'

If I call:

text = text.decode('unicode_escape')

before json.loads, then I get the correct Unicode β symbol, but it also breaks the JSON, because it replaces every other escape as well, including the escaped newlines (\\n).
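The breakage can be reproduced in a few lines. This is a sketch assuming Python 3, where the closest equivalent of the call above is an encode/decode round trip; the sample string with an escaped newline is made up for illustration.

```python
import json

# Raw JSON text as it might arrive: an escaped newline plus the mis-encoded beta.
raw = '{"something":"line1\\nline2 \\u00ce\\u00b2"}'

# unicode_escape interprets every backslash escape, including \n,
# so the JSON string value now contains a literal newline character.
decoded = raw.encode("ascii").decode("unicode_escape")

try:
    json.loads(decoded)
    parsed_ok = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    parsed_ok = False

print(parsed_ok)
```

By default json.loads rejects raw control characters inside string values, which is why decoding the whole document up front is not a safe fix.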

The question is: how do I convert only the "\\u00ce\\u00b2" part without affecting other JSON special characters?

(I am new to Python, and it is not my source code, so I have no idea how this is supposed to work. I suspect that the code only works with ASCII characters.)

Answer 1

Something like this, perhaps. This is limited to 2-byte UTF-8 characters.

import json
import re

j = u'{"something":"text \\u00ce\\u00b2 text..."}'

def decodeu (match):
    u = '%c%c' % (int(match.group(1), 16), int(match.group(2), 16))
    return repr(u.decode('utf-8'))[2:8]

j = re.sub(r'\\u00([cd][0-9a-f])\\u00([89ab][0-9a-f])',decodeu, j)

print(j)

This prints {"something":"text \u03b2 text..."} for your sample. At this point, you can load it as regular JSON and get the final string you want.

result = json.loads(j)
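For readers on Python 3, where str has no decode method and the repr slicing trick no longer applies, the same idea might look like the sketch below. It is an adaptation rather than the answer's own code: the names PAIR and repair_pair, the case-insensitive flag, and the fallback for invalid byte pairs are assumptions added here. Like the original, it only handles 2-byte sequences.

```python
import json
import re

# Two consecutive \u00XX escapes that look like a 2-byte UTF-8 sequence:
# a lead byte in C0-DF followed by a continuation byte in 80-BF.
PAIR = re.compile(r'\\u00([cd][0-9a-f])\\u00([89ab][0-9a-f])', re.IGNORECASE)

def repair_pair(match):
    raw = bytes([int(match.group(1), 16), int(match.group(2), 16)])
    try:
        char = raw.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8 (for example an overlong C0/C1 lead byte): leave as is.
        return match.group(0)
    # Emit a proper JSON escape for the real code point.
    return '\\u%04x' % ord(char)

raw_json = '{"something":"text \\u00ce\\u00b2 text...\\n"}'
fixed = PAIR.sub(repair_pair, raw_json)
result = json.loads(fixed)
print(fixed)
```

Because the substitution runs on the raw JSON text, other escapes such as \\n are left for json.loads to interpret.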

Answer 2

Here's a string-fixer that works after loading the JSON. It handles UTF-8-like sequences of any length and ignores escape sequences that don't look like UTF-8 sequences.

Example:

import json
import re

def fix(bad):
    return re.sub(ur'[\xc2-\xf4][\x80-\xbf]+',lambda m: m.group(0).encode('latin1').decode('utf8'),bad)

# 2- and 3-byte UTF-8-like sequences and one correct escape code.
json_text = '''\
{
  "something":"text \\u00ce\\u00b2 text \\u00e4\\u00bd\\u00a0\\u597d..."
}
'''

data = json.loads(json_text)
bad_str = data[u'something']
good_str = fix(bad_str)
print bad_str
print good_str

Output:

text Î² text ä½ 好...
text β text 你好...
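A Python 3 version of the same post-load fixer could be sketched as follows. The ur'' prefix and the print statement are Python 2 only, so this is an adaptation; the names SEQ and repair, and the fallback that leaves undecodable runs untouched, are assumptions not present in the answer above.

```python
import json
import re

# A lead-byte-like character (U+00C2..U+00F4) followed by one or more
# continuation-byte-like characters (U+0080..U+00BF).
SEQ = re.compile('[\xc2-\xf4][\x80-\xbf]+')

def fix(bad):
    def repair(m):
        try:
            # Map each character back to a single byte, then decode as UTF-8.
            return m.group(0).encode('latin-1').decode('utf-8')
        except UnicodeDecodeError:
            # Looks like UTF-8 but is not valid: keep the original text.
            return m.group(0)
    return SEQ.sub(repair, bad)

json_text = '{"something":"text \\u00ce\\u00b2 text \\u00e4\\u00bd\\u00a0\\u597d..."}'
bad_str = json.loads(json_text)['something']
good_str = fix(bad_str)
print(good_str)
```

The correctly escaped character \\u597d falls outside the matched range, so it passes through unchanged.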

2 answers

solution1
1 ACCEPTED 2017-12-06 13:02:32

solution2
1 2017-12-06 18:31:39
