Reading unicode characters from file/sqlite database and using it in Python

I have a list of variable names that contain Unicode characters, some of them for chemicals such as ozone: 'O\u2083'. All of them are stored in a SQLite database, which my Python code reads in order to display O₃. However, when I read the value back I get 'O\\u2083'. The SQLite database is created from a CSV file that contains the string 'O\u2083', among others. I understand that \u2083 is not being stored in the SQLite database as a single Unicode character but as 6 characters (\, u, 2, 0, 8, 3). Is there any way to recognize Unicode escapes in this context? My first idea is to write a function that recognizes these character sequences and replaces them with the corresponding Unicode characters. Is anything like this already implemented?

It's important to remember that everything is bytes. To turn bytes into something useful, you have to know which encoding was used when you pull the data in; there are too many ambiguous cases to determine the encoding reliably by analyzing the data. When you send data out of your program, it all goes back to bytes again. Depending on whether you're using Python 2.x or 3.x, you'll have a very different experience with Unicode.

You can, however, attempt the encoding and simply "replace" on errors. For example, the_string.encode("utf-8", "replace") will try to encode as UTF-8 and will replace any character it cannot encode with a ?. You could also anticipate problem characters and replace them beforehand, but that gets unmanageable quickly. Take a look at the codecs module for more error-handling options.
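As a minimal sketch of the error handlers mentioned here (Python 3 syntax; the variable names are only illustrative): UTF-8 can represent U+2083 directly, so "replace" has nothing to do there, and the handler only matters when the target codec is narrower, such as ASCII.

```python
text = "O\u2083"  # two characters: 'O' and SUBSCRIPT THREE

# UTF-8 can encode the subscript, so no replacement takes place.
utf8_bytes = text.encode("utf-8")

# ASCII cannot: "replace" substitutes '?', while "backslashreplace"
# writes the escape sequence out as literal ASCII characters.
ascii_replaced = text.encode("ascii", "replace")
ascii_escaped = text.encode("ascii", "backslashreplace")

print(utf8_bytes)      # b'O\xe2\x82\x83'
print(ascii_replaced)  # b'O?'
print(ascii_escaped)   # b'O\\u2083'
```

Note that "backslashreplace" produces exactly the 7-byte form described in the question, which is one way such text can end up in a file.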

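If what you have is a byte string of length 7 (the backslash escape spelled out character by character rather than the single subscript character), decode the Unicode escape. On Python 2: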

>>> s = 'O\u2083'
>>> len(s)
7
>>> s
'O\\u2083'
>>> print(s)
O\u2083
>>> u = s.decode('unicode-escape')
>>> len(u)
2
>>> u
u'O\u2083'
>>> print(u)
O₃

Caveat: Your console/IDE used to print the character needs to use an encoding that supports the character or you'll get a UnicodeEncodeError when printing. The font must support the symbol as well.
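On Python 3, str has no decode method, so the session above does not carry over directly. The following is a sketch of two possible equivalents, assuming the text read from the database contains literal \uXXXX sequences. The helper name unescape_u is hypothetical, not a library function.

```python
import re

raw = r"O\u2083"  # 7 characters: O, \, u, 2, 0, 8, 3

# Option 1: round-trip through bytes. Only safe when the text is pure
# ASCII, because "unicode_escape" treats the bytes as Latin-1.
decoded = raw.encode("ascii").decode("unicode_escape")

# Option 2: replace only \uXXXX sequences and leave everything else,
# including characters that are already non-ASCII, untouched.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def unescape_u(text):
    """Replace each literal \\uXXXX sequence with the character it names."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

print(len(raw), len(decoded))  # 7 2
print(unescape_u(raw))         # O₃ (if the console can display it)
```

The regular-expression version is more forgiving for mixed input, such as a CSV that already contains some real non-ASCII characters alongside escaped ones.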

SQLite allows you to read and write Unicode text directly. u'O\u2083' is two characters, u'O' and u'\u2083' (your question has a typo: 'u\2083' != '\u2083').

I understand that u\2083 is not being stored in sqlite database as unicode character but as 6 unicode characters (which would be u,\,2,0,8,3)

Don't confuse u'u\2083' and u'\u2083': the latter is a single character, while the former is a 4-character sequence: u'u', u'\x10' ('\20' is interpreted as an octal escape in Python), u'8', u'3'.
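The difference can be checked directly. This is a small illustrative sketch in Python 3 syntax, where every string literal is already a Unicode string:

```python
import unicodedata

typo = "u\2083"     # 'u', then octal escape \20 (U+0010), then '8', '3'
correct = "\u2083"  # a single character

print(len(typo), [hex(ord(c)) for c in typo])
# 4 ['0x75', '0x10', '0x38', '0x33']

print(len(correct), unicodedata.name(correct))
# 1 SUBSCRIPT THREE
```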

If you save a single Unicode character u'\u2083' into a SQLite database, it is stored as a single Unicode character (the internal representation of Unicode inside the database is irrelevant as long as the abstraction holds).
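A quick round trip illustrates this. The sketch below uses Python 3's standard sqlite3 module with an in-memory database; the table and column names are made up for the example. It stores both the real character and the 7-character escaped text, as a CSV holding the literal backslash sequence would supply it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE species (name TEXT)")

# First row: the real two-character string.
conn.execute("INSERT INTO species VALUES (?)", ("O\u2083",))
# Second row: the literal backslash escape, 7 characters long.
conn.execute("INSERT INTO species VALUES (?)", (r"O\u2083",))

rows = [row[0] for row in conn.execute("SELECT name FROM species ORDER BY rowid")]
lengths = [len(name) for name in rows]
conn.close()

print(lengths)  # [2, 7]
```

SQLite returns exactly what was inserted in both cases, which suggests the escaping has to be undone when the CSV is loaded (or when the value is read back), not inside the database.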

On Python 2, if there is no from __future__ import unicode_literals at the top of the module, then a string literal such as 'abc' creates a bytestring instead of a Unicode string. In that case both 'u\2083' and '\u2083' are sequences of bytes, not text characters (\uxxxx is not recognized as a Unicode escape sequence inside bytestrings).
