Python decoding of back quotations

I am getting this error:
"UnicodeEncodeError: 'latin-1' codec can't encode character u'\u201d'"

I'm quite new to working with databases as a whole. Previously I had been using SQLite3; now that I'm migrating to MySQL, I noticed that some of my text data contains u'\u201d' and u'\u201c' characters.

I'm currently writing a Python script to handle the migration, but I'm stuck on this codec issue, which I didn't foresee.

So my question is, how do I replace/decode these values so that I can actually store them in MySQL DB?

You don't have a problem decoding these characters; wherever they're coming from, if they're showing up as \u201d (”) and \u201c (“), they have already been decoded properly.

The problem is encoding these characters. If you want to store your strings in Latin-1 columns, they can only contain the 256 characters that exist in Latin-1, and these two are not among them.
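A minimal sketch of the failure, written for Python 3 (the u'...' prefix in the error suggests the question used Python 2, where the same call on a unicode object behaves the same way). The sample string is made up for illustration.

```python
# Hypothetical sample text containing curly quotes (U+201C and U+201D).
s = 'hey \u201chey\u201d hey'

failed = False
offending = None
try:
    s.encode('latin-1')
except UnicodeEncodeError as exc:
    failed = True
    # The exception records which slice of the input could not be encoded.
    offending = exc.object[exc.start:exc.end]
    print(repr(offending), '-', exc.reason)
```

The decode step never enters into it: the string is already valid text, and only the conversion to Latin-1 bytes fails.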


So my question is, how do I replace/decode these values so that I can actually store them in MySQL DB?

The obvious solution is to use UTF-8 columns instead of Latin-1 in MySQL. Then this problem wouldn't even exist; any Unicode string can be encoded as UTF-8.
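As a quick check of that claim, the sketch below encodes the same made-up sample string as UTF-8 and decodes it back. It only shows the Python side; how the MySQL columns and the connection character set get switched to UTF-8 (utf8mb4 in MySQL terms) depends on the schema and driver, which the question does not show.

```python
# Hypothetical sample text containing curly quotes (U+201C and U+201D).
s = 'hey \u201chey\u201d hey'

encoded = s.encode('utf-8')        # never raises for ordinary text like this
decoded = encoded.decode('utf-8')  # round-trips back to the original string

print(encoded)
print(decoded == s)
```

Each curly quote takes three bytes in UTF-8, so the encoded value is longer than the string, but nothing is lost.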


But assuming you can't do that for some reason…

Python comes with built-in support for different error handlers that can help you do something with these characters while encoding them. You just have to decide what "something" that is.

Let's say your string is hey “hey” hey. Here's what each error handler would do with it:

  • s.encode('latin-1', 'ignore'): hey hey hey
  • s.encode('latin-1', 'replace'): hey ?hey? hey
  • s.encode('latin-1', 'xmlcharrefreplace'): hey &#8220;hey&#8221; hey
  • s.encode('latin-1', 'backslashreplace'): hey \u201chey\u201d hey
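The four handlers listed above can be compared side by side with a short script. This is a sketch in Python 3, where encode() returns bytes; the sample string is the one from the example.

```python
# Sample string from the example, with curly quotes written as escapes.
s = 'hey \u201chey\u201d hey'

handlers = ('ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace')

# Encode the same string once per error handler and keep the results.
results = {name: s.encode('latin-1', name) for name in handlers}

for name, encoded in results.items():
    print(f'{name:>18}: {encoded!r}')
```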

The first two have the advantage of being somewhat readable, but the disadvantage that you can never recover the original string. If you can live with that loss but want something even more readable, consider a third-party library like unidecode:

  • unidecode('hey “hey” hey').encode('latin-1'): hey "hey" hey
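If installing a third-party package is not an option, a hand-written translation table gives a similar result for the handful of characters you know about. This is a swapped-in alternative to unidecode, not what unidecode does internally; QUOTE_MAP and straighten_quotes are hypothetical names, and the table covers only the four common curly quotes.

```python
# Hypothetical mapping: curly quotes to their plain ASCII counterparts.
QUOTE_MAP = str.maketrans({
    '\u201c': '"',   # left double quotation mark
    '\u201d': '"',   # right double quotation mark
    '\u2018': "'",   # left single quotation mark
    '\u2019': "'",   # right single quotation mark
})

def straighten_quotes(text):
    """Replace curly quotes with straight ones; other characters pass through."""
    return text.translate(QUOTE_MAP)

s = 'hey \u201chey\u201d hey'
encoded = straighten_quotes(s).encode('latin-1')
print(encoded)
```

Any other character outside Latin-1 would still raise UnicodeEncodeError here, so this only fits data where you know which characters occur.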

The last two are lossless, but kind of ugly. In some contexts, though, they'll look pretty nice: for example, if you're building an XML document, xmlcharrefreplace (maybe even with 'ascii' instead of 'latin-1') will give you exactly what you want in an XML viewer. There are special-purpose translators for various other use cases (HTML references, XML named entities instead of numbered ones, etc.) if you know what you want.
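To illustrate what "lossless" means here, the sketch below turns both forms back into the original string using only the standard library (Python 3). It assumes the stored text contains no literal backslash escapes or character references of its own; if it does, those would be converted as well, so the round trip is only safe for data you control.

```python
import html

s = 'hey \u201chey\u201d hey'

# xmlcharrefreplace: decode the bytes, then resolve the &#NNNN; references.
xml_bytes = s.encode('latin-1', 'xmlcharrefreplace')
from_xml = html.unescape(xml_bytes.decode('latin-1'))

# backslashreplace: the unicode_escape codec interprets \uXXXX sequences
# and treats all remaining bytes as Latin-1.
escaped_bytes = s.encode('latin-1', 'backslashreplace')
from_escapes = escaped_bytes.decode('unicode_escape')

print(from_xml == s, from_escapes == s)
```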

But in general, you have to make the choice between throwing away information, or "hiding" it in some ugly but recoverable form.
