
Converting a PostgreSQL database from SQL_ASCII, containing mixed encoding types, to UTF-8

I have a PostgreSQL database I would like to convert to UTF-8.

The problem is that it is currently SQL_ASCII, so it hasn't been doing any kind of encoding conversion on its input, and as such has ended up with a mix of encodings in its tables. One row might contain values encoded as UTF-8, another might be ISO-8859-x, or Windows-125x, and so on.

This has made it difficult to dump the database and convert it to UTF-8 with the intention of importing it into a fresh UTF-8 database. If the data were all of one encoding type, I could just run the dump file through iconv, but I don't think that approach works here.
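For instance, if everything really were Latin-1, a blanket re-encode along these lines would be enough (a rough sketch with hypothetical file names, plain Python standing in for iconv); with mixed encodings it would silently mangle whichever rows are already UTF-8:

# Rough sketch of the single-encoding case (hypothetical file names).
# A blanket Latin-1 -> UTF-8 re-encode only works if *every* row is Latin-1;
# rows that are already UTF-8 come out double-encoded (mojibake).
with open("database.sql", "rb") as src, open("database.utf8.sql", "wb") as dst:
    dst.write(src.read().decode("iso-8859-1").encode("utf-8"))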

Is the problem fundamentally down to knowing how each piece of data is encoded? Where that is not known, can it be worked out, or even guessed? Ideally I'd love a script which would take a file, any file, and spit out valid UTF-8.

This is exactly the problem that Encoding::FixLatin was written to solve*.

If you install the Perl module then you'll also get the fix_latin command-line utility which you can use like this:

pg_restore -O dump_file | fix_latin | psql -d database

Read the 'Limitations' section of the documentation to understand how it works.
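If you just want a feel for the general technique, here is a simplified Python sketch of the same idea (not the module itself): keep any well-formed UTF-8 sequences, and reinterpret every other high byte as CP1252/Latin-1.

def fix_mixed_bytes(raw: bytes) -> str:
    """Decode a byte string that may mix UTF-8, Latin-1 and CP1252 text.
    Well-formed UTF-8 sequences are kept as-is; any other high byte is
    reinterpreted as CP1252 (falling back to Latin-1 for the handful of
    code points CP1252 leaves undefined)."""
    out = []
    i, n = 0, len(raw)
    while i < n:
        b = raw[i]
        if b < 0x80:                                  # plain ASCII byte
            out.append(chr(b))
            i += 1
            continue
        # Length a UTF-8 sequence starting with this lead byte would have.
        if 0xC2 <= b <= 0xDF:
            length = 2
        elif 0xE0 <= b <= 0xEF:
            length = 3
        elif 0xF0 <= b <= 0xF4:
            length = 4
        else:
            length = 0                                # not a valid lead byte
        chunk = raw[i:i + length]
        if length and len(chunk) == length and all(0x80 <= c <= 0xBF for c in chunk[1:]):
            try:
                out.append(chunk.decode("utf-8"))     # genuine UTF-8: keep it
                i += length
                continue
            except UnicodeDecodeError:
                pass                                  # overlong/surrogate etc.
        # Not UTF-8 here: treat this single byte as CP1252 / Latin-1.
        try:
            out.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            out.append(bytes([b]).decode("latin-1"))
        i += 1
    return "".join(out)

# e.g. fix_mixed_bytes(b"caf\xc3\xa9 vs caf\xe9") == "café vs café"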

[*] Note I'm assuming that when you say ISO-8859-x you mean ISO-8859-1 and when you say CP125x you mean CP1252 - because the mix of ASCII, UTF-8, Latin-1 and WinLatin-1 is a common case. But if you really do have a mixture of eastern and western encodings then sorry but you're screwed :-(

It is impossible without some knowledge of the data first. Do you know whether it is text messages, people's names, or place names? In some particular language?

You can try decoding a line of the dump with each candidate encoding and apply some heuristic: for example, run an automatic spell checker and choose the encoding that produces the fewest errors or the most known words.

You can use, for example, aspell list -l en (en for English, pl for Polish, fr for French, etc.) to get a list of misspelled words, then choose the encoding that generates the fewest of them. You'll need to install the corresponding dictionary package, for example "aspell-en" on my Fedora 13 Linux system.
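A rough sketch of that heuristic in Python (the candidate encoding list is an assumption, and aspell plus the relevant dictionaries must be installed):

import subprocess

# Candidate encodings to score; adjust to whatever your data might contain.
CANDIDATES = ["utf-8", "cp1252", "iso-8859-1", "iso-8859-2"]

def misspelling_count(raw: bytes, encoding: str, lang: str = "en") -> float:
    """Decode raw bytes with one candidate encoding and count how many
    words `aspell list` flags as unknown. Lower is better."""
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError:
        return float("inf")               # not even decodable: worst score
    result = subprocess.run(
        ["aspell", "list", f"--lang={lang}", "--encoding=utf-8"],
        input=text.encode("utf-8"),
        capture_output=True,
    )
    return len(result.stdout.splitlines())

def guess_encoding(raw: bytes, lang: str = "en") -> str:
    """Pick the candidate encoding that yields the fewest unknown words."""
    return min(CANDIDATES, key=lambda enc: misspelling_count(raw, enc, lang))

Run it over a sample of the problem rows rather than the whole dump; the scores only need to separate the candidates.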

I resolved it using these commands:

1-) Export

pg_dump --username=postgres --encoding=ISO88591 database -f database.sql

and then

2-) Import

psql -U postgres -d database < database.sql

These commands helped me solve the SQL_ASCII to UTF-8 conversion problem. (The --encoding option labels the dump as LATIN1, so when it is loaded into a database created with UTF-8 encoding the server converts it on import; this works as long as the non-ASCII rows really are Latin-1.)

I've seen exactly this problem myself, actually. The short answer: there's no straightforward algorithm. But there is some hope.

First, in my experience, the data tends to be:

  • 99% ASCII
  • .9% UTF-8
  • .1% other, 75% of which is Windows-1252.

So let's use that. You'll want to analyze your own dataset, to see if it follows this pattern. (I am in America, so this is typical. I imagine a DB containing data based in Europe might not be so lucky, and something further east even less so.)

First, most every encoding out there today contains ASCII as a subset. UTF-8 does, ISO-8859-1 does, etc. Thus, if a field contains only octets within the range [0, 0x7F] (ie, ASCII characters), then it's probably encoded in ASCII/UTF-8/ISO-8859-1/etc. If you're dealing with American English, this will probably take care of 99% of your data.

On to what's left.

UTF-8 has some nice properties: a character is either a single ASCII byte, or a multi-byte sequence in which every byte after the first is 10xxxxxx in binary. So: attempt to run your remaining fields through a strict UTF-8 decoder (one that will choke if you give it garbage). On the fields it doesn't choke on, my experience has been that they're probably valid UTF-8. (It is possible to get a false positive here: we could have a tricky ISO-8859-1 field that is also valid UTF-8.)

Last, if it's not ASCII, and it doesn't decode as UTF-8, Windows-1252 seems to be the next good choice to try. Almost everything is valid Windows-1252 though, so it's hard to get failures here.

You might do this:

  • Attempt to decode as ASCII. If successful, assume ASCII.
  • If that fails, attempt to decode as UTF-8. If successful, assume UTF-8.
  • Otherwise, fall back to Windows-1252.

For the UTF-8 and Windows-1252, output the table's PK and the "guess" decoded text to a text file (convert the Windows-1252 to UTF-8 before outputting). Have a human look over it, see if they see anything out of place. If there's not too much non-ASCII data (and like I said, ASCII tends to dominate, if you're in America...), then a human could look over the whole thing.
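Here is a minimal Python sketch of that cascade, writing the non-ASCII guesses out for a human to review (the row source and the review-file format are just placeholders):

def classify(raw: bytes):
    """Return (encoding_guess, decoded_text) using the ASCII -> UTF-8 ->
    Windows-1252 cascade described above."""
    try:
        return "ascii", raw.decode("ascii")
    except UnicodeDecodeError:
        pass
    try:
        return "utf-8", raw.decode("utf-8")      # strict: rejects garbage
    except UnicodeDecodeError:
        pass
    # Windows-1252 rarely fails; fall back to Latin-1 for its undefined slots.
    try:
        return "cp1252", raw.decode("cp1252")
    except UnicodeDecodeError:
        return "latin-1", raw.decode("latin-1")

def write_review_file(rows, path="review.tsv"):
    """Write every non-ASCII row out (converted to UTF-8) for a human to
    eyeball. `rows` would come from e.g. a cursor over (pk, raw_bytes) pairs."""
    with open(path, "w", encoding="utf-8") as report:
        for pk, raw in rows:
            guess, text = classify(raw)
            if guess != "ascii":
                report.write(f"{pk}\t{guess}\t{text}\n")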

Also, if you have some idea about what your data looks like, you could restrict decodings to certain characters. For example, if a field decodes as valid UTF-8 text, but contains a "©", and the field is a person's name, then it was probably a false positive, and should be looked at more closely.

Lastly, be aware that when you change to a UTF-8 database, whatever has been inserting this garbage data in the past is probably still there: you'll need to track down this system and teach it character encoding.
