
UTF-8 won't persist on Hibernate + MySQL

I'm trying to save some values to a MySQL database using Hibernate, but most Lithuanian characters, including ąĄ čČ ęĘ ėĖ įĮ ųŲ ūŪ, are saved as ?. However, šŠ žŽ are saved correctly.

If I do inserts manually, then those values are properly saved, so the problem is most likely in Hibernate configuration.

What I have tried so far:

hibernate.charset=UTF-8
hibernate.character_encoding=UTF-8
hibernate.use_unicode=true

---------

properties.put(PROPERTY_NAME_HIBERNATE_USE_UNICODE,
        env.getRequiredProperty(PROPERTY_NAME_HIBERNATE_USE_UNICODE));
properties.put(PROPERTY_NAME_HIBERNATE_CHARSET,
        env.getRequiredProperty(PROPERTY_NAME_HIBERNATE_CHARSET));
properties.put(PROPERTY_NAME_HIBERNATE_CHARACTER_ENCODING,
        env.getRequiredProperty(PROPERTY_NAME_HIBERNATE_CHARACTER_ENCODING));

---------

private void registerCharachterEncodingFilter(ServletContext aContext) {
    CharacterEncodingFilter cef = new CharacterEncodingFilter();
    cef.setForceEncoding(true);
    cef.setEncoding("UTF-8");
    aContext.addFilter("charachterEncodingFilter", cef)
            .addMappingForUrlPatterns(null, true, "/*");
}

As described here

I tried adding ?useUnicode=true&characterEncoding=utf-8 to db connection url.

As described here

I ensured that my db is set to UTF-8 charset. phpmyadmin > information_schema > schemata

def db_name utf8 utf8_lithuanian_ci NULL

This is how I save into db:

//Controller
buildingService.addBuildings(schema.getBuildings());
List<Building> buildings = buildingService.getBuildings();
System.out.println("-----------");
for (Building b : schema.getBuildings()) {
    System.out.println(b.toString());
}
System.out.println("-----------");
for (Building b : buildings) {
    System.out.println(b.toString());
}
System.out.println("-----------");

//Service:
@Override
public void addBuildings(List<Building> buildings) {
    for (Building b : buildings) {
        getCurrentSession().saveOrUpdate(b);
    }
}

The first set of println output contains all the Lithuanian characters, while the second replaces most of them with ?

EDIT: Added details

insert into buildings values (11,'ąĄčČęĘ', 'asda');    
select short, hex(short) from buildings;
//Šalt. was inserted via hibernate
//letters are properly displayed:
ąĄčČęĘ       | C485C484C48DC48CC499C498
MIF Šalt.    | 4D494620C5A0616C742E  

select address, hex(address) from buildings;
 Šaltini? <...> | C5A0616C74696E693F20672E2031412C2056696C6E697573
//should contain "ų"
--------
show create table buildings;
buildings | CREATE TABLE `buildings` (
  `id` int(11) NOT NULL,
  `short` varchar(255) COLLATE utf8_lithuanian_ci DEFAULT NULL,
  `address` varchar(255) COLLATE utf8_lithuanian_ci DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_lithuanian_ci 

EDIT: I did not find a proper solution, so I came up with a workaround. I ended up escaping/unescaping characters, storing them like this: \\uXXXX .
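
The asker's escaping code is not shown, so the following is only a hypothetical sketch of what such a workaround could look like. The class and method names (UnicodeEscapes, escape, unescape) are invented for illustration. It turns every non-ASCII character, and the backslash itself, into a backslash-u escape with four hex digits, so the stored value is pure ASCII and the conversion is reversible.

```java
final class UnicodeEscapes {
    /** Replaces each non-ASCII UTF-16 code unit (and the backslash) with a four-digit hex escape. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80 && c != '\\') {
                sb.append(c);                                 // plain ASCII passes through
            } else {
                sb.append(String.format("\\u%04X", (int) c)); // e.g. U+0173 becomes backslash, u, 0173
            }
        }
        return sb.toString();
    }

    /** Reverses escape(). Assumes well-formed input; malformed hex throws NumberFormatException. */
    static String unescape(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            if (s.startsWith("\\u", i) && i + 6 <= s.length()) {
                sb.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                sb.append(s.charAt(i));
                i++;
            }
        }
        return sb.toString();
    }
}
```

A workaround like this hides the symptom rather than fixing the connection encoding: the escaped values cannot be searched or sorted as Lithuanian text on the database side.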

Let's verify that they were stored correctly... Please do SELECT col, HEX(col) ... to fetch some cell with Lithuanian characters. A correctly stored ą will show C485 . The others should show various hex values of C4xx or C5xx. 3F is ? .

But, more importantly, 4 characters do show. Š should be C5A0 if properly stored as utf8. However, I suspect you will see 8A, implying that the column in the table is really declared as CHARACTER SET latin1. (The 4 characters show up in the first column of my charset blog.)

Do SHOW CREATE TABLE to see how the column is defined. If it says latin1 , then the problem is with the table definition, and you probably ought to start over.
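
To know what HEX(col) should return, you can compute the expected bytes on the Java side. This is a minimal sketch, not part of any library: HexCheck and hex are names made up here, and it relies on windows-1252 being the Java charset that corresponds to MySQL's latin1. The string literals use Unicode escapes so the result does not depend on the source file encoding.

```java
import java.nio.charset.Charset;

final class HexCheck {
    /** Returns the uppercase hex of the bytes that text encodes to, in the style of MySQL's HEX(). */
    static String hex(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("\u0105", "UTF-8"));        // a-ogonek in UTF-8: prints C485
        System.out.println(hex("\u0160", "UTF-8"));        // S-caron in UTF-8: prints C5A0
        System.out.println(hex("\u0160", "windows-1252")); // S-caron in cp1252: prints 8A
    }
}
```

If the database shows 3F where this helper gives a C4xx or C5xx pair, the character was already replaced by ? before it reached the table.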

You have to ensure that every component taking part in data entry uses UTF-8 encoding explicitly.

  • If you enter the values via a browser, make sure that the page is served with the following header: Content-Type: text/html; charset=utf-8

  • Make sure the input form is defined as follows:

    <form action="submit" accept-charset="UTF-8">...</form>

  • If you are creating String objects from byte array, make sure you explicitly state the Charset in the constructor.

  • If your entry happens from a text file, that file has to be UTF-8 encoded.

  • If it is hardcoded directly in your code, then the source has to be UTF-8 encoded.
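
For the byte array point in the list above, here is a small sketch of the difference an explicit Charset makes. The class and method names are illustrative only. The two bytes C5 B3 are the UTF-8 encoding of ų (U+0173).

```java
import java.nio.charset.StandardCharsets;

final class ExplicitCharset {
    /** Decodes the bytes as UTF-8 instead of relying on the platform default charset. */
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Decodes the same bytes as ISO-8859-1, which maps every byte to its own character. */
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xC5, (byte) 0xB3};        // UTF-8 bytes of U+0173
        System.out.println(decodeUtf8(utf8).length());   // prints 1: one correctly decoded letter
        System.out.println(decodeLatin1(utf8).length()); // prints 2: two mojibake characters
    }
}
```

The one-argument new String(bytes) constructor uses the platform default charset, so its result can change from one machine to another.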

The fact that your DB holds correct UTF-8 (two or more bytes for a special letter) is reassuring.

If you get a single ? for a special letter, something tried to convert the text from UTF-8 to an encoding that does not contain that letter. That seems to be the case here. The letters that are converted correctly (š, Š, ž, Ž) exist in Windows-1252, which is often used in place of ISO-8859-1; the others do not. ISO-8859-1, aka Latin-1, is the default HTTP encoding and the default in Java EE servers. You might want to call this before writing the response:

response.setCharacterEncoding("UTF-8");
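
The single ? symptom described above can be reproduced without a database. This sketch simulates a lossy hop by encoding a string to a charset and decoding it back. LossyEncodingDemo and roundTrip are hypothetical names, and the sample word is the start of the address from the question, written with Unicode escapes.

```java
import java.nio.charset.Charset;

final class LossyEncodingDemo {
    /** Encodes text to the given charset and decodes it back, as a mis-set connection might. */
    static String roundTrip(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        // String.getBytes replaces unmappable characters with the charset's
        // replacement byte, which is '?' for the single-byte charsets used here.
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String word = "\u0160altini\u0173"; // S-caron + "altini" + u-ogonek
        // Windows-1252 keeps the S-caron but turns the u-ogonek into '?'.
        System.out.println(roundTrip(word, "windows-1252").endsWith("?"));  // prints true
        // ISO-8859-1 has neither letter, so both become '?'.
        System.out.println(roundTrip(word, "ISO-8859-1"));                  // prints ?altini?
        // UTF-8 can represent everything, so the word survives unchanged.
        System.out.println(roundTrip(word, "UTF-8").equals(word));          // prints true
    }
}
```

The Windows-1252 result matches the pattern in the question, where Š survives and ų is stored as 3F, which suggests that one hop between the application and the table is using a latin1-style encoding.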

One problem with System.out.println is that it uses the system default encoding, so console output can be misleading. Logging to a file with a logger is more informative, as is debugging and inspecting the String and its char array.

That the schema strings seemingly work may be because they come directly from Java source, where the editor encoding and the javac compiler encoding can differ. This can be checked by u-escaping the string literals in Java: "\u0105" instead of "ą".

Make a unit test that writes and reads from the database.
