Skip rows with non-BMP characters (emojis) when using LOAD DATA INFILE with REPLACE option?

Emoji characters are messing up a loading system we built and I'm looking for a simple short-term solution.

It's a Java loading program that uses JDBC to execute MySQL commands with this structure:

LOAD DATA
  LOCAL INFILE 'filepath'
  REPLACE INTO TABLE `SOME_TABLE`
  CHARACTER SET utf8
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\'' ESCAPED BY ''
  LINES TERMINATED BY '\n'
(`col1`,...,`coln`)

SOME_TABLE has ENGINE=InnoDB DEFAULT CHARSET=utf8.

We are running MySQL 5.6.22.

It's been working great for years, but recently the files that we load started to contain occasional non-BMP characters (that happen to be emojis), and the LOAD DATA LOCAL INFILE ... command throws exceptions like:

java.sql.SQLException: Incorrect string value: '\xF0\x9D\x93\x9C' for column 'fieldm' at row 3004
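As a quick check of what the error is complaining about, the four bytes in the message can be decoded in Java. This is only an illustrative sketch: the class and method names are made up for this example. The bytes form a single 4-byte UTF-8 sequence for a supplementary (non-BMP) code point, and MySQL's utf8 charset stores at most 3 bytes per character.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the loader described in the question.
class IncorrectStringValue {

    // Decodes a UTF-8 byte sequence and returns its first Unicode code point.
    static int firstCodePoint(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8).codePointAt(0);
    }

    public static void main(String[] args) {
        // Bytes copied from the exception message: \xF0\x9D\x93\x9C
        byte[] bytes = {(byte) 0xF0, (byte) 0x9D, (byte) 0x93, (byte) 0x9C};
        int cp = firstCodePoint(bytes);
        // prints: U+1D4DC supplementary=true
        System.out.printf("U+%X supplementary=%b%n", cp, Character.isSupplementaryCodePoint(cp));
    }
}
```

U+1D4DC sits in the Mathematical Alphanumeric Symbols block, so strictly speaking it is a styled letter rather than an emoji, but it fails for the same reason an emoji would: it lies outside the BMP.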

I understand that the long-term solution is to move the table to CHARSET=utf8mb4. However, the tables are huge at this point and conversion will not be easy. There are also indexed VARCHAR(255) fields; these would need to be converted to VARCHAR(191) to fit under the maximum key length of 767 bytes, or we would need to switch to the DYNAMIC row format and set innodb_large_prefix=true.
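The 255 and 191 figures follow from simple integer division, sketched below. The helper name is hypothetical; the sketch assumes the 767-byte key limit mentioned above and the maximum per-character widths of the two charsets (3 bytes for utf8, 4 bytes for utf8mb4).

```java
// Hypothetical helper that reproduces the arithmetic behind VARCHAR(255) vs VARCHAR(191).
class IndexPrefixLimit {

    // Largest number of characters whose worst-case encoding still fits in the key.
    static int maxIndexableChars(int maxKeyBytes, int maxBytesPerChar) {
        return maxKeyBytes / maxBytesPerChar;
    }

    public static void main(String[] args) {
        System.out.println(maxIndexableChars(767, 3)); // utf8:    255
        System.out.println(maxIndexableChars(767, 4)); // utf8mb4: 191
    }
}
```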

We are looking for a short-term solution until we get to a point where we have the time and resources to migrate to utf8mb4.

It would be OK, in the short term, to simply discard the rows with non-BMP (emoji) characters. But LOAD DATA LOCAL INFILE filepath REPLACE ... will not skip the bad rows; it fails the entire file.

At this point, it looks like we will need to write some filtering in Java to remove the non-BMP (emoji) rows before calling LOAD DATA LOCAL INFILE filepath REPLACE .... But I am thinking that there must be some way to do this in MySQL without having to introduce that kind of pre-filter.
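For reference, the Java pre-filter could be fairly small. The following is a minimal sketch, not the actual loader: the class and method names are hypothetical, and it assumes the input file is valid UTF-8 with exactly one row per physical line (no newlines embedded in quoted fields).

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical pre-filter; names and file handling are assumptions for illustration.
class NonBmpRowFilter {

    // True if the line contains any code point above U+FFFF (outside the BMP).
    static boolean hasNonBmp(String line) {
        return line.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    // Copies every BMP-only row from `in` to `out` and returns the number of skipped rows.
    static int filter(Path in, Path out) throws IOException {
        int skipped = 0;
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (hasNonBmp(line)) {
                    skipped++;          // drop the whole row, as described above
                    continue;
                }
                writer.write(line);
                writer.write('\n');     // matches LINES TERMINATED BY '\n'
            }
        }
        return skipped;
    }
}
```

The filtered copy would then be the file passed to LOAD DATA LOCAL INFILE, and the returned count could be logged so that dropped rows are not lost silently.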

Does anybody have any ideas for a simple way to get MySQL to simply skip the rows that have non-BMP (emoji) data?

***** UPDATE ***** It looks like using CONVERT might be the solution for the short term. Doing this replaces the emoji with '????' in col4.

LOAD DATA
  LOCAL INFILE 'filepath'
  REPLACE INTO TABLE `SOME_TABLE`
  CHARACTER SET utf8
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\'' ESCAPED BY ''
  LINES TERMINATED BY '\n'
(`col1`,`col2`,`col3`,@q, ...,  `coln`)
  SET `col4` = CONVERT(CONVERT(@q USING utf8mb4) USING utf8);

Does anybody see a problem with that?

In order to store emoji, you must use utf8mb4 throughout, not utf8.

A possible shortcut for the 191-character index issue is to upgrade to MySQL 5.7. There, you can keep VARCHAR(255) and still index the columns.

Only certain columns will be holding emoji, correct? Convert just those columns. (It is OK for different columns in the same table to have different charsets and/or collations.)
