
SQLite unicode slavic accented words Android

I'm trying to filter accented words when the user searches for them in a local database. But I have problems, namely with the Slavic letters Č, Š and Ž. In my SQLite database I have a field "title" with the value "Želodček".

If I select LOWER(title), I always get back the unchanged value "Želodček", while other words are lowercased correctly. The failure occurs only when a word begins with Č, Š or Ž.

Database records

Stomach
Želodček

Uppercase with UPPER()

STOMACH
ŽELODčEK

Lowercase with LOWER()

stomach
Želodček

I've already tried setting the locale with setLocale(), with no luck. I also tried different collations (NOCASE, UNICODE, LOCALIZED), but nothing worked. I'm wondering why LOWER() leaves the first letter uppercase, and why UPPER() leaves the other accented letters in the word lowercase.

I've solved the problem for LIKE searches by replacing the uppercase accented letters with their lowercase counterparts. But I still have a problem with full-text (FTS3) searching, because I can't use the same trick with MATCH.

 -- works but it's a hack
 SELECT title FROM articles WHERE REPLACE(LOWER(title),'Ž','ž') LIKE '%želodček%'
 -- can't seem to get it work
 SELECT title FROM articles WHERE title MATCH 'želodček' COLLATE NOCASE 

Is there any solution to this or is there a bigger problem?

Update: No optimal solution yet.

Un-optimal solution 1: I decided to deal with the problem directly by changing the data in the SELECT query. This doesn't work for all cases (I would have to cover every accented letter), but it suits my case for now, so I'm posting it:

-- LIKE query
SELECT title FROM articles WHERE (REPLACE(REPLACE(REPLACE(LOWER(title),'Č','č'),'Š','š'),'Ž','ž') LIKE ? COLLATE NOCASE)

-- MATCH query (FTS)
-- In this case I programmatically replace the search term with two variants joined by OR (one starting with a lowercase letter, one with an uppercase letter), e.g. title='želodček OR Želodček'
SELECT title FROM articles WHERE title MATCH ? COLLATE UNICODE
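
The two-variant rewrite described in the comment above could be sketched in Java roughly as follows. This is only an illustration: the class and method names are made up, and it assumes the search term is a single word. For "želodček" it produces "želodček OR Želodček", which is then bound to the ? placeholder.

```java
import java.util.Locale;

final class FtsQuery {
    // Hypothetical helper, assuming a single-word search term.
    // Returns "<lowercase> OR <Capitalized>" for binding to the MATCH placeholder.
    static String matchVariants(String term) {
        // Java's case mapping is Unicode-aware, unlike the LOWER() result shown above.
        String lower = term.trim().toLowerCase(Locale.ROOT);
        if (lower.isEmpty()) {
            return lower;
        }
        // Uppercase only the first code point and keep the rest lowercase.
        int first = lower.codePointAt(0);
        String capitalized = new StringBuilder()
                .appendCodePoint(Character.toUpperCase(first))
                .append(lower, Character.charCount(first), lower.length())
                .toString();
        // Terms without a cased first character (e.g. digits) need no second variant.
        return lower.equals(capitalized) ? lower : lower + " OR " + capitalized;
    }
}
```

Note that this covers only the all-lowercase and first-letter-capitalized spellings, so a title stored in all caps would still be missed.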

Un-optimal solution 2: As suggested by user CL., insert the title in normalized form (this alone didn't work for me, because the normalized form was basically the original Unicode form). I took it further and insert the title stripped of accents (basically its ASCII form). As a general solution this is probably better than the first one, which covers only some accents. But there are downsides:

  • the data doubles (one Unicode title and one ASCII title), which can be a problem if you have a lot of data.
  • some characters are not supported (for example, Chinese characters are gone after normalization and stripping).
  • stripping accents introduces ambiguity (e.g. the two words "zelo" and "želo" have different meanings, but both will turn up when searching).

Here's the Java code for it:

import java.text.Normalizer;

// Gets you the ASCII version of the Unicode title, which you insert into a separate column
String titleAsciiName = Normalizer.normalize(title, Normalizer.Form.NFD)
    .replaceAll("[^\\p{ASCII}]", "");
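
One possible variation that avoids losing non-Latin text is to remove only the combining marks that NFD splits off, instead of everything outside ASCII. The sketch below is an assumption-laden illustration rather than a tested Android solution; the class and method names are invented. With it, "Želodček" becomes "Zelodcek", while Chinese characters pass through unchanged.

```java
import java.text.Normalizer;

final class AccentStripper {
    // Hypothetical helper: removes accents but keeps every base character.
    static String stripAccents(String text) {
        // NFD splits e.g. Z-with-caron into "Z" followed by a combining caron.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{M} matches combining marks only, so base letters of any script survive.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Recompose whatever NFD took apart that is not an accent (e.g. Hangul syllables).
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }
}
```

Letters that have no canonical decomposition, such as đ, are left as they are, so they would still need explicit handling. The ambiguity between "zelo" and "želo" remains.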

LIKE never uses a custom collation.

FTS can use a custom tokenizer, but you have to check whether unicode61 is available in all Android versions you want to support.
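
As a rough sketch of what selecting that tokenizer looks like, the helper below only builds the CREATE VIRTUAL TABLE statement. The class name and the table name articles_fts are assumptions for illustration; the statement fails on SQLite builds that do not include unicode61, so it still has to be tried on the oldest Android version you support.

```java
final class FtsSchema {
    // Hypothetical helper: builds the DDL for an FTS3 table that uses the unicode61 tokenizer.
    // The identifiers are concatenated as-is, so pass only trusted, hard-coded names.
    static String createTableSql(String table, String column) {
        return "CREATE VIRTUAL TABLE " + table
                + " USING fts3(" + column + ", tokenize=unicode61)";
    }
}
```

On Android the resulting string would typically be passed to SQLiteDatabase.execSQL(). According to the SQLite documentation, unicode61 folds case using Unicode rules and by default also removes diacritics from Latin characters, so with it a MATCH for 'zelodcek' or 'želodček' should find "Želodček".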


The Android database API does not allow you to create custom implementations of LIKE or of an FTS tokenizer. You might want to store a normalized version of your strings in the database.
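
A minimal sketch of such a normalized value, assuming you only need case-insensitive matching and want to keep the accents: lowercase the string in Java, where the case mapping is Unicode-aware, and store the result in an extra column. The class name and the column name title_key are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

final class SearchKey {
    // Hypothetical helper: computes the value for an assumed extra column (e.g. title_key).
    // Apply it to the title when inserting and to the user's input when searching.
    static String of(String text) {
        // NFC makes precomposed and decomposed spellings of the same letter identical.
        String composed = Normalizer.normalize(text, Normalizer.Form.NFC);
        // Locale.ROOT keeps the mapping independent of the device language.
        return composed.toLowerCase(Locale.ROOT);
    }
}
```

A LIKE query would then compare against the key column, e.g. WHERE title_key LIKE ? with "%" + SearchKey.of(input) + "%" as the argument. Because accents are kept, "zelo" and "želo" remain distinct.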


 