
How to determine if character is Chinese, Korean or Japanese

I have strings that come from database. Each string is either English (ASCII) or Chinese, Korean or Japanese.

I need to detect and delete all Chinese strings;
all English, Korean and Japanese strings must be kept.

Is this possible? I know Japanese text might use Chinese symbols.

I am using PHP.

Update:

I am not trying to detect the language. Detecting the encoding will be enough. However, I am unsure about the difference between Chinese and Japanese: do they use the same encoding or different encodings?

Let's clear up some terms first:

A language is a human language, like English, Chinese, Korean or Japanese. Languages are written using writing systems consisting of characters/ideographs/letters. Several languages share writing systems; you can use the Latin alphabet to write a whole bunch of different languages like English, French, German etc. Those writing systems are encoded in a computer using an encoding, which makes it possible to express individual characters using only binary notation (1s and 0s).

Now:

  • Japanese partly shares its writing system with Chinese; Japanese uses Chinese characters (kanji) in addition to some exclusively Japanese characters (hiragana, katakana). The Latin alphabet is also used in Japanese.
  • Chinese characters are also partially used in Korean writing, though Korean can be written exclusively in hangul.
  • Any or all of these languages can be encoded in various ways; there are encodings which are primarily intended for Chinese or Korean or Japanese, but the most widely used Unicode encodings (e.g. UTF-8) can express all those languages in the same encoding and are not biased towards any one particular language.

Given all this, what you want is somewhere between unclear and impossible. You could remove all Chinese characters from a text (remove any character that is used in Chinese), but in the case of Japanese that would also mean largely removing the Japanese text (less so for Korean, but same issue). It would be like removing Latin letters from an English text; there's not a lot left if you do that. You could try to detect whether some text is encoded in some encoding primarily biased towards one specific language, but if your text is encoded in a Unicode encoding there's nothing there to differentiate. You could try language analysis to detect the language used in your text, but you stated that you do not want to detect "languages".
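To illustrate why removing Chinese characters guts Japanese text, here is a minimal sketch. It assumes the strings are UTF-8 and relies on PCRE's Unicode script property `\p{Han}`; the helper name `stripHan` is made up for this example.

```php
<?php
// Hypothetical helper: delete every Han (Chinese) character from a UTF-8 string.
function stripHan(string $text): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    // preg_replace returns null on error (e.g. invalid UTF-8), so fall back to the input.
    return preg_replace('/\p{Han}+/u', '', $text) ?? $text;
}

echo stripHan('日本語の文章を書く'), PHP_EOL; // prints "のをく": only the kana survive
echo stripHan('한국어'), PHP_EOL;             // prints "한국어": hangul is untouched
```

The Japanese sentence loses all of its kanji and is no longer readable, which is the same effect as stripping Latin letters from English.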

You could try to detect whether some specifically Korean (hangul) or Japanese (kana) characters are in a string; that would be a good indication that the text is likely in one of those languages. However, you will get false negatives in the case of Japanese, since it's perfectly possible for a short phrase to contain exclusively Chinese characters and still be valid Japanese.
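A rough sketch of that heuristic, again assuming UTF-8 input and PCRE's Unicode script properties. The function name `guessCjkLanguage` and its return labels are invented for illustration; this is a script check, not real language detection.

```php
<?php
// Hypothetical helper: guess a language from the scripts present in a UTF-8 string.
function guessCjkLanguage(string $text): string
{
    // Any hangul is a strong hint that the text is Korean.
    if (preg_match('/\p{Hangul}/u', $text) === 1) {
        return 'korean';
    }
    // Any hiragana or katakana is a strong hint that the text is Japanese.
    if (preg_match('/[\p{Hiragana}\p{Katakana}]/u', $text) === 1) {
        return 'japanese';
    }
    // Han characters only: could be Chinese, but also kanji-only Japanese.
    if (preg_match('/\p{Han}/u', $text) === 1) {
        return 'han-only';
    }
    // No CJK script found (e.g. plain ASCII English).
    return 'other';
}

echo guessCjkLanguage('日本語のテキスト'), PHP_EOL; // japanese
echo guessCjkLanguage('東京都'), PHP_EOL;           // han-only, although it is valid Japanese
```

The second call shows the false negative described above: deleting every `han-only` string would also delete short Japanese phrases written entirely in kanji.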

The only advice I can give for the question as stated is to go back to the drawing board and figure out what exactly you want to do.

