Reading CJK characters from an input file in C

Question

I have a text file that can contain a mix of Chinese, Japanese, Korean (CJK) and English characters. I have to validate that the file contains only English characters. CJK characters are allowed only on lines that begin with the '$' character, which marks a comment in my text file. Searching the net, I found that I can use fgetws() and the wchar_t type to read wide characters.

Q1) How are CJK characters stored in my text file (byte order, etc.)?

Q2) How can I loop through CJK characters? Since Unicode characters can take 1 to 6 bytes, I cannot use i++.

Any help would be appreciated.

Thanks a lot.

Answer 1

You need to read the UTF-8 file as a sequence of UTF-32 codepoints. For example:

std::shared_ptr<FILE> f(fopen(filename, "r"), fclose);
uint32_t c = 0;
while (utf8_read(f.get(), c))
{
    if (is_english_char(c))
        ...
    else if (is_cjk_char(c))
        ...
    else
        ...
}

Where utf8_read has the signature:

bool utf8_read(FILE *f, uint32_t &c);

Now, utf8_read reads 1 to 4 bytes per codepoint; the value of the first byte determines how many follow. See http://en.wikipedia.org/wiki/UTF-8 for the encoding, search for an existing algorithm, or use a library function already available to you.
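For reference, here is a minimal sketch of such a reader in plain C. Since C has no references, this variant takes a pointer and returns an int, which is an assumption that differs from the signature shown above. It also reports end of file and malformed input with the same return value, which a real validator would probably want to tell apart.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Reads one UTF-8 encoded codepoint from f into *c.
   Returns 1 on success, 0 on end of file or on a malformed sequence. */
int utf8_read(FILE *f, uint32_t *c)
{
    int first, extra, i;
    uint32_t cp, min;

    assert(f != NULL && c != NULL);

    first = fgetc(f);
    if (first == EOF)
        return 0;

    if (first < 0x80) {                      /* 0xxxxxxx: ASCII, one byte */
        *c = (uint32_t)first;
        return 1;
    } else if ((first & 0xE0) == 0xC0) {     /* 110xxxxx: one more byte */
        extra = 1; cp = (uint32_t)(first & 0x1F); min = 0x80;
    } else if ((first & 0xF0) == 0xE0) {     /* 1110xxxx: two more bytes */
        extra = 2; cp = (uint32_t)(first & 0x0F); min = 0x800;
    } else if ((first & 0xF8) == 0xF0) {     /* 11110xxx: three more bytes */
        extra = 3; cp = (uint32_t)(first & 0x07); min = 0x10000;
    } else {
        return 0;   /* stray continuation byte or invalid lead byte */
    }

    for (i = 0; i < extra; i++) {
        int next = fgetc(f);
        if (next == EOF || (next & 0xC0) != 0x80)
            return 0;                        /* continuation must be 10xxxxxx */
        cp = (cp << 6) | (uint32_t)(next & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    *c = cp;
    return 1;
}
```

With this variant the loop becomes while (utf8_read(fp, &c)) { ... }. Opening the file in binary mode ("rb") is probably the safer choice on platforms that translate line endings.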

With the UTF-32 codepoint, you can now check ranges. For English, you can check whether it is ASCII ( c <= 0x7F ) or whether it is a Latin character (which adds support for accented characters in words imported from, e.g., French). You may also want to exclude non-printable control characters (e.g. 0x01 ).
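A possible shape for the English check, as a sketch only. The helper names is_ascii_text and is_latin_letter are made up for this example, and which accented letters count as "English" is an assumption (here: the letters of the Latin-1 Supplement block plus Latin Extended-A). Control characters other than tab, newline and carriage return are rejected.

```c
#include <stdint.h>

/* Returns 1 for printable ASCII plus tab, newline and carriage return. */
int is_ascii_text(uint32_t c)
{
    return (c >= 0x20 && c <= 0x7E) || c == '\t' || c == '\n' || c == '\r';
}

/* Returns 1 for the letters of Latin-1 Supplement (U+00C0..U+00FF)
   and for Latin Extended-A (U+0100..U+017F). */
int is_latin_letter(uint32_t c)
{
    if (c >= 0x00C0 && c <= 0x00FF)
        return c != 0x00D7 && c != 0x00F7;  /* skip the multiplication and division signs */
    return c >= 0x0100 && c <= 0x017F;
}

/* Hypothetical policy: ASCII text or an accented Latin letter. */
int is_english_char(uint32_t c)
{
    return is_ascii_text(c) || is_latin_letter(c);
}
```

If accented letters should not be accepted at all, is_english_char can simply return is_ascii_text(c).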

For the Latin and/or CJK character checks, you can check if the character is in a given code block (see http://www.unicode.org/Public/UNIDATA/Blocks.txt for the codepoint ranges). This is the simplest approach.
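A block-based CJK check could look roughly like this. The table is a hand-picked subset of Blocks.txt chosen for illustration, not a complete list of CJK blocks; extend it with whatever other blocks your files need.

```c
#include <stddef.h>
#include <stdint.h>

struct range { uint32_t first, last; };

/* A subset of the CJK-related blocks from Blocks.txt. */
static const struct range cjk_blocks[] = {
    { 0x1100,  0x11FF  },   /* Hangul Jamo */
    { 0x3000,  0x303F  },   /* CJK Symbols and Punctuation */
    { 0x3040,  0x309F  },   /* Hiragana */
    { 0x30A0,  0x30FF  },   /* Katakana */
    { 0x3400,  0x4DBF  },   /* CJK Unified Ideographs Extension A */
    { 0x4E00,  0x9FFF  },   /* CJK Unified Ideographs */
    { 0xAC00,  0xD7AF  },   /* Hangul Syllables */
    { 0xF900,  0xFAFF  },   /* CJK Compatibility Ideographs */
    { 0xFF00,  0xFFEF  },   /* Halfwidth and Fullwidth Forms */
    { 0x20000, 0x2A6DF }    /* CJK Unified Ideographs Extension B */
};

/* Returns 1 if c falls inside one of the blocks listed above. */
int is_cjk_char(uint32_t c)
{
    size_t i;
    for (i = 0; i < sizeof cjk_blocks / sizeof cjk_blocks[0]; i++) {
        if (c >= cjk_blocks[i].first && c <= cjk_blocks[i].last)
            return 1;
    }
    return 0;
}
```

A linear scan is fine for a table this small; with many more ranges, a binary search over the sorted table would be the usual next step.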

If you are using a library with Unicode support that can detect the writing script (e.g. GLib), you can use the script type to classify the characters. Alternatively, you can get the data from http://www.unicode.org/Public/UNIDATA/Scripts.txt :

Name     : Code      : Language(s)
=========:===========:========================================================
Common   : Zyyy      : general punctuation / symbol characters
Latin    : Latn      : Latin languages (English, German, French, Spanish, ...)
Han      : Hani      : Chinese characters (Chinese, Japanese, Korean); Hans/Hant are the Simplified/Traditional variants
Hiragana : Hira      : Japanese
Katakana : Kana      : Japanese
Hangul   : Hang      : Korean

NOTE: The script codes come from http://www.iana.org/assignments/language-subtag-registry ( Type == 'script' ).

Answer 2

You need to understand UTF-8 and use a UTF-8 handling library (or write your own code). FYI, GLib (from GTK) has UTF-8 handling functions that can deal with variable-length UTF-8 characters and strings. There are other options, e.g. iconv (inside GNU libc), ICU, and many others.

UTF-8 does define the byte order and content of multi-byte UTF-8 characters, e.g. Chinese ones, so there is no endianness to worry about when the file is UTF-8.
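One consequence of that byte-oriented design is that every byte of a multi-byte UTF-8 sequence has its high bit set, so the rule from the question can be checked without decoding anything. The sketch below assumes the file is UTF-8 and that a line has already been read into a char buffer (for example with fgets); line_is_valid is a made-up name.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the line is acceptable, 0 otherwise.
   A line is acceptable when it is a comment (first byte is '$')
   or when every byte is ASCII (high bit clear). */
int line_is_valid(const char *line)
{
    const unsigned char *p = (const unsigned char *)line;

    assert(line != NULL);

    if (*p == '$')
        return 1;
    for (; *p != '\0'; p++) {
        if (*p >= 0x80)
            return 0;   /* byte belongs to a multi-byte UTF-8 sequence */
    }
    return 1;
}
```

This rejects any non-ASCII character, not just CJK ones. If the file starts with a UTF-8 byte order mark, the first line begins with the bytes EF BB BF and would need to be handled separately.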

Answer 3

I am pasting a sample program to illustrate wchar_t handling. Hope it helps someone.

#include <stdio.h>
#include <locale.h>
#include <wchar.h>
#define BUFLEN 1024
int main(void) {
  const wchar_t *wmessage = L"Lets- beginめん（下）　震災後、保存-食で-脚光-（経済ナビゲーター）-lets- end";
  wchar_t warray[BUFLEN + 1];
  wchar_t a = L'z';
  int i = 0;
  FILE *fp;
  const wchar_t *delim = L"-";  /* delimiter used for tokenizing */
  wchar_t *token;               /* current token returned by wcstok */
  wchar_t *save;                /* wcstok's saved position between calls */
  /* Use the environment's locale so multibyte input is converted to wchar_t */
  setlocale(LC_ALL, "");
  /* File in the current directory containing CJK chars */
  fp = fopen("input", "r");
  if (fp == NULL) {
      fprintf(stderr, "%s\n", "Cannot open file!!!");
      return 1;
  }
  /* Read the first line; stop if the read or the conversion fails */
  if (fgetws(warray, BUFLEN, fp) == NULL) {
      fprintf(stderr, "%s\n", "Cannot read from file!!!");
      fclose(fp);
      return 1;
  }
  wprintf(L"\n *********************START reading from file*******************************\n");
  wprintf(L"%ls\n", warray);
  wprintf(L"\n*********************END reading from file*******************************\n");
  fclose(fp);
  /* %lc expects a wint_t and %x an unsigned int, hence the casts */
  wprintf(L"printing character %lc = <0x%x>\n", (wint_t)a, (unsigned int)a);
  wprintf(L"\n*********************START Checking string for Japanese*******************************\n");
  for (i = 0; wmessage[i] != L'\0'; i++) {
      if (wmessage[i] > 0x7F) {
          wprintf(L"\n This is non-ASCII <0x%x> <%lc>", (unsigned int)wmessage[i], (wint_t)wmessage[i]);
      } else {
          wprintf(L"\n This is ASCII <0x%x> <%lc>", (unsigned int)wmessage[i], (wint_t)wmessage[i]);
      }
  }
  wprintf(L"\n*********************END Checking string for Japanese*******************************\n");
  wprintf(L"\n*********************START Tokenizing******************************\n");
  /* Split the line read from the file on '-' */
  token = wcstok(warray, delim, &save);
  while (token != NULL) {
      wprintf(L"\n %ls", token);
      token = wcstok(NULL, delim, &save);
  }
  wprintf(L"\n*********************END Tokenizing******************************\n");
  return 0;
}
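
To connect this back to the question, the same comparison against 0x7F can be wrapped into a per-line check for lines read with fgetws(). This is only a sketch: wline_is_valid is a made-up name, and it treats every non-ASCII character as disallowed rather than testing for CJK specifically.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Returns 1 if the wide line is acceptable, 0 otherwise.
   Comment lines (starting with '$') may contain anything;
   all other lines must be pure ASCII. */
int wline_is_valid(const wchar_t *line)
{
    size_t i;

    assert(line != NULL);

    if (line[0] == L'$')
        return 1;
    for (i = 0; line[i] != L'\0'; i++) {
        /* Cast so the comparison also behaves if wchar_t is signed. */
        if ((unsigned long)line[i] > 0x7FUL)
            return 0;
    }
    return 1;
}
```

Calling it on each line in a loop such as while (fgetws(warray, BUFLEN, fp) != NULL) would validate the whole file, provided the locale set with setlocale() matches the file's encoding.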
