简体   繁体   中英

Does POSIX regex.h provide unicode or basically non-ascii characters?

Hi i am using Standard Regex Library (regcomp, regexec..). But now on demand i should add unicode support to my codes for regular expressions.

Does Standard Regex Library provide unicode or basically non-ascii characters? I researched on the Web, and think not.

My project is resource critic therefore i don't want to use large libraries for it (ICU and Boost.Regex).

Any help would be appreciated..

Looks like POSIX Regex working properly with UTF-8 locale. I've just wrote a simple test (see below) and used it for matching string with a cyrillic characters against regex "[[:alpha:]]" (for example). And everything working just fine.

Note: The main thing you must remember - regex functions are locale-related. So you must call setlocale() before it.

#include <sys/types.h>
#include <string.h>
#include <regex.h>
#include <stdio.h>
#include <locale.h>

int main(int argc, char** argv) {
  int ret;
  regex_t reg;
  regmatch_t matches[10];

  if (argc != 3) {
    fprintf(stderr, "Usage: %s regex string\n", argv[0]);
    return 1;
  }

  setlocale(LC_ALL, ""); /* Use system locale instead of default "C" */

  if ((ret = regcomp(&reg, argv[1], 0)) != 0) {
    char buf[256];
    regerror(ret, &reg, buf, sizeof(buf));
    fprintf(stderr, "regcomp() error (%d): %s\n", ret, buf);
    return 1;
  }

  if ((ret = regexec(&reg, argv[2], 10, matches, 0)) == 0) {
    int i;
    char buf[256];
    int size;
    for (i = 0; i < sizeof(matches) / sizeof(regmatch_t); i++) {
      if (matches[i].rm_so == -1) break;
      size = matches[i].rm_eo - matches[i].rm_so;
      if (size >= sizeof(buf)) {
        fprintf(stderr, "match (%d-%d) is too long (%d)\n",
                matches[i].rm_so, matches[i].rm_eo, size);
        continue;
      }
      buf[size] = '\0';
      printf("%d: %d-%d: '%s'\n", i, matches[i].rm_so, matches[i].rm_eo,
             strncpy(buf, argv[2] + matches[i].rm_so, size));

    }
  }

  return 0;
}

Usage example:

$ locale
LANG=ru_RU.UTF-8
LC_CTYPE="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
... (skip)
LC_ALL=
$ ./reg '[[:alpha:]]' ' 359 фыва'
0: 5-7: 'ф'
$

The length of the matching result is two bytes because cyrillic letters in UTF-8 takes so much.

Basically, POSIX regexes are not Unicode aware. You can try to use them on Unicode characters, but there might be problems with glyphs that have multiple encodings and other such issues that Unicode aware libraries handle for you.

From the standard, IEEE Std 1003.1-2008 :

Matching shall be based on the bit pattern used for encoding the character, not on the graphic representation of the character. This means that if a character set contains two or more encodings for a graphic symbol, or if the strings searched contain text encoded in more than one codeset, no attempt is made to search for any other representation of the encoded symbol. If that is required, the user can specify equivalence classes containing all variations of the desired graphic symbol.

Maybe libpcre would work for you? It's slightly heavier than POSIX regexes, but I would think it lighter than ICU or Boost.

如果你的意思是“标准”,即来自C ++ 11的std::regex ,那么你需要做的就是切换到std::wregex (当然还有std::wstring )。

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM