
Character Decoding Conversion Function Implementation

I need to implement a character encoding conversion function in C++ or C (preferably C) that converts from a custom encoding scheme (designed to support multiple languages in a single encoding) to UTF-8.

Our encoding is fairly arbitrary; it looks like this

Because the mapping is arbitrary, I am thinking of using std::map to map our encoding to UTF and vice versa, with one map per direction, and using these maps for the conversion. Is there a more optimized data structure or approach?

If your code points are contiguous, just make a big char * array and translate using that. I don't really understand what you mean by a UTF-8 code point: UTF-8 has byte representations, and Unicode has code points. If you want code points, use an array of ints.

const int mycode_to_unicode[] = {
    0x00ff,
    0x0102,
    // etc.
};

If there are holes in your encoding, you could put a value like -1 in those slots to catch errors.
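As a rough sketch of that idea, the lookup might be wrapped in a function that checks the range and reports holes. The table contents and the name to_unicode are made up for illustration; only the -1 sentinel comes from the answer above.

```c
#include <stddef.h>

/* Hypothetical table: index = custom code, value = Unicode code point.
   -1 marks a hole (a custom code with no assigned character). */
static const int mycode_to_unicode[] = {
    0x00ff,   /* custom code 0 */
    0x0102,   /* custom code 1 */
    -1,       /* custom code 2: unassigned */
    0x4e2d,   /* custom code 3 */
};

#define TABLE_SIZE (sizeof mycode_to_unicode / sizeof mycode_to_unicode[0])

/* Returns the Unicode code point for `code`, or -1 if `code` is out of
   range or falls on a hole. */
int to_unicode(int code)
{
    if (code < 0 || (size_t)code >= TABLE_SIZE)
        return -1;
    return mycode_to_unicode[code];
}
```

A caller can then treat any negative result as "no such character" without caring whether the code was out of range or simply unassigned.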

Going the other way is just a matter of making an array of structs of the same size, something like

struct code_pair {
    int mycode;
    int unicode;
};

copying the keys of the array into mycode and the values into unicode, running it through qsort with a function that compares the unicode values, and then using bsearch with the same function to go from a code point to your encoding.

This is assuming you want to use C.
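The reverse direction described above could look roughly like this. The forward table contents and the names code_pair, forward, cmp_unicode, build_reverse and from_unicode are placeholders chosen for the sketch; qsort and bsearch are the standard C library functions.

```c
#include <stdlib.h>

struct code_pair {
    int mycode;
    int unicode;
};

/* Hypothetical forward table: index = custom code, value = code point. */
static const int forward[] = { 0x00ff, 0x0102, 0x4e2d, 0x0041 };
#define N_CODES (sizeof forward / sizeof forward[0])

static struct code_pair reverse[N_CODES];

/* Orders pairs by Unicode code point; shared by qsort and bsearch. */
static int cmp_unicode(const void *a, const void *b)
{
    const struct code_pair *pa = a;
    const struct code_pair *pb = b;
    return (pa->unicode > pb->unicode) - (pa->unicode < pb->unicode);
}

/* Copies keys and values out of the forward table, then sorts by unicode. */
void build_reverse(void)
{
    for (size_t i = 0; i < N_CODES; i++) {
        reverse[i].mycode = (int)i;
        reverse[i].unicode = forward[i];
    }
    qsort(reverse, N_CODES, sizeof reverse[0], cmp_unicode);
}

/* Returns the custom code for a Unicode code point, or -1 if unmapped. */
int from_unicode(int cp)
{
    struct code_pair key = { 0, cp };
    const struct code_pair *hit =
        bsearch(&key, reverse, N_CODES, sizeof reverse[0], cmp_unicode);
    return hit ? hit->mycode : -1;
}
```

build_reverse() has to run once before the first from_unicode() call; after that each lookup costs O(log n) comparisons.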

A hash table would surely be the fastest solution.

If the table is known up front and never changes (as I understand is the case), you can determine a perfect hash for it, meaning that you will have no collisions and guaranteed constant retrieval time (possibly at the expense of some space).

I've used gperf a couple of times, but I suggest you check Bob Jenkins' great page on hashing (which covers minimal perfect hashing as well).
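To give a feel for what "perfect" means here, the sketch below brute-forces a multiplier for a tiny fixed key set until no two keys share a slot. This is not how gperf or the generators on Bob Jenkins' page work; the keys, values, table size and all function names are invented for the example, and a real table would normally be generated offline by a dedicated tool.

```c
#include <stdint.h>

#define N_KEYS  5
#define N_SLOTS 8   /* more slots than keys: trades some space for speed */

/* Hypothetical custom codes and the code points they map to. */
static const uint32_t keys[N_KEYS] = { 0x0101, 0x0102, 0x0250, 0x1f00, 0x7a3c };
static const int values[N_KEYS]    = { 0x00ff, 0x0102, 0x4e2d, 0x0041, 0x20ac };

static uint32_t multiplier;          /* set by find_perfect_hash() */
static int slot_to_index[N_SLOTS];   /* slot -> index into keys[], or -1 */

/* Multiplicative hash: keeps the top 3 bits, i.e. a slot in 0..7. */
static uint32_t hash_slot(uint32_t key, uint32_t mult)
{
    return (uint32_t)(key * mult) >> 29;
}

/* Tries odd multipliers until every key lands in its own slot.
   Returns 1 on success, 0 if nothing in the search range works. */
int find_perfect_hash(void)
{
    for (uint32_t seed = 1; seed < 100000u; seed++) {
        uint32_t mult = (seed * 2654435761u) | 1u;
        int ok = 1;
        for (int s = 0; s < N_SLOTS; s++)
            slot_to_index[s] = -1;
        for (int i = 0; i < N_KEYS && ok; i++) {
            uint32_t s = hash_slot(keys[i], mult);
            if (slot_to_index[s] != -1)
                ok = 0;              /* collision: try the next multiplier */
            else
                slot_to_index[s] = i;
        }
        if (ok) {
            multiplier = mult;
            return 1;
        }
    }
    return 0;
}

/* Returns the mapped value, or -1 if `key` is not in the table. */
int perfect_lookup(uint32_t key)
{
    int i = slot_to_index[hash_slot(key, multiplier)];
    if (i < 0 || keys[i] != key)
        return -1;
    return values[i];
}
```

Once a multiplier is found, a lookup is one multiply, one shift and one comparison, with no collision handling. The final key comparison is still needed to reject codes that were never in the table.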

Since you build the constant mappings up front and use them only for lookups, a hash table might be a better fit than std::map. Before C++11 the C++ standard had no hash table (C++11 added std::unordered_map), but many free implementations are available in both C and C++.

These are C implementations:

http://www.cl.cam.ac.uk/~cwc22/hashtable/

http://wiki.portugal-a-programar.org/c:snippet:hash_table_c

Glibc hash tables.

I'm not sure I understand the question, but if the 1:1 mapping is not too big, using a preinitialized struct array may be the way to go (depending on the code, you could write a program that emits the contents of the init table once):

typedef struct MAP { int from, to; } MAP;  /* typedef so this also compiles as C */

MAP somemapping[MAXMAP] = {
    { 0x101,  0x01 },
    { 0x102,  0x02 },
    /* etc. */
};

Using bsearch() would be a reasonably quick way to do lookups, provided the table is kept sorted by the from field.

If the code is extremely performance-sensitive, you could build an index-based lookup table:

int lookup[65536];

/* Build the lookup table once at startup.
   Assumes every `from` value is in the range 0..65535. */
void init(void)
{
  for (int i = 0; i < MAXMAP; i++) {
     lookup[somemapping[i].from] = somemapping[i].to;
  }
}

void foo(void)
{
  /* ... */
  /* quick lookup */
  to = lookup[from];
  /* ... */
}
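Whichever table you pick, it only gets you from a custom code to a Unicode code point; the UTF-8 bytes still have to be produced from that. Here is a minimal sketch of that last step. The function names and the convention that a negative table entry means "unmapped" are assumptions for the example (the lookup array above would hold 0 for unset entries unless you initialize it differently), and unmapped codes are simply skipped.

```c
#include <stddef.h>

/* Encodes one Unicode code point as UTF-8 into `out` (room for 4 bytes).
   Returns the number of bytes written, or 0 for an invalid code point. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;   /* surrogates cannot be encoded in UTF-8 */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;           /* beyond the Unicode range */
}

/* Converts `n` custom codes to UTF-8 using `table` (index = custom code,
   value = code point, negative = unmapped). `out` needs room for 4 * n
   bytes. Unmapped or out-of-range codes are skipped. Returns bytes written. */
size_t convert_to_utf8(const int *codes, size_t n,
                       const int *table, size_t table_size,
                       unsigned char *out)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        int c = codes[i];
        if (c < 0 || (size_t)c >= table_size || table[c] < 0)
            continue;
        len += utf8_encode((unsigned long)table[c], out + len);
    }
    return len;
}
```

Depending on your needs you may prefer to emit U+FFFD or return an error for unmapped codes instead of skipping them.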
