C program to replace extended ASCII characters and test/print them on the CLI

Question

I currently have the following code to look up and replace extended ASCII characters via a lookup list (which is actually an array of char arrays). The replacement itself seems to work fine (although any tips for improvement are always welcome), but when I use it on the CLI (Ubuntu 15.04), I get weird symbols back. Now I'm a bit confused: is this because my C code is not good enough, or because my terminal does not "know" how to print certain characters?

-------------- C code --------------

/* Include system header files.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>




unsigned char* sanitizeString(unsigned char *pCharArg1)
{
    unsigned char *pCharWorker = pCharArg1;

    /* The look-up map
     */
    unsigned char* charLookup[] = { "ab","àa", "ss", "åa", "ÅA", "ÿy", "XX","" };

    /* For every character in the input string we're going to verify
     * if the character needs to be replaced with one from the look-up
     * map.
     */
    while ( *pCharWorker != '\0' ) { 
        printf( "STARTING NEXT CHAR \n");   
        int finishedFlag = 0;
        //if ( (((int) *pCharWorker >= 65) && ((int) *pCharWorker <= 122)) ) {
            int j = 0;  
            /*
             * Loop the look-up map
             */
            while ((*(charLookup[j]) !='\0') && (finishedFlag == 0)) {
                printf( "Analazying *pCharWorker CHAR : %c \n", *pCharWorker    );
                printf( "Analazying *pCharWorker INT : %d \n", *pCharWorker    );
                printf( "Analazying *(charLookup[j]) CHAR  : %c \n", *(charLookup[j])    );         
                printf( "Analazying *(charLookup[j]) INT : %d \n", *(charLookup[j])    );       
                /* Inspected character matches one from the lookup map,
                 * so fetch the new character and assign it.
                 */
                if( *pCharWorker == *(charLookup[j]) ){
                    printf( "Detected char: %c \n", *pCharWorker   ); 
                    *pCharWorker = *(charLookup[j]+1);
                    printf( "Replaced with char: %c \n", *pCharWorker   ); 
                    finishedFlag = 1;   
                }
                j++;
            }
    //  }    
        printf( "======================= \n"  );             
        pCharWorker++;      
    }
    return pCharArg1;     
}


int main( int argc,  char* argv[] ){
    unsigned char* z = argv[1];
    printf( "PRINT : %s \n",  z ); 
    unsigned char* p2 = sanitizeString( z);
    printf( "Sanitized string: %s \n",  p2 ); 
    return 0;
}

For example, compiling and running it gives:

koen@beite-f1:~$ gcc -o san sanitize.c

koen@beite-f1:~$ ./san ç

PRINT : ç

STARTING NEXT CHAR

Analazying *pCharWorker CHAR :

...

Sanitized string:

Many thanks for any input.

br, Koen.

Answer 1

Your translation is failing because, when charLookup is created, some of the strings are longer than 2 chars: the accented characters in them are stored as variable-length UTF-8. Each entry is laid out as utf8_string,output_char. Dump the strings out in hex and you'll see.

For example, the second element, which translates an accented "a", has a hex value of

C3 A0 61 00

Consider reversing the order within each of the elements in charLookup. That way, you'll have output_char,utf8_string and the second element becomes:

61 C3 A0 00

That way you can modify your code a bit. Note that you'll need to split pCharWorker into separate source and destination pointers, as in pCharInput/pCharOutput:

char *xlat = charLookup[j];
char clean_char = xlat[0];
char *dirty_utf8 = xlat + 1;
int dirty_len = strlen(dirty_utf8);

if (strncmp(pCharInput,dirty_utf8,dirty_len) == 0) {
    *pCharOutput++ = clean_char;
    pCharInput += dirty_len;
}
else {
    *pCharOutput++ = *pCharInput++;
}

NOTE: at the bottom of the function, you'll need a *pCharOutput = 0; that you didn't need before.

The above is just a fragment to give you the idea, but it should be easy to incorporate. Note that I wrote xlat et al. as definitions with initializers for brevity. You may split them into definitions at the top of the function and assignments in the loop body if you wish.

You can also add an optimization, taking advantage of the fact that a UTF-8 multi-byte sequence can only start at the current position in the input string if the char is >= 0x80 (MSB set). If it is not, you can skip the pass through charLookup. So:

// skip charLookup scan if unnecessary
if ((*pCharInput & 0x80) == 0) {
    *pCharOutput++ = *pCharInput++;
    continue;
}

UPDATE:
Since you were amenable to tips, here's the full version as I would do it. Note that the translation array should be global/static, or the function prologue will recreate it on every entry. Also, the strlen/strncmp calls are not strictly necessary, although the code below still uses them. I've also changed the loops around.

NOTE: This example has special handling for UTF-8 input that is not found in the translation table, so at a minimum take a look at it.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

char *xlatlist[] = { "ba","aà", "ss", "aå", "AÅ", "yÿ", "XX", NULL };

// sanitize -- clean up utf8 sequences in a string
void
sanitize(char *dst,int uglyflg)
// uglyflg -- 1=output unknown utf8's
{
    char *src;
    char * const *xlatptr;
    const char *xlatstr;
    const char *utf8;
    int len;
    int foundflg;
    int chr;

    src = dst;

    while (1) {
        chr = *src;
        if (chr == 0)
            break;

        // skip translation loop if not utf-8
        if ((chr & 0x80) == 0) {
            *dst++ = *src++;
            continue;
        }

        // try to match a translation
        foundflg = 0;
        for (xlatptr = xlatlist;  ;  ++xlatptr) {
            xlatstr = *xlatptr;
            if (xlatstr == NULL)
                break;

            utf8 = xlatstr + 1;
            len = strlen(utf8);
            if (strncmp(src,utf8,len) == 0) {
                *dst++ = xlatstr[0];
                foundflg = 1;
                src += len;
                break;
            }
        }

        // utf8 translation found
        if (foundflg)
            continue;

        // NOTES:
        // (1) because of the optimization above, src _is_ pointing to a utf8
        //     but we have _no_ translation for it
        // (2) we can choose to skip it or just output it [and hope for the
        //     best], but ...
        // (3) first, we need to find the length of the utf8 string, so we
        //     only skip/output one utf8 string/char (e.g. we could have
        //     back-to-back utf8 strings)
        // (4) for reference here, the utf8 encoding is:
        //       byte 0: 11xxxxxx
        //       byte 1: 10xxxxxx

        // output the first char of the unknown utf8 sequence
        if (uglyflg)
            *dst++ = *src;
        ++src;

        // output the remaining ones
        for (;  ; ++src) {
            chr = *src;

            // EOS
            if (chr == 0)
                break;

            // back to ascii
            if ((chr & 0x80) == 0)
                break;

            // start of new utf8 string
            if ((chr & 0x40) != 0)
                break;

            // output the unknown utf8 secondary char
            if (uglyflg)
                *dst++ = chr;
        }
    }

    *dst = 0;
}

int
main(int argc,char **argv)
{
    char *z;

    --argc;
    ++argv;

    z = *argv;
    if (z == NULL) {
        printf("no argument provided\n");
        exit(1);
    }

    printf("PRINT : %s\n",z); 

    sanitize(z,0);
    printf("Sanitized string: %s\n",z); 

    return 0;
}

Answer 1
2 2015-10-11 23:44:14
