Return proper umlaut character from pointer to char?
I am trying to get proper character descriptions out of a legacy FAME database file. Basically this works, but umlauts and similar characters are not printed correctly. The relevant code is the following C function from the R package FAME, so this is a C question rather than an R question.
void fameWhat(int *status, int *dbkey, char **objnam, int *class,
              int *type, int *freq, int *basis, int *observ,
              int *fyear, int *fprd, int *lyear, int *lprd,
              int *obs, int *range,
              int *getdoc, char **desPtr, char **docPtr){
  /* Get info about an object. Note that range should be an int[3] on input */
  int cyear, cmonth, cday, myear, mmonth, mday;
  int i;
  char fdes[256], fdoc[256];

  if(*getdoc){
    if(strlen(*desPtr) < 256 || strlen(*docPtr) < 256){
      *status = HBNCHR;
      return;
    }
    for(i = 0; i < 255; ++i) fdes[i] = fdoc[i] = ' ';
  }
  fdes[255] = fdoc[255] = '\0';
  cfmwhat(status, *dbkey, *objnam, class, type, freq, basis, observ,
          fyear, fprd, lyear, lprd, &cyear, &cmonth, &cday, &myear,
          &mmonth, &mday, fdes, fdoc);
  if(*getdoc){
    strncpy(*desPtr, fdes, 256);
    strncpy(*docPtr, fdoc, 256);
  }
  if(*status == 0 && *class == HSERIE)
    cfmsrng(status, *freq, fyear, fprd, lyear, lprd, range, obs);
  return;
}
I suspect that because desPtr, the pointer to pointer that points to the description, is of type char, I do not get proper umlauts when calling this function from R and displaying the result in an R console. I have a hunch that FAME is Latin-1 encoded. R is UTF-8. For ä I get \\U3e34653c, for example.
So is there a way to get this done in C already and pass proper values to R, or should I rather search and replace within R?
Note: I have seen the thread "Using Unicode in C++ source code" and also "How to use utf8 character arrays in c++?".
It seems you have multiple stacked layers of encoding/decoding. How did you 'get' such a long Unicode value for a single character in the first place?
The raw hex-to-ASCII translation of that long code is either >4e< or <e4> (depending on endianness). The latter, read as a hex value in angle brackets, is the ä you were expecting: U+00E4 (http://www.fileformat.info/info/unicode/char/00E4/index.htm), and E4 is also the valid Latin-1 code for that character.
Converting from this coded format to UTF-8 is relatively simple, although I am not sure where this code should be pasted into the existing routine. As a sample standalone program:
#include <stdio.h>
#include <stdlib.h>

int main (void)
{
    char input[] = "a sm<F6>rg<E5>sbord of <code>";
    char *sourceptr, *destptr, *endptr;
    int latin1;

    sourceptr = input;
    destptr = input;
    while (*sourceptr)
    {
        if (*sourceptr == '<')
        {
            latin1 = strtol (sourceptr+1, &endptr, 16);
            if (endptr && *endptr == '>' && latin1 > 127 && latin1 <= 255)
            {
                /* printf ("we saw hex code %xh\n", latin1); */
                /* Quick-and-dirty converting to UTF8: */
                *destptr = (char)(0xc0 | ((latin1 & 0xc0) >> 6));
                destptr++;
                *destptr = (char)(0x80 | (latin1 & 0x3f));
                destptr++;
                sourceptr = endptr+1;
                continue;
            }
        }
        *destptr = *sourceptr;
        sourceptr++;
        destptr++;
    }
    *destptr = 0;
    printf ("output: %s\n", input);
    return 0;
}
This scans the input string for < followed by a valid hex code (assumed to be Latin-1 and therefore restricted to 80..FF) and a closing >. When such a sequence is found, it is replaced by the same character in UTF-8 format. Unrecognized sequences are copied as-is.