HDF5: How to decode UTF8-encoded string from h5dump output?
I'm writing an attribute to an HDF5 file using UTF-8 encoding. As an example, I've written "äöüß" to the attribute "notes" in the file.
I'm now trying to parse the output of h5ls (or h5dump) to extract this data back. Either tool gives me output like this:
ATTRIBUTE "notes" {
   DATATYPE H5T_STRING {
      STRSIZE 8;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_UTF8;
      CTYPE H5T_C_S1;
   }
   DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
   DATA {
   (0): "\37777777703\37777777644\37777777703\37777777666\37777777703\37777777674\37777777703\37777777637"
   }
}
I'm aware that, e.g., \37777777703\37777777644 somehow encodes ä as 0xC3 0xA4. However, I have a really hard time working out how this encoding works.
What's the magic formula behind this, and how can I properly decode it back into äöüß?
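The bytes are printed as base-8 (octal) escapes. Each escape looks like the byte sign-extended to 32 bits: 37777777703 in octal is 0xFFFFFFC3, so masking with 0xFF (or taking the last three octal digits, 703, and masking those) gives 0xC3. I decoded them in a PHP backend like this: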
// Single quotes keep the backslashes literal; in double quotes PHP would
// read "\377" as an octal escape and corrupt the sample.
$line = 'This is the text including some UTF-8 bytes \37777777703\37777777644\37777777703\37777777666\37777777703\37777777674\37777777703\37777777637';
// Extract the last three octal digits of every sign-extended byte
$octbytes = [];
preg_match_all("/\\\\37777777(\\d{3})/", $line, $octbytes);
// Walk the extracted bytes one UTF-8 character at a time
for ($m = 0; $m < count($octbytes[1]); ) {
    $B = octdec($octbytes[1][$m]) & 255;
    // The lead byte tells whether the character spans 1 to 4 bytes
    if (($B & 0xF8) == 0xF0) { $numBytes = 4; }
    else if (($B & 0xF0) == 0xE0) { $numBytes = 3; }
    else if (($B & 0xE0) == 0xC0) { $numBytes = 2; }
    else { $numBytes = 1; }
    // Do not read past the last match if the sequence is truncated
    $numBytes = min($numBytes, count($octbytes[1]) - $m);
    $hxstr = "";
    $replaceStr = "";
    for ($j = 0; $j < $numBytes; $j++) {
        $match = $octbytes[1][$m + $j];
        $dec = octdec($match) & 255;
        // Zero-pad so every byte contributes exactly two hex digits
        $hxstr .= sprintf("%02X", $dec);
        $replaceStr .= "\\37777777" . $match;
    }
    // Pack the hex digits into the raw UTF-8 byte sequence
    $utfChar = pack("H*", $hxstr);
    // Replace the escapes in the line itself, so earlier replacements are kept
    $line = str_replace($replaceStr, $utfChar, $line);
    // Advance to the next character
    $m += $numBytes;
}
echo "The parsed line: $line";
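The same idea can probably be written more compactly with preg_replace_callback, decoding each escape into one raw byte and letting the resulting byte sequence form the UTF-8 characters on its own. This is only a minimal sketch: the function name decodeH5dumpEscapes is hypothetical (it is not part of HDF5 or h5dump), and it assumes every non-ASCII byte appears in the sign-extended form \37777777ooo seen in the dump above.

```php
<?php
// Hypothetical helper, not an HDF5 API: turns each sign-extended octal
// escape of the form \37777777ooo back into the single byte it stands for.
function decodeH5dumpEscapes(string $text): string
{
    return preg_replace_callback(
        // A literal backslash, the 37777777 sign-extension prefix,
        // then the three octal digits that hold the byte value.
        '/\\\\37777777([0-7]{3})/',
        function (array $m): string {
            // Keep only the low 8 bits, e.g. 703 (octal) & 0xFF = 0xC3.
            return chr(octdec($m[1]) & 0xFF);
        },
        $text
    );
}

// Single quotes keep the backslashes literal, as in the h5dump output.
$raw = '\37777777703\37777777644\37777777703\37777777666'
     . '\37777777703\37777777674\37777777703\37777777637';

echo decodeH5dumpEscapes($raw), "\n";          // äöüß on a UTF-8 terminal
echo bin2hex(decodeH5dumpEscapes($raw)), "\n"; // c3a4c3b6c3bcc39f
```

Because each escape maps to exactly one byte, there is no need to work out how many bytes a character spans. Text without such escapes passes through unchanged.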