HDF5: How do I decode UTF-8 encoded strings from h5dump output?

Question

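I am writing attributes to an HDF5 file using UTF-8 encoding. For example, I write "äöüß" to the attribute "notes" in the file.

I am now trying to parse the output of h5ls (or h5dump) to get this data back. Both tools give me output like this: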

ATTRIBUTE "notes" {
      DATATYPE  H5T_STRING {
         STRSIZE 8;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      DATA {
      (0): "\37777777703\37777777644\37777777703\37777777666\37777777703\37777777674\37777777703\37777777637"
      }
   }

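I understand that, for example, \37777777703\37777777644 somehow encodes ä as 0xC3 0xA4, but I am having a hard time working out how this encoding works.

What is the magic formula behind this, and how can I correctly decode it back to äöüß?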

Answer 1

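These strings are encoded in octal. Each escape holds one byte written as an 11-digit octal number that appears to be sign-extended to 32 bits, so only the low 8 bits matter: \37777777703 is 0xFFFFFFC3, whose low byte is 0xC3. I decoded them in a PHP backend with the following approach: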

// Single quotes keep the backslashes literal; inside double quotes PHP
// would itself interpret "\377" as an octal escape.
$line = 'This is the text including some UTF-8 bytes \37777777703\37777777644\37777777703\37777777666\37777777703\37777777674\37777777703\37777777637';

// Extract the last three octal digits of every escaped byte
$octbytes = [];
preg_match_all('/\\\\37777777(\d{3})/', $line, $octbytes);
$total = count($octbytes[1]);

// Parse the extracted bytes
for ($m = 0; $m < $total; ) {
    // Keep only the low 8 bits: the actual byte value
    $B = octdec($octbytes[1][$m]) & 0xFF;

    // A UTF-8 character spans 1 to 4 bytes; the lead byte tells how many
    if (($B & 0xF8) == 0xF0) { $numBytes = 4; }
    else if (($B & 0xF0) == 0xE0) { $numBytes = 3; }
    else if (($B & 0xE0) == 0xC0) { $numBytes = 2; }
    else { $numBytes = 1; }
    // Do not read past the end if the sequence is truncated
    $numBytes = min($numBytes, $total - $m);

    $hxstr = "";
    $replaceStr = "";
    for ($j = 0; $j < $numBytes; $j++) {
        $match = $octbytes[1][$m + $j];
        $dec = octdec($match) & 255;
        // Always emit two hex digits per byte
        $hxstr .= sprintf("%02X", $dec);
        $replaceStr .= "\\37777777" . $match;
    }

    // Pack the collected hex digits into raw bytes
    $utfChar = pack("H*", $hxstr);

    // Replace the escaped bytes in the line with the decoded character
    // (write back to $line so earlier replacements are not lost)
    $line = str_replace($replaceStr, $utfChar, $line);

    // Move on to the next character
    $m += $numBytes;
}
echo "The parsed line: $line";
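
Because the low 8 bits of each escape are the original byte, the character-by-character grouping is not strictly needed: converting every escape to its byte in place already yields valid UTF-8. The following is a minimal sketch of that idea, not a definitive implementation. The function name decodeH5dumpEscapes is hypothetical (it is not part of HDF5 or PHP), and the sketch assumes the escapes always have the 11-digit form shown in the question.

```php
<?php
// Hypothetical helper (not part of HDF5 or PHP): turns the 11-digit
// octal escapes from the h5dump output back into raw bytes.
function decodeH5dumpEscapes(string $text): string
{
    $result = preg_replace_callback(
        // A literal backslash, the prefix 37777777, then three octal digits.
        '/\\\\37777777([0-7]{3})/',
        function (array $m): string {
            // Keep only the low 8 bits, e.g. 0703 & 0xFF = 0xC3.
            return chr(octdec($m[1]) & 0xFF);
        },
        $text
    );
    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $text;
}

// Single quotes keep the backslashes literal.
$raw = '\37777777703\37777777644\37777777703\37777777666'
     . '\37777777703\37777777674\37777777703\37777777637';
$decoded = decodeH5dumpEscapes($raw);
echo $decoded, "\n"; // prints äöüß on a UTF-8 terminal
```

Text without escapes passes through unchanged, so the function can be applied to whole lines of the dump. Shorter escapes for other non-printable bytes are an untested case and are left alone by this sketch.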

Solution 1
0 2023-01-20 12:54:16
