
Why does ReadLn mis-interpret UTF8 text when non-unicode page is Korean (949)?

In Delphi XE2, using the AssignFile and ReadLn() routines, I can only read and display Unicode characters (from a UTF8 encoded file) correctly when the system locale is English.

Where it fails

If I set the system locale for non-Unicode applications to Korean (codepage 949, I think) and repeat the same read, some of my UTF8 multi-byte sequences get replaced with $3F. This only happens with ReadLn, not with TFile.ReadAllText(aFilename, TEncoding.UTF8) or TFileStream.Read().

The test

1. I create a text file, UTF8 without BOM (Notepad++), containing the following characters (hex equivalent shown on the second line):

       테스트
       ed 85 8c ec 8a a4 ed 8a b8

2. Write a Delphi XE2 Windows form application with a TMemo control:

       procedure TForm1.ReadFile(aFilename: string);
       var
         gFile     : TextFile;
         gLine     : RawByteString;
         gWideLine : string;
       begin
         AssignFile(gFile, aFilename);
         // Open before the try block so CloseFile only runs on an opened file
         Reset(gFile);
         try
           Memo1.Clear;
           while not EOF(gFile) do
           begin
             ReadLn(gFile, gLine);                   // expected: the raw bytes of one line
             gWideLine := UTF8ToWideString(gLine);   // decode UTF8 to a Unicode string
             Memo1.Lines.Add(gWideLine);
           end;
         finally
           CloseFile(gFile);
         end;
       end;

3. I inspect the contents of gLine before performing the UTF8ToWideString conversion, and under English/US locale Windows it is:

       $ED $85 $8C $EC $8A $A4 $ED $8A $B8

As an aside, if I read the same file with a BOM I get the correct 3-byte preamble, and the output after the UTF8 decode is the same. All OK so far!

  1. Switch Windows 7 (x64) to use Korean as the codepage for applications without Unicode support (Region and Language --> Administrative tab --> Change system locale --> Korean (Korea)). Restart the computer.

  2. Read the same file (UTF8 without BOM) with the above application; gLine now has the hex value:

    $3F $8C $EC $8A $A4 $3F $3F

    Output in TMemo: ? 스??

  3. Hypothesis: ReadLn() (and Read(), for that matter) is attempting to map the UTF8 sequences as Korean multibyte sequences (i.e. it tries to interpret $ED $85, can't, and so substitutes a question mark, $3F).

  4. Use TFileStream to read in exactly the expected number of bytes (9 without BOM); the hex in memory is now exactly:

    $ED $85 $8C $EC $8A $A4 $ED $8A $B8

    Output in TMemo: 테스트 (perfect!)
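
For reference, the TFileStream read in step 4 could look roughly like the sketch below. The function name ReadUtf8File is made up for illustration and is not part of the original test application. It loads the whole file into memory and does not strip a BOM, so it suits small test files rather than the large legacy files mentioned further down.

```delphi
// Needs in the unit's uses clause: System.SysUtils, System.Classes

// Hypothetical helper: reads every byte of the file untouched and
// decodes the bytes as UTF8 in a single explicit step.
function ReadUtf8File(const aFilename: string): string;
var
  lStream : TFileStream;
  lBytes  : TBytes;
begin
  Result := '';
  lStream := TFileStream.Create(aFilename, fmOpenRead or fmShareDenyWrite);
  try
    // No text-file layer is involved here, so the system locale
    // plays no part in what ends up in lBytes.
    SetLength(lBytes, lStream.Size);
    if Length(lBytes) > 0 then
      lStream.ReadBuffer(lBytes[0], Length(lBytes));
  finally
    lStream.Free;
  end;
  // Decode only when there is something to decode
  if Length(lBytes) > 0 then
    Result := TEncoding.UTF8.GetString(lBytes);
end;
```

Calling `Memo1.Lines.Text := ReadUtf8File(aFilename);` would then display the decoded text.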

Problem: Laziness. I have a lot of legacy routines that parse potentially large files line by line, and I wanted to be sure I didn't need to write a routine that manually reads up to each new line for every one of these files.

Question(s):

  1. Why is Read() not returning the exact byte string as found in the file? Is it because I'm using a TextFile type, so Delphi is doing a degree of interpretation using the non-Unicode codepage?

  2. Is there a built-in way to read a UTF8 encoded file line by line?

Update:

Just came across Rob Kennedy's solution to this post, which reintroduces me to TStreamReader and answers the question about gracefully reading UTF8 files line by line.

Is there a built-in way to read a UTF8 encoded file line by line?

Use TStreamReader. It has a ReadLine() method.

    procedure TForm1.ReadFile(aFilename:string);
    var
      gFile     : TStreamReader;
      gLine     : string;
    begin
      Memo1.Clear;
      gFile := TStreamReader.Create(aFilename, TEncoding.UTF8, True);
      try
        while not gFile.EndOfStream do
        begin
          gLine := gFile.ReadLine;
          Memo1.Lines.Add(gLine);
        end;
      finally
        gFile.Free;
      end;
    end;

With that said, this particular example can be greatly simplified:

    procedure TForm1.ReadFile(aFilename:string);
    begin
      Memo1.Lines.LoadFromFile(aFilename, TEncoding.UTF8);
    end;
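
If the legacy ReadLn loops have to stay as they are, another option may be to tell the text file which code page it contains. The sketch below assumes the optional CodePage argument of AssignFile (documented for XE2) and that CP_UTF8 is in scope, for example via Winapi.Windows. Treat it as an untested outline, not a verified fix. BOM handling is not covered.

```delphi
// Assumption: AssignFile accepts an optional CodePage argument in this
// Delphi version, and CP_UTF8 is available (e.g. from Winapi.Windows).
procedure TForm1.ReadFile(aFilename: string);
var
  gFile : TextFile;
  gLine : string;   // Unicode string, so no separate UTF8 decode step
begin
  Memo1.Clear;
  // Declare the file's content as UTF8 instead of the system default
  // non-Unicode code page (949 on the Korean-locale machine).
  AssignFile(gFile, aFilename, CP_UTF8);
  Reset(gFile);
  try
    while not Eof(gFile) do
    begin
      ReadLn(gFile, gLine);
      Memo1.Lines.Add(gLine);
    end;
  finally
    CloseFile(gFile);
  end;
end;
```

The appeal of this variant is that existing line-by-line parsing code keeps its shape: only the AssignFile call and the type of the line variable change.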
