TStringList behavior with non-ANSI files

In my application, when I want to import a file, I use TStringList.

But when someone exports data from Excel, the file is encoded as UCS-2 Little Endian, and TStringList can't read the data.

Is there any way to detect this situation, identify the text encoding, and warn the user that the supplied text is not compatible?

Just to be clear, the user will provide only plain text (letters and numbers); for anything else, I must show the warning.

A Unicode file without a BOM is fine. (TStringList can read it!)
An ANSI file is fine too. (TStringList can read it!)
Even Unicode with a BOM would be fine if there were a way to remove it. (TStringList can read it, but with "i", ">>" and "reversed ?" characters (ï»¿) that come from the BOM bytes.)

I used the following function in Delphi 6 to detect Unicode BOMs.

//requires Classes (TFileStream) and SysUtils (CompareMem, fmOpenRead) in the uses clause
const
  //standard byte order marks (BOMs)
  UTF8BOM:              array [0..2] of AnsiChar = #$EF#$BB#$BF;
  UTF16LittleEndianBOM: array [0..1] of AnsiChar = #$FF#$FE;
  UTF16BigEndianBOM:    array [0..1] of AnsiChar = #$FE#$FF;
  UTF32LittleEndianBOM: array [0..3] of AnsiChar = #$FF#$FE#$00#$00;
  UTF32BigEndianBOM:    array [0..3] of AnsiChar = #$00#$00#$FE#$FF;

function FileHasUnicodeBOM(const FileName: string): Boolean;
var
  Buffer: array [0..3] of AnsiChar;
  Stream: TFileStream;
begin
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite); // Allow other programs read access at the same time.
  try
    FillChar(Buffer, SizeOf(Buffer), $AA);//fill with characters that we are not expecting then...
    Stream.Read(Buffer, SizeOf(Buffer));  //...read up to SizeOf(Buffer) bytes - there may not be enough
    //use Read rather than ReadBuffer so that no exception is raised if we can't fill Buffer
  finally
    Stream.Free;
  end;
  Result := CompareMem(@UTF8BOM,              @Buffer, SizeOf(UTF8BOM))              or
            CompareMem(@UTF16LittleEndianBOM, @Buffer, SizeOf(UTF16LittleEndianBOM)) or
            CompareMem(@UTF16BigEndianBOM,    @Buffer, SizeOf(UTF16BigEndianBOM))    or
            CompareMem(@UTF32LittleEndianBOM, @Buffer, SizeOf(UTF32LittleEndianBOM)) or
            CompareMem(@UTF32BigEndianBOM,    @Buffer, SizeOf(UTF32BigEndianBOM));
end;

This will detect all the standard BOMs. You could use it to block such files if that's the behaviour you want.
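
As a rough illustration of the "block such files" route, the check can run before the file is handed to TStringList. This is only a sketch: `TryLoadPlainText` is a hypothetical name, the warning text is a placeholder, and it assumes `FileHasUnicodeBOM` from above is in scope.

```pascal
// Requires Classes, SysUtils and Dialogs in the uses clause.
// Hypothetical helper (not part of the VCL): loads FileName into Lines
// only when FileHasUnicodeBOM reports no BOM; otherwise warns the user.
function TryLoadPlainText(const FileName: string; Lines: TStrings): Boolean;
begin
  Result := not FileHasUnicodeBOM(FileName);
  if Result then
    Lines.LoadFromFile(FileName)
  else
    MessageDlg('This file is Unicode-encoded. Please save it as ANSI text and try again.',
      mtWarning, [mbOK], 0);
end;
```

A caller would create a TStringList, pass it as `Lines`, and continue with the import only when the function returns True.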

You state that the Delphi 6 TStringList can load 16-bit encoded files if they do not have a BOM. Whilst that may be the case, you will find that, for characters in the ASCII range, every other character is #0, which I guess is not what you want.
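
Since plain ANSI text made of letters and numbers should never contain a #0 byte, scanning the raw bytes for #0 is one cheap way to spot such BOM-less 16-bit files. The sketch below is a heuristic, not a real encoding detector: `FileContainsNulBytes` is a hypothetical name, it only inspects the first 4096 bytes, and 16-bit text made up entirely of characters outside the ASCII range may contain no #0 bytes at all.

```pascal
// Requires Classes and SysUtils in the uses clause.
// Hypothetical helper: returns True if any of the first 4096 bytes is #0.
function FileContainsNulBytes(const FileName: string): Boolean;
var
  Buffer: array [0..4095] of AnsiChar;
  Stream: TFileStream;
  BytesRead, I: Integer;
begin
  Result := False;
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    // Read returns the number of bytes actually read, which may be less than the buffer size.
    BytesRead := Stream.Read(Buffer, SizeOf(Buffer));
    for I := 0 to BytesRead - 1 do
      if Buffer[I] = #0 then
      begin
        Result := True;
        Break;
      end;
  finally
    Stream.Free;
  end;
end;
```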

If you want to detect that text is Unicode for files without BOMs, then you could use IsTextUnicode. However, it may give false positives. This is a situation where I suspect it is better to ask for forgiveness than permission.
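
A minimal sketch of calling IsTextUnicode on the start of a file might look like the following. It assumes the Windows unit in your Delphi version declares `IsTextUnicode(lpBuffer: Pointer; cb: Integer; lpi: PINT): BOOL`; `FileLooksLikeUnicode` is a hypothetical name, and the 4096-byte sample size is an arbitrary choice.

```pascal
// Requires Windows, Classes and SysUtils in the uses clause.
// Hypothetical helper: asks the Windows API whether the first bytes of
// the file look like 16-bit Unicode text. The result is a guess.
function FileLooksLikeUnicode(const FileName: string): Boolean;
var
  Buffer: array [0..4095] of Byte;
  Stream: TFileStream;
  BytesRead: Integer;
begin
  Result := False;
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    BytesRead := Stream.Read(Buffer, SizeOf(Buffer));
  finally
    Stream.Free;
  end;
  // Passing nil as the third argument asks the API to apply all of its tests.
  if BytesRead > 0 then
    Result := IsTextUnicode(@Buffer, BytesRead, nil);
end;
```

Because the API works statistically, short or unusual inputs can be misclassified, so a True result is better used to trigger a warning than to reject a file outright.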

Now, if I were you I would not actually try to block Unicode files. I would read them. Use the TNT Unicode library. The class you want is called TWideStringList.
