
HTML source code from TWebBrowser - How to detect Stream encoding?

Based on this question: How can I get HTML source code from TWebBrowser

If I run this code with an HTML page that uses a Unicode code page, the result is gibberish because TStringStream is not Unicode in D7. The page might be UTF-8 encoded or encoded in some other (ANSI) code page.

How can I detect whether a TStream/IPersistStreamInit is Unicode/UTF-8/ANSI?

How do I always return the correct result as a WideString from this function?

function GetWebBrowserHTML(const WebBrowser: TWebBrowser): WideString;

If I replace TStringStream with TMemoryStream and save the TMemoryStream to a file, everything is fine. The file can be Unicode, UTF-8, or ANSI. But I always want to return the stream contents as a WideString:

function GetWebBrowserHTML(const WebBrowser: TWebBrowser): WideString;
var
  // LStream: TStringStream;
  LStream: TMemoryStream;
  Stream : IStream;
  LPersistStreamInit : IPersistStreamInit;
begin
  if not Assigned(WebBrowser.Document) then exit;
  // LStream := TStringStream.Create('');
  LStream := TMemoryStream.Create;
  try
    LPersistStreamInit := WebBrowser.Document as IPersistStreamInit;
    Stream := TStreamAdapter.Create(LStream,soReference);
    LPersistStreamInit.Save(Stream,true);
    // result := LStream.DataString;
    LStream.SaveToFile('c:\test\test.txt'); // test only - file is ok
    Result := ??? // WideString
  finally
    LStream.Free();
  end;
end;

EDIT: I found this article - How to load and save documents in TWebBrowser in a Delphi-like way

It does exactly what I need, but it works correctly only with Delphi Unicode compilers (D2009+). Read the Conclusion section:

There is obviously a lot more we could do. A couple of things immediately spring to mind. We retro-fit some of the Unicode functionality and support for non-ANSI encodings to the pre-Unicode compiler code. The present code when compiled with anything earlier than Delphi 2009 will not save document content to strings correctly if the document character set is not ANSI.

The magic is obviously in the TEncoding class (TEncoding.GetBufferEncoding), but D7 does not have TEncoding. Any ideas?
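For compilers without TEncoding, one partial substitute is to check the saved bytes for a byte-order mark by hand. The unit below is only a sketch: the unit name, the function name `DetectBOMCodePage`, and the local constants are hypothetical, not part of the VCL or of the article mentioned above. HTML pages often carry no BOM at all, so a result of 0 means "unknown" and the encoding still has to come from somewhere else, such as the document's charset.

```pascal
unit BomDetect; // hypothetical unit name

interface

uses
  Classes;

// Returns the Windows code page implied by a byte-order mark at the start
// of Stream, or 0 when no BOM is found. The stream position is restored.
function DetectBOMCodePage(Stream: TStream): Word;

implementation

const
  // Local names chosen for this sketch; the values are the standard
  // Windows code page identifiers.
  BOM_CP_UTF16LE = 1200;
  BOM_CP_UTF16BE = 1201;
  BOM_CP_UTF8    = 65001;

function DetectBOMCodePage(Stream: TStream): Word;
var
  Buf: array[0..2] of Byte;
  SavedPos: Int64;
  Count: Integer;
begin
  Result := 0;
  SavedPos := Stream.Position;
  try
    Stream.Position := 0;
    Count := Stream.Read(Buf, SizeOf(Buf));
    // UTF-8 BOM: EF BB BF
    if (Count >= 3) and (Buf[0] = $EF) and (Buf[1] = $BB) and (Buf[2] = $BF) then
      Result := BOM_CP_UTF8
    // UTF-16 little-endian BOM: FF FE
    else if (Count >= 2) and (Buf[0] = $FF) and (Buf[1] = $FE) then
      Result := BOM_CP_UTF16LE
    // UTF-16 big-endian BOM: FE FF
    else if (Count >= 2) and (Buf[0] = $FE) and (Buf[1] = $FF) then
      Result := BOM_CP_UTF16BE;
  finally
    Stream.Position := SavedPos;
  end;
end;

end.
```

With the TMemoryStream from the question, the call would be `DetectBOMCodePage(LStream)` after `LPersistStreamInit.Save` has filled the stream.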

I used GpTextStream to handle the conversion (it should work with all Delphi versions):

function GetCodePageFromHTMLCharSet(Charset: WideString): Word;
const
  WIN_CHARSET = 'windows-';
  ISO_CHARSET = 'iso-';
var
  S: string;
begin
  Result := 0;
  if Charset = 'unicode' then
    Result := CP_UNICODE else
  if Charset = 'utf-8' then
    Result := CP_UTF8 else
  if Pos(WIN_CHARSET, Charset) <> 0 then
  begin
    S := Copy(Charset, Length(WIN_CHARSET) + 1, Maxint);
    Result := StrToIntDef(S, 0);
  end else
  if Pos(ISO_CHARSET, Charset) <> 0 then // ISO-8859 (e.g. iso-8859-1: => 28591)
  begin
    S := Copy(Charset, Length(ISO_CHARSET) + 1, Maxint);
    S := Copy(S, Pos('-', S) + 1, 2);
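    if S = '13' then // ISO-8859-13 (Baltic)
      Result := 28603
    else if S = '15' then // ISO-8859-15 (Latin 9)
      Result := 28605
    else if Length(S) = 1 then // ISO-8859-1..9 => 28591..28599
      Result := StrToIntDef('2859' + S, 0);
    // Other two-digit parts (e.g. iso-8859-10) have no 2859x code page
    // and would overflow Word, so Result stays 0.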
  end;
end;

function GetWebBrowserHTML(WebBrowser: TWebBrowser): WideString;
var
  LStream: TMemoryStream;
  Stream: IStream;
  LPersistStreamInit: IPersistStreamInit;
  TextStream: TGpTextStream;
  Charset: WideString;
  Buf: WideString;
  CodePage: Word;
  N: Integer;
begin
  Result := ''; 
  if not Assigned(WebBrowser.Document) then Exit;
  LStream := TMemoryStream.Create;
  try
    LPersistStreamInit := WebBrowser.Document as IPersistStreamInit;
    Stream := TStreamAdapter.Create(LStream, soReference);
    if Failed(LPersistStreamInit.Save(Stream, True)) then Exit;
    Charset := (WebBrowser.Document as IHTMLDocument2).charset;
    CodePage := GetCodePageFromHTMLCharSet(Charset);
    N := LStream.Size;
    SetLength(Buf, N);
    TextStream := TGpTextStream.Create(LStream, tsaccRead, [], CodePage);
    try
      N := TextStream.Read(Buf[1], N * SizeOf(WideChar)) div SizeOf(WideChar);
      SetLength(Buf, N);
      Result := Buf;
    finally
      TextStream.Free;
    end;
  finally
    LStream.Free();
  end;
end;
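If pulling in GpTextStream is not an option, the decoding step can also be sketched with the Win32 `MultiByteToWideChar` function, which is declared in the Windows unit. This is an alternative under stated assumptions, not the method used above: the unit name and the `DecodeBytes` helper are hypothetical, the helper does not handle UTF-16 input (code page 1200 is not accepted by `MultiByteToWideChar`), and it does not strip a leading BOM, so a UTF-8 BOM will come through as U+FEFF.

```pascal
unit BytesToWide; // hypothetical unit name

interface

uses
  Windows;

// Decodes Size bytes at Data, encoded in CodePage, into a WideString.
// Returns '' for empty input or when the conversion fails.
function DecodeBytes(Data: PAnsiChar; Size: Integer; CodePage: UINT): WideString;

implementation

function DecodeBytes(Data: PAnsiChar; Size: Integer; CodePage: UINT): WideString;
var
  Len: Integer;
begin
  Result := '';
  if (Data = nil) or (Size <= 0) then Exit;
  // First call: ask how many UTF-16 code units the output needs.
  Len := MultiByteToWideChar(CodePage, 0, Data, Size, nil, 0);
  if Len <= 0 then Exit;
  SetLength(Result, Len);
  // Second call: convert into the allocated WideString buffer.
  if MultiByteToWideChar(CodePage, 0, Data, Size, PWideChar(Result), Len) = 0 then
    Result := '';
end;

end.
```

Inside GetWebBrowserHTML this would be called as `Result := DecodeBytes(LStream.Memory, LStream.Size, CodePage);` for the non-UTF-16 cases. A CodePage of 0 is CP_ACP, so an unrecognised charset falls back to the system ANSI code page.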
