简体   繁体   中英

Get the rendered text from HTML (Delphi)

I have some HTML and I need to extract the actual written text from the page.

So far I have tried using a web browser and rendering the page, then going to the document property and grabbing the text. This works, but only where the browser is supported (IE com object). The problem is I want this to be able to run under wine also, so I need a solution that doesn't use IE COM.

There must be a programatic way to do this that is reasonable.

I'm not sure what the recommended way of parsing HTML in Delphi is, but if it were me, I'd be tempted to just bundle a copy of html2text (either the older C++ program by that name or the newer Python program ) and spawn a call to one of those.

You can turn the Python html2text into an executable using py2exe . Both html2text programs are licensed under the GPL, but as long as you merely bundle their executable with your app and make their source available according to the GPL's restrictions, then you ought to be okay.

Instead of using a TWebBrowser, you can directly use a TIdHttp and its Get method.
You get the html string back.

Here's a nice simple routine, copied from Scalabium :

function StripHTMLTags(const strHTML: string): string;
var
  P: PChar;
  InTag: Boolean;
  i, intResultLength: Integer;
begin
  P := PChar(strHTML);
  Result := '';

  InTag := False;
  repeat
    case P^ of
      '<': InTag := True;
      '>': InTag := False;
      #13, #10: ; {do nothing}
      else
        if not InTag then
        begin
          if (P^ in [#9, #32]) and ((P+1)^ in [#10, #13, #32, #9, '<']) then
          else
            Result := Result + P^;
        end;
    end;
    Inc(P);
  until (P^ = #0);

  {convert system characters}
  Result := StringReplace(Result, '&quot;', '"',  [rfReplaceAll]);
  Result := StringReplace(Result, '&apos;', '''', [rfReplaceAll]);
  Result := StringReplace(Result, '&gt;',   '>',  [rfReplaceAll]);
  Result := StringReplace(Result, '&lt;',   '<',  [rfReplaceAll]);
  Result := StringReplace(Result, '&amp;',  '&',  [rfReplaceAll]);
  {here you may add another symbols from RFC if you need}
end;

You can then easily modify this to do exactly what you want.

Slight enhancement of the previous function for creating multiline text

function StripHTMLTags(strHTML: string): string;
const crlf='&crlf;';
var
  P: PChar;
  InTag: Boolean;
  i, intResultLength: Integer;

begin
  strHTML:=StringReplace(strHTML, '<br/>',crlf,[rfReplaceAll, rfIgnoreCase]);
  strHTML:=StringReplace(strHTML, '</div>',crlf,[rfReplaceAll, rfIgnoreCase]);
  P := PChar(strHTML);
  Result := '';

  InTag := False;
  repeat
    case P^ of
      '<': InTag := True;
      '>': InTag := False;
      #13, #10: ; {do nothing}
      else
        if not InTag then
        begin
          if (P^ in [#9, #32]) and ((P+1)^ in [#10, #13, #32, #9, '<']) then
          else
            Result := Result + P^;
        end;
    end;
    Inc(P);
  until (P^ = #0);

  {convert system characters}
  Result := StringReplace(Result, '&quot;', '"',  [rfReplaceAll]);
  Result := StringReplace(Result, '&apos;', '''', [rfReplaceAll]);
  Result := StringReplace(Result, '&gt;',   '>',  [rfReplaceAll]);
  Result := StringReplace(Result, '&lt;',   '<',  [rfReplaceAll]);
  Result := StringReplace(Result, '&amp;',  '&',  [rfReplaceAll]);
  Result := StringReplace(Result, crlf,  #13#10,  [rfReplaceAll]);

end;

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM