Get visible text on the page from IHTMLDocument2*

Question

I am trying to get the text content of an Internet Explorer web browser window.

I am following these steps:

1. Obtain a pointer to IHTMLDocument2.
2. From the IHTMLDocument2, obtain the body as an IHTMLElement.
~~3. On the body, call get_innerText.~~

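Edit

3. I obtain all the children of the body and make a recursive call on each IHTMLElement.
4. If I encounter an element that is not visible, or an element whose tag is script, I ignore that element and all of its children.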

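My problems are:

1. Along with the text that is visible on the page, I also get content from elements with style="display: none".
2. For google.com, I also get JavaScript along with the text.

I tried a recursive approach, but I have no idea how to handle a scenario like this: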

<div>
Hello World 1
<div style="display: none">Hello world 2</div>
</div>

In this scenario I am not able to get "Hello World 1".

Can anyone help me find the best way to get the text from an IHTMLDocument2*? I am using C++ Win32, with no MFC or ATL.

Thanks, Ashish.

Answer 1

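If you iterate backwards over the document.body.all elements, you will always walk the elements from the inside out, so you don't need to do the recursion yourself; the DOM does that for you. For example (the code is in Delphi):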

procedure Test();
var
  document, el: OleVariant;
  i: Integer;
begin
  document := CreateComObject(CLASS_HTMLDocument) as IDispatch;
  document.open;
  document.write('<div>Hello World 1<div style="display: none">Hello world 2<div>This DIV is also invisible</div></div></div>');
  document.close;
  for i := document.body.all.length - 1 downto 0 do // iterate backwards
  begin
    el := document.body.all.item(i);
    // filter the elements
    if (el.style.display = 'none') then
    begin
      el.removeNode(true);
    end;
  end;
  ShowMessage(document.body.innerText);
end;
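
Since the question is about C++, here is a rough sketch of the same idea in plain C++. It is only a model: Node, Entry, collectAll, removeHidden, innerText and buildSample are hypothetical stand-ins I made up for illustration, not MSHTML types or calls, and the snapshot vector only imitates what body.all would give you. It is meant to show why the backward walk is safe: descendants come later in document order, so by the time an element is removed everything inside it has already been visited.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a DOM node (not an MSHTML type).
struct Node {
    bool isText = false;
    std::string text;     // used only by text nodes
    std::string display;  // inline style.display, used only by elements
    std::vector<std::unique_ptr<Node>> children;
};

std::unique_ptr<Node> makeText(std::string s) {
    auto n = std::make_unique<Node>();
    n->isText = true;
    n->text = std::move(s);
    return n;
}

std::unique_ptr<Node> makeElement(std::string display = "") {
    auto n = std::make_unique<Node>();
    n->display = std::move(display);
    return n;
}

// One entry of the document-order element list (imitates body.all).
struct Entry {
    Node* parent;
    Node* element;
};

void collectAll(Node& parent, std::vector<Entry>& out) {
    for (auto& child : parent.children) {
        if (!child->isText) {
            out.push_back({&parent, child.get()});
            collectAll(*child, out);
        }
    }
}

// Walk the element list backwards and drop every element whose own
// display is "none", together with its whole subtree.
void removeHidden(Node& body) {
    std::vector<Entry> all;
    collectAll(body, all);
    for (std::size_t i = all.size(); i-- > 0;) {
        Node* el = all[i].element;
        if (el->display != "none") continue;
        auto& siblings = all[i].parent->children;
        auto it = std::find_if(siblings.begin(), siblings.end(),
            [el](const std::unique_ptr<Node>& c) { return c.get() == el; });
        assert(it != siblings.end());
        siblings.erase(it);  // destroys el and everything below it
    }
}

// Concatenate the text nodes that are left, in document order.
std::string innerText(const Node& n) {
    if (n.isText) return n.text;
    std::string out;
    for (const auto& c : n.children) out += innerText(*c);
    return out;
}

// Same markup as the Delphi example above.
std::unique_ptr<Node> buildSample() {
    auto body = makeElement();
    auto outer = makeElement();
    outer->children.push_back(makeText("Hello World 1"));
    auto hidden = makeElement("none");
    hidden->children.push_back(makeText("Hello world 2"));
    auto nested = makeElement();
    nested->children.push_back(makeText("This DIV is also invisible"));
    hidden->children.push_back(std::move(nested));
    outer->children.push_back(std::move(hidden));
    body->children.push_back(std::move(outer));
    return body;
}
```

Calling removeHidden on the tree from buildSample and then innerText on the body leaves only the text of the outer DIV. In real MSHTML code the collection is live rather than a snapshot, so treat this as an illustration of the traversal order only.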

Side note: as for your scenario with the recursive approach:

<div>Hello World 1<div style="display: none">Hello world 2</div></div>

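If our element is the first DIV, for example, el.getAdjacentText('afterBegin') will return "Hello World 1". So we could probably iterate forward over the elements and collect getAdjacentText('afterBegin'), but that is a bit harder, because we would need to test el.currentStyle.display on each element's parents.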
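
To make the forward variant a little more concrete, here is a minimal sketch in plain C++. Again this is a model and not MSHTML code: Element, isHidden and visibleLeadingText are hypothetical names, leadingText stands in for whatever getAdjacentText('afterBegin') would return, and display stands in for the element's own currentStyle.display. The point is the parent test: a child of a hidden element can still report its own display as "block", so every ancestor has to be checked.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical flat model of the element list, in document order.
struct Element {
    int parent;               // index of the parent element, -1 for top level
    std::string display;      // this element's own display value
    std::string leadingText;  // text directly after the element's start tag
};

// An element is hidden if it or any of its ancestors has display "none".
bool isHidden(const std::vector<Element>& all, int index) {
    for (int i = index; i != -1; i = all[static_cast<std::size_t>(i)].parent) {
        if (all[static_cast<std::size_t>(i)].display == "none") return true;
    }
    return false;
}

// Iterate forward and collect the leading text of every visible element.
std::string visibleLeadingText(const std::vector<Element>& all) {
    std::string out;
    for (std::size_t i = 0; i < all.size(); ++i) {
        // Document order: a parent always comes before its children.
        assert(all[i].parent < static_cast<int>(i));
        if (!isHidden(all, static_cast<int>(i))) out += all[i].leadingText;
    }
    return out;
}
```

For the three nested DIVs from the Delphi example (outer visible, middle with display none, inner with its own display left at "block") this collects only the outer DIV's text. Note that the model only looks at text right after each start tag; text that follows a child element would need separate handling.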

Answer 1: score 6, accepted, 2012-04-09 09:19:12
