
Obtaining visible text on a page from an IHTMLDocument2*

I am trying to obtain the text content of an Internet Explorer web browser window.

I am following these steps:

  1. Obtain a pointer to IHTMLDocument2.
  2. From the IHTMLDocument2, obtain the body as an IHTMLElement.
  3. On the body, call get_innerText.

Edit


  1. I obtain all the children of the body and make a recursive call on each of the IHTMLElements.
  2. If I get an element that is not visible, or an element whose tag is script, I ignore that element and all its children.

My problem is:

  1. Along with the text that is visible on the page, I also get content from elements that have style="display: none".
  2. For google.com, I also get JavaScript along with the text.

I have tried a recursive approach, but I am clueless as to how to deal with scenarios like this:

<div>
Hello World 1
<div style="display: none">Hello world 2</div>
</div>

In this scenario I won't be able to get "Hello World 1".

Can anyone please help me out with the best way to obtain the text from an IHTMLDocument2*? I am using C++ Win32, no MFC, ATL.

Thanks, Ashish.
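For the nested scenario in the question, the recursive approach can still recover "Hello World 1" if text is treated as its own child node rather than read from the parent's innerText. The sketch below is only a model of that idea: Node, textNode, element, visibleText and questionExample are hypothetical names, not MSHTML types, and the hidden flag stands in for whatever visibility test is used on the real element. With MSHTML, the same walk would presumably go through IHTMLDOMNode (child nodes and node type), which is an assumption and is not shown here.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a DOM node; this is not an MSHTML type.
struct Node {
    bool isText = false;         // true for a text node
    std::string tag;             // lower-case tag name; empty for text nodes
    std::string text;            // content; used only when isText is true
    bool hidden = false;         // true when the element's display is "none"
    std::vector<Node> children;  // child nodes in document order
};

// Builds a text node.
Node textNode(const std::string& s) {
    Node n;
    n.isText = true;
    n.text = s;
    return n;
}

// Builds an element node with the given children.
Node element(const std::string& tag, bool hidden, std::vector<Node> kids) {
    Node n;
    n.tag = tag;
    n.hidden = hidden;
    n.children = std::move(kids);
    return n;
}

// Appends the text of every visible text node under n to out.
void collectVisibleText(const Node& n, std::string& out) {
    if (n.isText) {
        out += n.text;
        return;
    }
    // A hidden element or a script element is skipped with its whole subtree.
    if (n.hidden || n.tag == "script") {
        return;
    }
    for (const Node& child : n.children) {
        collectVisibleText(child, out);
    }
}

std::string visibleText(const Node& root) {
    std::string out;
    collectVisibleText(root, out);
    return out;
}

// The markup from the question: a visible DIV holding text and a hidden DIV.
Node questionExample() {
    return element("div", false,
                   {textNode("Hello World 1"),
                    element("div", true, {textNode("Hello world 2")})});
}

// Small self-check: only the outer DIV's own text survives.
void demo() {
    assert(visibleText(questionExample()) == "Hello World 1");
}
```

Because the outer DIV's text is a sibling of the hidden DIV rather than a property of the outer DIV, skipping the hidden subtree does not lose it.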

If you iterate backwards over the document.body.all elements, you always visit the elements from the inside out, so you don't need to recurse yourself; the DOM does that for you. For example (code is in Delphi):

// Assumes the ComObj, MSHTML and Dialogs units are listed in the uses clause.
procedure Test();
var
  document, el: OleVariant;
  i: Integer;
begin
  document := CreateComObject(CLASS_HTMLDocument) as IDispatch;
  document.open;
  document.write('<div>Hello World 1<div style="display: none">Hello world 2<div>This DIV is also invisible</div></div></div>');
  document.close;
  for i := document.body.all.length - 1 downto 0 do // iterate backwards
  begin
    el := document.body.all.item(i);
    // filter the elements; el.style.display reflects only the inline style attribute
    if (el.style.display = 'none') then
    begin
      el.removeNode(true);
    end;
  end;
  ShowMessage(document.body.innerText);
end;
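Since the question asks about C++, here is a rough, platform-independent model of why iterating backwards is safe when elements are removed along the way. It does not call MSHTML: FlatElem, pruneHiddenAndJoin and answerExample are hypothetical names, the flat vector stands in for the document-order collection, and each element's text is simplified to a single ownText string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical flat view of a document-order element collection.
struct FlatElem {
    int depth;            // nesting depth; descendants follow with a larger depth
    bool hidden;          // true when the element's display is "none"
    std::string ownText;  // simplified: the element's own text
};

// Walks the collection from the last element to the first, removing every
// hidden element together with its descendants, then joins what is left.
std::string pruneHiddenAndJoin(std::vector<FlatElem> all) {
    for (std::size_t i = all.size(); i-- > 0;) {
        if (!all[i].hidden) {
            continue;
        }
        // Descendants of all[i] are the following entries with a larger depth.
        std::size_t end = i + 1;
        while (end < all.size() && all[end].depth > all[i].depth) {
            ++end;
        }
        // Only indices >= i are erased, so the entries still to be visited
        // (indices < i) keep their positions.
        all.erase(all.begin() + static_cast<std::ptrdiff_t>(i),
                  all.begin() + static_cast<std::ptrdiff_t>(end));
    }
    std::string out;
    for (const FlatElem& e : all) {
        out += e.ownText;
    }
    return out;
}

// The markup written by the Delphi example above.
std::vector<FlatElem> answerExample() {
    return {{0, false, "Hello World 1"},
            {1, true, "Hello world 2"},
            {2, false, "This DIV is also invisible"}};
}

// Small self-check: the hidden DIV and its child are both gone.
void demo() {
    assert(pruneHiddenAndJoin(answerExample()) == "Hello World 1");
}
```

The innermost DIV carries no display: none of its own, yet it disappears with its hidden parent, which mirrors what removing the parent node with its children does in the Delphi code.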

A side comment: as for your scenario with the recursive approach:

<div>Hello World 1<div style="display: none">Hello world 2</div></div>

If, e.g., our element is the first DIV, el.getAdjacentText('afterBegin') will return "Hello World 1". So we can probably iterate forward on the elements and collect the getAdjacentText('afterBegin'), but this is a bit more difficult because we need to test the parents of each element for el.currentStyle.display.
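The extra work in the forward variant is the ancestor test. A minimal sketch of that test, again as a model rather than MSHTML code: Item, isShown, collectLeadingText and sideCommentExample are hypothetical names, leadingText stands in for the result of getAdjacentText('afterBegin'), and displayNone stands in for a check on the element's own current style.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical flat view of the elements in document order.
struct Item {
    int parent;               // index of the parent element, or -1 for the root
    bool displayNone;         // true when this element's own display is "none"
    std::string leadingText;  // stand-in for the 'afterBegin' adjacent text
};

// An element is shown only if neither it nor any ancestor has display none.
bool isShown(const std::vector<Item>& all, int index) {
    for (int p = index; p != -1; p = all[static_cast<std::size_t>(p)].parent) {
        if (all[static_cast<std::size_t>(p)].displayNone) {
            return false;
        }
    }
    return true;
}

// Iterates forward and keeps the leading text of shown elements only.
std::string collectLeadingText(const std::vector<Item>& all) {
    std::string out;
    for (std::size_t i = 0; i < all.size(); ++i) {
        if (isShown(all, static_cast<int>(i))) {
            out += all[i].leadingText;
        }
    }
    return out;
}

// Outer DIV, hidden DIV, and a DIV nested inside the hidden one.
std::vector<Item> sideCommentExample() {
    return {{-1, false, "Hello World 1"},
            {0, true, "Hello world 2"},
            {1, false, "This DIV is also invisible"}};
}

// Small self-check: the nested DIV is rejected because of its parent.
void demo() {
    assert(collectLeadingText(sideCommentExample()) == "Hello World 1");
}
```

Walking the parent chain for every element costs more than the backward pass, which is the difficulty the side comment points out.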
