
Javascript: Retrieve window.open innerHTML

I would like to start by pointing out that I know this is probably failing because of cross-domain restrictions; I just want that confirmed.

I have a window which I open with JavaScript. I then use an Ajax request to pull the contents of a site and echo that into the new window, including a base href tag so that relative paths resolve correctly.

The idea is that I can scrape the JS-rendered HTML to see whether the site is really running our banners or not (we have a suspicion that they are not!).

I open the window with this:

msaScrape.msaWin = window.open ('null.php', 'msa_weed', "scrollbars=yes,toolbar=no,status=no,width=1000,height=1000");

This loads the new window with the contents of the target page and correctly loads and renders the JS-fired content too (the banners are the bit I'm after).

I have tried msaScrape.msaWin.document.body, msaScrape.msaWin.document.body.innerHTML and many, MANY other combinations, but none will give me back the fully rendered HTML.

When I run the test on the raw buffer from the Ajax request I can detect embedded strings fine, but since the banners are loaded via JS I need them to be loaded into the DOM before I can search the HTML for the banner ID.

Is what I am trying to do possible, or am I trying to do something that cannot be done? I find it odd that I can write into this popup window, and that I can scan (and find matches in) the raw, unrendered buffer. It's as soon as I have allowed the popup page to render the HTML that it falls down and I can't get at the source.

If required I can post the entire (small) JS bit that I use to do the scrape and match. I'm just checking with the client whether they mind me doing that (it's for a private client and I don't want to upset them!).

Here is how I got it to scan the innerHTML of a remotely loaded window:

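            // Give the popup a second to render its DOM before reading it.
            setTimeout(function () {
                // Focusing an input on the calling page hands focus back to it.
                window.parent.document.getElementById('stopScraper').focus();
                // Count a hit if the rendered HTML matches the test pattern.
                if (window.parent.msaScrape.msaWin.document.body.innerHTML.match(window.parent.msaScrape.msaTest)) {
                    window.parent.msaScrape.msaHits++;
                }
            }, 1000);
            window.parent.focus();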

stopScraper was just a form input that allowed me to give the focus back to the calling page.
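One detail of the match call above that may be worth knowing: when String.prototype.match is given a string, it converts that string to a regular expression, so characters such as "." or "[" in a banner ID are treated as pattern syntax. A minimal sketch of a literal search instead, assuming the test value is a plain string. `containsBanner` and the sample ID are hypothetical and not part of the original engine:

```javascript
// Hypothetical helper: literal (non-regex) search for a banner ID in HTML.
function containsBanner(html, bannerId) {
    // indexOf compares characters exactly, so "." and "[" have no special meaning.
    return html.indexOf(bannerId) !== -1;
}

// Illustrative markup and ID only.
var rendered = '<div id="ad.slot[1]">banner</div>';
console.log(containsBanner(rendered, 'ad.slot[1]'));  // true  (literal search)
console.log(rendered.match('ad.slot[1]') !== null);   // false (ID read as a regex)
```

If the test value is meant to be a pattern, passing a RegExp to match, as the code above allows, remains the better fit.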

The problem was caused by the popup not having enough time to render its DOM. (I also had to inject a base href="http://www.example.com" when I grabbed the content as a string with PHP, to ensure that paths worked when I echoed the string out into null.php.)
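The base href step was done in PHP in my setup; the following is only a JavaScript sketch of the same idea, for anyone who prefers to patch the fetched markup on the client. `injectBaseHref` is a hypothetical helper, and it assumes the page is ordinary HTML with at most one head element:

```javascript
// Hypothetical helper: add <base href="..."> to an HTML string if it has none.
function injectBaseHref(html, baseUrl) {
    // Leave the markup alone if a <base> tag is already present.
    if (/<base\b[^>]*>/i.test(html)) {
        return html;
    }
    var baseTag = '<base href="' + baseUrl + '">';
    // Preferred spot: immediately after the opening <head ...> tag.
    if (/<head\b[^>]*>/i.test(html)) {
        // A function replacement avoids "$" in baseUrl being read as a pattern.
        return html.replace(/<head\b[^>]*>/i, function (headTag) {
            return headTag + baseTag;
        });
    }
    // No <head> at all: prepend the tag and rely on the browser's lenient parser.
    return baseTag + html;
}

var patched = injectBaseHref(
    '<html><head><title>t</title></head><body></body></html>',
    'http://www.example.com'
);
console.log(patched);
```

The tag goes first in the head so that it is in place before any relative script, image or stylesheet URL is resolved.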

I ran it with an interval of 8.5 seconds between requests, then gave the popup another second to fully load its DOM before trying to read the content loaded by the in-page JS files.
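A fixed one-second wait was enough for my target site. If a page renders more slowly, one possible variation (not what I actually ran) is to retry the scan a few times before giving up. A rough sketch, where `waitForMatch`, its option names and the injectable `schedule` function are all hypothetical:

```javascript
// Hypothetical helper: retry the scan until the pattern appears or attempts run out.
// getHtml - function returning the current HTML string (may throw)
// pattern - string or RegExp, used the same way as in String.prototype.match
// options - { intervalMs, maxAttempts, schedule }; schedule defaults to setTimeout
// onDone  - called once with true (found) or false (gave up)
function waitForMatch(getHtml, pattern, options, onDone) {
    var attemptsLeft = options.maxAttempts;
    var schedule = options.schedule || setTimeout;

    function attempt() {
        var html = '';
        try {
            var value = getHtml();
            html = typeof value === 'string' ? value : '';
        } catch (e) {
            // Reading can throw while the popup is unavailable; treat as "no match yet".
        }
        if (html.match(pattern) !== null) {
            onDone(true);
            return;
        }
        attemptsLeft -= 1;
        if (attemptsLeft <= 0) {
            onDone(false);
            return;
        }
        schedule(attempt, options.intervalMs);
    }

    schedule(attempt, options.intervalMs);
}

// Browser usage sketch, reusing the msaScrape names from the code above:
// waitForMatch(
//     function () { return msaScrape.msaWin.document.body.innerHTML; },
//     msaScrape.msaTest,
//     { intervalMs: 250, maxAttempts: 8 },
//     function (found) { if (found) { msaScrape.msaHits++; } }
// );
```

The total wait is bounded by intervalMs times maxAttempts, so it should be kept below the gap between requests (8.5 seconds in my runs).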

Final results from live, cross-domain tests:

Requests: 4024 Scrapes: 4024 (didn't miss a beat!) Hits: 147 (I was looking for a particular banner in the DOM)

If people want more explanation of how I did this, it's probably better to email me and I'll just send you the whole engine. It has a test mode built in so you can test it before you try it on your other domain! It is several files though, and I'm not too sure about the legality of what I was doing, so I don't think I should make the whole answer public!

In short though: if you load your content via the same domain using PHP's file_get_contents, add the base href (if missing), and echo it as the content of null.php (opened as a popup using JavaScript, as shown in the question), the code here WILL match your test string against the fully loaded DOM.
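The reason this route works is that null.php is served from your own domain, so the popup's document is same-origin with the page that opened it and script is allowed to read it. If the popup had navigated to another origin instead, touching its document would throw a security error. A rough sketch of a guarded read under that assumption; `readPopupHtml` is a hypothetical name:

```javascript
// Hypothetical helper: return the popup's rendered body HTML, or null if unreadable.
function readPopupHtml(popup) {
    try {
        // A missing or closed window has nothing to read.
        if (!popup || popup.closed) {
            return null;
        }
        var body = popup.document && popup.document.body;
        return body ? body.innerHTML : null;
    } catch (e) {
        // Cross-origin access to popup.document throws; report "unreadable".
        return null;
    }
}

// Browser usage sketch, with msaScrape.msaWin being the popup from the question:
// var html = readPopupHtml(msaScrape.msaWin);
// if (html !== null && html.match(msaScrape.msaTest)) { msaScrape.msaHits++; }
```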

I would like to stress at this point that I needed to test everything (including banners loaded by external JS files), so I HAD to render the raw HTML in a browser to cause the JS to fire. I had also looked at PhantomJS but didn't need it in the end! Managed to solve the problem with nothing but JS :)
