
PhantomJS pass HTML string and return page source

For a web crawler project in C#, I am trying to execute JavaScript and Ajax to retrieve the full page source of a crawled page.

I am using an existing web crawler (Abot) that needs a valid HttpWebResponse object. Therefore I cannot simply use the driver.Navigate().GoToUrl() method to retrieve the page source.

The crawler downloads the page source, and I want to execute the existing JavaScript/Ajax inside that source.

In a sample project I tried the following, without success:

        WebClient wc = new WebClient();
        string content = wc.DownloadString("http://www.newegg.com/Product/Product.aspx?Item=N82E16834257697");
        string tmpPath = Path.Combine(Path.GetTempPath(), "temp.htm");
        File.WriteAllText(tmpPath, content);

        var driverService = PhantomJSDriverService.CreateDefaultService();            
        var driver = new PhantomJSDriver(driverService);
        driver.Navigate().GoToUrl(new Uri(tmpPath));
        string renderedContent = driver.PageSource;
        driver.Quit();

You need the following NuGet packages to run the sample: https://www.nuget.org/packages/phantomjs.exe/ and https://www.nuget.org/packages/selenium.webdriver

The problem here is that the code stops at GoToUrl(), and it takes several minutes until the program terminates, without ever giving me driver.PageSource.

Doing this returns the correct HTML:

driver.Navigate().GoToUrl("http://www.newegg.com/Product/Product.aspx?Item=N82E16834257697");
string renderedContent = driver.PageSource;

But I don't want to download the data twice. The crawler (Abot) downloads the HTML, and I just want to parse/render the JavaScript and Ajax.

Thank you!

Without running it, I would bet you need file:/// prior to tmpPath. That is:

    WebClient wc = new WebClient();
    string content = wc.DownloadString("http://www.newegg.com/Product/Product.aspx?Item=N82E16834257697");
    string tmpPath = Path.Combine(Path.GetTempPath(), "temp.htm");
    File.WriteAllText(tmpPath, content);

    var driverService = PhantomJSDriverService.CreateDefaultService();            
    var driver = new PhantomJSDriver(driverService);
    driver.Navigate().GoToUrl(new Uri("file:///" + tmpPath));
    string renderedContent = driver.PageSource;
    driver.Quit();

You probably need to allow PhantomJS to make arbitrary requests. Requests are blocked when the domain/protocol doesn't match, as is the case when a local file is opened.

var driverService = PhantomJSDriverService.CreateDefaultService();
driverService.LocalToRemoteUrlAccess = true;
driverService.WebSecurity = false; // may not be necessary
var driver = new PhantomJSDriver(driverService);

You might need to combine this with Dave Bush's solution:

driver.Navigate().GoToUrl(new Uri("file:///" + tmpPath));

Some of the resources have URLs that begin with //, which means that the protocol of the page is used when the browser retrieves those resources. When a local file is read, this protocol is file://, in which case none of those resources will be found. The protocol must be added to the local file in order to download all those resources.

File.WriteAllText(tmpPath, content.Replace("\"//", "\"http://"));

It is apparent from your output that you use PhantomJS 1.9.8. It may be the case that a newly introduced bug is responsible for this sort of thing. You should use PhantomJS 1.9.7 with driverService.SslProtocol = "tlsv1";


You should also enable the disk cache if you do this multiple times for the same domain. Otherwise, the resources are downloaded each time you try to scrape the page. This can be done with driverService.DiskCache = true;
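Putting the suggestions above together, here is a minimal sketch of the whole flow: download the page once, rewrite protocol-relative URLs, and render the local copy with PhantomJS. The URL and temp-file path are taken from the question; the PhantomJSDriverService property names are from the Selenium .NET bindings, and running this requires the PhantomJS binary to be available.

```csharp
using System;
using System.IO;
using System.Net;
using OpenQA.Selenium.PhantomJS;

// Download the page once (in the real project, Abot supplies this HTML).
WebClient wc = new WebClient();
string content = wc.DownloadString("http://www.newegg.com/Product/Product.aspx?Item=N82E16834257697");
string tmpPath = Path.Combine(Path.GetTempPath(), "temp.htm");

// Rewrite protocol-relative URLs ("//cdn...") so they still resolve
// when the page is opened via file://.
File.WriteAllText(tmpPath, content.Replace("\"//", "\"http://"));

var driverService = PhantomJSDriverService.CreateDefaultService();
driverService.LocalToRemoteUrlAccess = true;  // let the local file request remote resources
driverService.WebSecurity = false;            // may not be necessary
driverService.DiskCache = true;               // cache resources across runs on the same domain
driverService.SslProtocol = "tlsv1";          // for PhantomJS 1.9.7 with HTTPS resources

var driver = new PhantomJSDriver(driverService);
driver.Navigate().GoToUrl(new Uri("file:///" + tmpPath));
string renderedContent = driver.PageSource;   // HTML after JavaScript/Ajax execution
driver.Quit();
```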
