
PhantomJS: pass an HTML string and return the page source

For a web crawler project in C#, I am trying to execute JavaScript and Ajax to retrieve the full page source of a crawled page.

I am using an existing web crawler (Abot) that needs a valid HttpWebResponse object. Therefore I cannot simply use the driver.Navigate().GoToUrl() method to retrieve the page source.

The crawler downloads the page source, and I want to execute the existing JavaScript/Ajax inside that source.
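For context, this is roughly where the raw HTML comes out of Abot (a minimal sketch, assuming Abot 1.x's PageCrawlCompletedAsync event and the e.CrawledPage.Content.Text property; the exact names may differ between Abot versions):

    using System;
    using Abot.Crawler;   // PoliteWebCrawler (assumed Abot 1.x namespace)
    using Abot.Poco;      // CrawledPage

    // Hook Abot's completion event to grab the HTML it has already downloaded.
    var crawler = new PoliteWebCrawler();
    crawler.PageCrawlCompletedAsync += (sender, e) =>
    {
        CrawledPage crawledPage = e.CrawledPage;
        // Raw, un-rendered HTML as downloaded by Abot -- this is the string
        // I want to feed to PhantomJS instead of downloading the page again.
        string rawHtml = crawledPage.Content.Text;
    };
    crawler.Crawl(new Uri("http://www.newegg.com/Product/Product.aspx?Item=N82E16834257697"));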

In a sample project I tried the following without success:

        // Download the raw HTML once with WebClient and save it to a temp file.
        WebClient wc = new WebClient();
        string content = wc.DownloadString("http://www.newegg.com/Product/Product.aspx?Item=N82E16834257697");
        string tmpPath = Path.Combine(Path.GetTempPath(), "temp.htm");
        File.WriteAllText(tmpPath, content);

        // Open the saved file in PhantomJS and read back the rendered DOM.
        var driverService = PhantomJSDriverService.CreateDefaultService();
        var driver = new PhantomJSDriver(driverService);
        driver.Navigate().GoToUrl(new Uri(tmpPath));
        string renderedContent = driver.PageSource;
        driver.Quit();

You need the following NuGet packages to run the sample: https://www.nuget.org/packages/phantomjs.exe/ and http://www.nuget.org/packages/selenium.webdriver

The problem is that the code hangs at GoToUrl(), and it takes several minutes until the program terminates without ever giving me driver.PageSource.

Doing this returns the correct HTML:

    driver.Navigate().GoToUrl("http://www.newegg.com/Product/Product.aspx?Item=N82E16834257697");
    string renderedContent = driver.PageSource;

But I don't want to download the data twice. The crawler (Abot) downloads the HTML, and I just want to render/execute the JavaScript and Ajax inside it.

Thank you!

Without running it, I would bet you need file:/// prior to tmpPath. That is:

    WebClient wc = new WebClient();
    string content = wc.DownloadString("http://www.newegg.com/Product/Product.aspx?Item=N82E16834257697");
    string tmpPath = Path.Combine(Path.GetTempPath(), "temp.htm");
    File.WriteAllText(tmpPath, content);

    var driverService = PhantomJSDriverService.CreateDefaultService();            
    var driver = new PhantomJSDriver(driverService);
    driver.Navigate().GoToUrl(new Uri("file:///" + tmpPath));
    string renderedContent = driver.PageSource;
    driver.Quit();

You probably need to allow PhantomJS to make arbitrary requests. Requests are blocked when the domain/protocol doesn't match, as is the case when a local file is opened.

    var driverService = PhantomJSDriverService.CreateDefaultService();
    driverService.LocalToRemoteUrlAccess = true;
    driverService.WebSecurity = false; // may not be necessary
    var driver = new PhantomJSDriver(driverService);

You might need to combine this with Dave Bush's solution:

    driver.Navigate().GoToUrl(new Uri("file:///" + tmpPath));

Some of the resources have URLs that begin with //, which means the browser uses the protocol of the current page when it retrieves those resources. When a local file is read, that protocol is file://, in which case none of those resources will be found. The protocol must be added to those URLs in the local file so they can be downloaded.

    File.WriteAllText(tmpPath, content.Replace("\"//", "\"http://"));

It is apparent from your output that you use PhantomJS 1.9.8. It may be the case that a newly introduced bug is responsible for this sort of thing. You should use PhantomJS 1.9.7 with driverService.SslProtocol = "tlsv1".


You should also enable the disk cache if you do this multiple times for the same domain. Otherwise, the resources are downloaded each time you try to scrape it. This can be done with driverService.DiskCache = true;
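Putting the suggestions above together, a combined sketch of the driver setup might look like this (it uses only the PhantomJSDriverService properties already mentioned; WebSecurity = false may not be needed in every case):

    var driverService = PhantomJSDriverService.CreateDefaultService();
    driverService.LocalToRemoteUrlAccess = true; // let the local file request remote resources
    driverService.WebSecurity = false;           // may not be necessary
    driverService.SslProtocol = "tlsv1";         // for the PhantomJS 1.9.7 suggestion above
    driverService.DiskCache = true;              // reuse downloaded resources across runs
    var driver = new PhantomJSDriver(driverService);

    // Make protocol-relative URLs ("//...") resolvable from a local file,
    // then load the saved page via an explicit file:/// URL.
    File.WriteAllText(tmpPath, content.Replace("\"//", "\"http://"));
    driver.Navigate().GoToUrl(new Uri("file:///" + tmpPath));
    string renderedContent = driver.PageSource;
    driver.Quit();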
