
CasperJS returns different results when a specific user agent string is set while scraping Google search

I'm loading a Google search page with a preset search term ("Apples"). Then I want to type into the search box to find something else, but it doesn't behave as expected (detailed description below the code).

var links = [];
var casper = require('casper').create({
    // verbose: true, 
    // logLevel: "debug",
    // pageSettings: {
    //  userAgent: 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.56 Safari/536.5'
    // }
});

function getLinks() {
    var links = document.querySelectorAll('h3.r a');
    return Array.prototype.map.call(links, function(e) {
        return e.innerText;
    });
}

casper.start('https://www.google.com/#safe=off&q=Apples', function() {
    // search for 'casperjs' from google form
    this.fill('form[action="/search"]', { q: 'casperjs' }, true);
    casper.capture('screenshot/googleresults1.png');

});

casper.then(function() {
    // aggregate results for the 'casperjs' search
    links = this.evaluate(getLinks);
    casper.capture('screenshot/googleresults2.png');
    // now search for 'phantomjs' by filling the form again
    this.fill('form[action="/search"]', { q: 'phantomjs' }, true);

});

casper.then(function() {
    // aggregate results for the 'phantomjs' search
    links = links.concat(this.evaluate(getLinks));
});

casper.run(function() {
    // echo results in some pretty fashion
    this.echo(links.length + ' links found:');
    casper.capture('screenshot/googleresults3.png');
    this.echo(' - ' + links.join('\n - ')).exit();
});

The bugs I experienced:

  • Including the user agent in .create() gives me no results in the console.
  • Commenting out the user agent but enabling verbose and logLevel gives me "Apples" results.
  • Commenting out everything gives me the right results (casperjs and phantomjs).

My questions:

  1. Why does turning on both verbose and logLevel give me the "Apples" results, i.e. the results for the search term preset in the casper.start URL?
  2. Why does turning on the user agent give me 0 results?

Is anyone else getting this? As you can see, the right results should be for casperjs and phantomjs, since both fill calls enter those terms into the search box.

Screenshots of my 3 captures: screenshot 1, screenshot 2, screenshot 3.

After running the script in my console a few times, it appears that on some occasions the first fill action does not go through, so the script scrapes the "Apples" results instead. Why is this? Should I use another function instead?

Google delivers different pages depending on the user agent, viewport size and other metrics.

The differences can show up as additional JavaScript that does not run correctly in PhantomJS (clicking and submitting things is always a problem). It is also possible that elements are added or removed, or that their IDs change, between different configurations (user agent, viewport size).

You should take screenshots (casper.capture(filename)) and save the current page source (fs.write(filename, casper.getHTML())) to see whether the page differs from what you see in your desktop browser.
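As a rough sketch of that debugging step, something like the following could dump both artifacts at any point in the script. The helper name dumpPage, the file labels, and the viewport size are my own choices for illustration, not anything CasperJS prescribes; the user agent string is the one from the question. The typeof phantom check only keeps the navigation part from running outside PhantomJS/CasperJS.

```javascript
// Sketch only: dumpPage and the file names are hypothetical, not CasperJS API.
// Saves a screenshot and the current page source under the same label.
function dumpPage(casper, fs, label) {
    casper.capture(label + '.png');
    fs.write(label + '.html', casper.getHTML(), 'w');
    return [label + '.png', label + '.html'];
}

// The global `phantom` only exists when run through PhantomJS/CasperJS.
if (typeof phantom !== 'undefined') {
    var fs = require('fs');
    var casper = require('casper').create({
        pageSettings: {
            userAgent: 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.56 Safari/536.5'
        },
        // Assumed viewport; change it to match your desktop browser.
        viewportSize: { width: 1280, height: 800 }
    });

    casper.start('https://www.google.com/#safe=off&q=Apples', function () {
        // `this` is the casper instance inside a step callback.
        dumpPage(this, fs, 'after-load');
    });

    casper.run();
}
```

Running it once with and once without the pageSettings block and then diffing the two .html files should show whether the form or the result markup changes with the user agent.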


Specific issues in your script:

  • If there is no page load, then you should use one of the casper.wait* functions to wait for the changed content. casper.then() is an asynchronous step function that usually only catches full page loads.
    On that note, casper.fill() finishes immediately, but it may take a while until the content for the typed-in term is actually loaded. Therefore, calling casper.capture() immediately after casper.fill() will not give the intended result.

  • this inside a CasperJS step function always refers to casper, so you can use the two interchangeably.
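To illustrate the two points above, here is one way the search steps could be restructured around casper.waitForSelector() and casper.waitFor(). Treat it as a sketch under assumptions: searchAndCollect and resultsReady are made-up helper names, the 10 second timeout is arbitrary, and the readiness check assumes that the results page title contains the search term once the new results are in, which you should verify against your own screenshots.

```javascript
// Sketch only. searchAndCollect and resultsReady are hypothetical helper
// names, not part of the CasperJS API.

// Runs in the page context; the selector is the one from the question.
function getLinks() {
    var links = document.querySelectorAll('h3.r a');
    return Array.prototype.map.call(links, function (e) {
        return e.innerText;
    });
}

// Runs in the page context. Assumption: the page title contains the search
// term once the new results have replaced the old ones.
function resultsReady(term) {
    return document.title.indexOf(term) !== -1 &&
        document.querySelectorAll('h3.r a').length > 0;
}

function searchAndCollect(casper, term, results) {
    // Wait until the search form exists before filling it.
    casper.waitForSelector('form[action="/search"]', function () {
        // `this` is the casper instance here, so either name works.
        this.fill('form[action="/search"]', { q: term }, true);
    });
    // fill() returns immediately, so poll until the new results are in.
    casper.waitFor(function check() {
        return casper.evaluate(resultsReady, term);
    }, function then() {
        Array.prototype.push.apply(results, casper.evaluate(getLinks));
    }, function onTimeout() {
        casper.echo('Timed out waiting for results for "' + term + '"');
    }, 10000);
}

// The global `phantom` only exists when run through PhantomJS/CasperJS.
if (typeof phantom !== 'undefined') {
    var casper = require('casper').create();
    var links = [];

    casper.start('https://www.google.com/#safe=off&q=Apples');
    searchAndCollect(casper, 'casperjs', links);
    searchAndCollect(casper, 'phantomjs', links);

    casper.run(function () {
        this.echo(links.length + ' links found:');
        this.echo(' - ' + links.join('\n - ')).exit();
    });
}
```

Because each search waits on a condition rather than on a page load, the "Apples" results should no longer be scraped by accident when the first fill is slow. If the timeout message shows up, the form or the result markup is probably different for your configuration.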
