phantomjs: document.querySelectorAll() not working for dynamic page

I am just trying to get the deal items from this Amazon URL (the one passed to page.open() in the script below):

When I open this link in a browser and run the query in the console, it works: document.querySelectorAll('div[class*="DealItem-module__dealItem_"]')

Screenshot: document.querySelectorAll() result

But when I try to fetch this through the following PhantomJS script, it always seems to return nothing:

var page = require('webpage').create();

page.viewportSize = { height: 800, width: 1920 }; // BRODIE : CHROME

page.customHeaders = {
  accept:
    'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  // 'accept-encoding': 'gzip, deflate, br',
  'accept-language': 'en-US,en;q=0.9',
  dnt: '1',
  'sec-ch-ua':
    '" Not A;Brand";v="99", "Chromium";v="90", "Microsoft Edge";v="90"',
  'sec-ch-ua-mobile': '?0',
  'sec-fetch-dest': 'document',
  'sec-fetch-mode': 'navigate',
  'sec-fetch-site': 'none',
  'sec-fetch-user': '?1',
  'upgrade-insecure-requests': '1',
  'user-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36 Edg/90.0.818.66',
};

page.settings.javascriptEnabled = true;
page.settings.loadImages = false;
//Script is much faster with this field set to false
phantom.cookiesEnabled = true;
phantom.javascriptEnabled = true;

page.onConsoleMessage = function (message) {
  console.log('console.log() -- ', message);
}; // BUBBLE UP LOGS FROM BROWSER CONSOLE TO PHANTOM CONSOLE

page.onLoadStarted = function () {
  loadInProgress = true;
  console.log('page loading started');
};
page.onLoadFinished = function () {
  loadInProgress = false;
  console.log('page loading finished');
};

page.onError = function (msg, trace) {
  console.log(msg);
  trace.forEach(function (item) {
    console.log('  ', item.file, ':', item.line);
  });
};

// OPEN PAGE
console.log('page.open()');
page.open(
  'https://www.amazon.com/gp/goldbox/ref=gbps_ftr_s-5_cd34_wht_26179410?gb_f_deals1=sortOrder:BY_SCORE,includedAccessTypes:GIVEAWAY_DEAL,enforcedCategories:2617941011&pf_rd_p=fd51d8cf-b5df-4144-8086-80096db8cd34&pf_rd_s=slot-5&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=A89BX6V6RQRQ94NFA0DP&ie=UTF8',
  function (status) {
    if (status !== 'success')
      console.log('U N A B L E   T O   O P E N   P A G E . . .');
    else console.log(' P A G E   O P E N E D . . .');

    var selector = 'div[class*="DealItem-module__dealItem_"]'

    var findAll = setInterval(function () {
      console.log('trying to fetch deals...');
      var deals = page.evaluate(function (sel) {
        return document.querySelectorAll(
          'div[class*="DealItem-module__dealItem_"]'
        );
      }, selector);

      if(deals.length) {
        console.log('deals.length', deals.length);
        clearInterval(findAll);
      }
    }, 1000);
  }
);

Also, when I take a screenshot using page.render(), it shows the page with unloaded/unfinished JS, which is different from what appears when we open that URL in a browser:

Screenshot: page with unfinished/unloaded JS

I also noticed that when I run this script in a terminal, I get some JS errors from the webpage:

Screenshot: PhantomJS webpage errors

Any help will be greatly appreciated.

According to the documentation on the evaluate method in PhantomJS:

Note: The arguments and the return value to the evaluate function must be a simple primitive object. The rule of thumb: if it can be serialized via JSON, then it is fine.

Closures, functions, DOM nodes, etc. will not work!

document.querySelectorAll() returns a NodeList of DOM nodes, so it cannot be passed back out of evaluate. Instead, perform the length calculation inside the evaluate callback and return only the resulting number, which is a simple primitive.
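A minimal sketch of that advice, assuming a PhantomJS page object like the one in the question. The helper names countMatches and collectText are made up for illustration; only page.evaluate and document.querySelectorAll are real APIs here.

```javascript
// Hypothetical helper: counts matching elements inside the page context.
// Only the number crosses back to the PhantomJS script, so it serializes cleanly.
function countMatches(page, selector) {
  return page.evaluate(function (sel) {
    // Runs in the page sandbox; `sel` is the serialized copy of `selector`.
    return document.querySelectorAll(sel).length;
  }, selector);
}

// Hypothetical helper: same idea, but returns plain strings instead of DOM nodes.
function collectText(page, selector) {
  return page.evaluate(function (sel) {
    var nodes = document.querySelectorAll(sel);
    var texts = [];
    for (var i = 0; i < nodes.length; i++) {
      texts.push(nodes[i].textContent);
    }
    return texts;
  }, selector);
}
```

Inside the question's setInterval callback, this would replace the evaluate call with something like var count = countMatches(page, selector); followed by a check on count rather than deals.length.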

Thanks for the answers, Leftium and James. I've tried waitFor.js and other suggestions on Stack Overflow, but none of them worked. I am now using Nightmare.js and it works, with the help of "Nightmare.js - Asynchronous operations and loops" and "Looping through pages when next is available #402".

Knowing how to do it with PhantomJS would still be nice, though.

The reason document.querySelectorAll('div[class*="DealItem-module__dealItem_"]') returns results in the browser console but not in the PhantomJS script is that the two are running against different versions of the page:

  1. In the PhantomJS script, you are not logged in, so Amazon shows a "Sign In" page instead of the list of deals.
    • This is confirmed by the screenshot from PhantomJS page.render().
  2. In the browser, you are logged in, so the Amazon site shows a list of deals that includes DOM elements matching this query.
    • You may further confirm that document.querySelectorAll() returns nothing in the browser if you are logged out of Amazon or using an incognito window. (Interestingly, that Amazon URL does show a list of deals for me while logged out in incognito mode. Amazon may only show the sign-in message if it suspects an automated bot is accessing the URL...)
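One way to check which version of the page PhantomJS received is to return a small, JSON-friendly summary from the page context. This is only a sketch: describePage is a hypothetical helper, and treating a password input as a sign of a sign-in page is an assumption, not something Amazon documents.

```javascript
// Hypothetical helper: summarizes what PhantomJS actually loaded.
// Everything returned is a plain value, so it survives page.evaluate's serialization.
function describePage(page, dealSelector) {
  return page.evaluate(function (sel) {
    return {
      title: document.title,
      // Assumption: the presence of a password input suggests a sign-in page.
      hasPasswordField: document.querySelector('input[type="password"]') !== null,
      dealCount: document.querySelectorAll(sel).length
    };
  }, dealSelector);
}
```

Logging JSON.stringify(describePage(page, selector)) from the page.open() callback would show whether the script is looking at a sign-in page or at the deals list.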

To get the PhantomJS script to scrape the same page as the one you see in your browser, you must first sign in to Amazon in the PhantomJS headless browser. (PhantomJS is a separate headless browser, so it does not share the session of the browser you normally use.) There are a few different ways to do this:

  • Manually sign in from the PhantomJS script. This is not simple; you must:
    1. Load the sign-in page.
    2. Find the user ID field and fill in your ID.
    3. Find the password field and fill in your password.
    4. Find the submit button and click it.
    5. Solve any CAPTCHAs Amazon challenges you with.
  • Get the Amazon session cookies after you sign in to Amazon in your browser, then use those cookies in your PhantomJS script. This is generally easier, but it doesn't always work and must be repeated each time the cookies expire.
  • Tell PhantomJS to use the browser that holds your Amazon session cookies. I am not sure whether this is possible with PhantomJS, but it is a configurable setting in NickJS (another scriptable headless browser).
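The cookie option above could be sketched roughly as follows, assuming you have copied the cookie names and values out of your own signed-in browser session. addSessionCookies and exampleCookies are hypothetical names, and the cookie shown is a placeholder, not a real Amazon cookie; phantom.addCookie is the PhantomJS call that registers a cookie and returns true when it was accepted.

```javascript
// Hypothetical helper: registers cookies copied from a signed-in browser session.
// `phantomObj` is PhantomJS's global `phantom`; it is passed in to keep the helper testable.
function addSessionCookies(phantomObj, cookies) {
  var added = 0;
  for (var i = 0; i < cookies.length; i++) {
    // phantom.addCookie returns true when the cookie was accepted.
    if (phantomObj.addCookie(cookies[i])) {
      added++;
    }
  }
  return added;
}

// Placeholder values only: real names and values must come from your own browser session.
var exampleCookies = [
  {
    name: 'example-session-cookie',
    value: 'PASTE_VALUE_HERE',
    domain: '.amazon.com',
    path: '/'
  }
];
```

In a PhantomJS script this would be called as addSessionCookies(phantom, exampleCookies) before page.open(). If the returned count is lower than the number of cookies supplied, some of them were rejected.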

Since Amazon sometimes shows the list of deals even when you are not signed in, you may be able to get the list without signing in by making PhantomJS look like a real browser: ensure PhantomJS sends the same cookies and User-Agent string that a real browser would.
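A small sketch of the User-Agent part, assuming a PhantomJS page object. applyBrowserIdentity is a hypothetical helper; page.settings.userAgent and page.customHeaders are the PhantomJS properties it sets. Whether this is enough to get past Amazon's bot detection is not something this sketch can promise.

```javascript
// Hypothetical helper: applies a browser-like identity to a PhantomJS page.
function applyBrowserIdentity(page, userAgent, acceptLanguage) {
  // page.settings.userAgent is PhantomJS's dedicated setting for the User-Agent string.
  page.settings.userAgent = userAgent;

  // Copy any headers already configured, then add Accept-Language on top.
  var headers = {};
  var existing = page.customHeaders || {};
  for (var key in existing) {
    if (Object.prototype.hasOwnProperty.call(existing, key)) {
      headers[key] = existing[key];
    }
  }
  headers['Accept-Language'] = acceptLanguage;
  page.customHeaders = headers;
  return page;
}
```

It would be called once after require('webpage').create() and before page.open(), passing the User-Agent string of the browser you are trying to match.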

Finally: large sites like Amazon and Google are very good at detecting automated bots and preventing them from scraping their sites. You will likely face many more obstacles in the future!


Update:

I just checked the Amazon URL, and there are indeed HTTP-only cookies. This type of cookie cannot be accessed (neither read nor written) by JavaScript running in the page, for example through document.cookie. So there is a good chance PhantomJS cannot read or write these cookies, which would leave manually logging in via the PhantomJS script as the remaining option.
