How to scrape JavaScript hash link content?

Question

Hi, I'm fairly new to web scraping with Puppeteer, and I'm currently facing the following problem:

On the site I'm trying to extract information from, there is a Bootstrap table with typical JS pagination, like the examples at https://getbootstrap.com/docs/4.1/components/pagination/

When I inspect the page HTML with the Chrome inspector, all I can see for the page link is 2, and when I check the link location I see

https://webpage.com/works#

How can I find out how many pages there are in total, and how can I click them? I don't understand how to visit every page with this type of pagination.

Thanks!

Answer 1

There is no foolproof way, but I handle pagination in this order:

Wait for the target element to appear
Collect the data from the target
Remove the target element
Click the next button
...loop until there is no next button, or the content doesn't load even after waiting
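
The steps above can be sketched as a single loop. This is only a rough sketch, not the answer's exact code: the function name scrapeAllPages, the default selectors ('h1', '.next', '.next.disabled') and the maxPages safety cap are assumptions for illustration and will differ from site to site. It relies on the Puppeteer Page methods waitForSelector, $eval and $.

```javascript
// Sketch: walk JS-driven pagination until the "next" control is missing or disabled.
// `page` is expected to be a Puppeteer Page; every selector here is an assumption.
async function scrapeAllPages(page, options = {}) {
  const {
    targetSelector = 'h1',
    nextSelector = '.next',
    disabledNextSelector = '.next.disabled',
    maxPages = 50, // safety cap so a broken "next" button cannot loop forever
    timeout = 5000,
  } = options;

  const results = [];
  for (let i = 0; i < maxPages; i++) {
    // 1. Wait for the target element; stop if it never shows up.
    try {
      await page.waitForSelector(targetSelector, { timeout });
    } catch (err) {
      break;
    }

    // 2 + 3. Collect the data, then remove the element so the next wait
    // cannot be satisfied by content that was already scraped.
    const text = await page.$eval(targetSelector, (el) => {
      const value = el.innerText;
      el.remove();
      return value;
    });
    results.push(text);

    // 4. Click "next" unless it is missing or disabled.
    const nextDisabled = (await page.$(disabledNextSelector)) !== null;
    const nextExists = (await page.$(nextSelector)) !== null;
    if (nextDisabled || !nextExists) break;
    await page.$eval(nextSelector, (el) => el.click());
  }
  return results;
}
```

Called as `const titles = await scrapeAllPages(page);` after `page.goto(...)`, it returns one entry per page visited, so the number of pages does not need to be known in advance.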

Proof of concept:

Target HTML code:

<!-- Copied from: https://jsfiddle.net/solodev/yw7y4wez -->
<!DOCTYPE html>
<html>
<head>
  <meta http-equiv="content-type" content="text/html; charset=UTF-8">
  <title>Pagination Example</title>
  <meta http-equiv="content-type" content="text/html; charset=UTF-8">
  <meta name="robots" content="noindex, nofollow">
  <meta name="googlebot" content="noindex, nofollow">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js"></script>
  <link rel="stylesheet" type="text/css" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css">
  <script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js"></script>
  <script type="text/javascript" src="https://www.solodev.com/assets/pagination/jquery.twbsPagination.js"></script>
  <style type="text/css">
    .container { margin-top: 20px; }
    .page { display: none; }
    .page-active { display: block; }
  </style>
  <script type="text/javascript">
    window.onload = function() {
      $('#pagination-demo').twbsPagination({
        totalPages: 5,
        // the current page that show on start
        startPage: 1,
        // maximum visible pages
        visiblePages: 5,
        initiateStartPageClick: true,
        // template for pagination links
        href: false,
        // variable name in href template for page number
        hrefVariable: '{{number}}',
        // Text labels
        first: 'First',
        prev: 'Previous',
        next: 'Next',
        last: 'Last',
        // carousel-style pagination
        loop: false,
        // callback function
        onPageClick: function(event, page) {
          $('.page-active').removeClass('page-active');
          $('#page' + page).addClass('page-active');
        },
        // pagination Classes
        paginationClass: 'pagination',
        nextClass: 'next',
        prevClass: 'prev',
        lastClass: 'last',
        firstClass: 'first',
        pageClass: 'page',
        activeClass: 'active',
        disabledClass: 'disabled'
      });
    }
  </script>
</head>
<body>
  <div class="container">
    <div class="jumbotron page" id="page1">
      <div class="container">
        <h1 class="display-3">Adding Pagination to your Website</h1>
        <p class="lead">In this article we teach you how to add pagination, an excellent way to navigate large amounts of content, to your website using a jQuery Bootstrap Plugin.</p>
        <p><a class="btn btn-lg btn-success" href="https://www.solodev.com/blog/web-design/adding-pagination-to-your-website.stml" role="button">Learn More</a></p>
      </div>
    </div>
    <div class="jumbotron page" id="page2">
      <h1 class="display-3">Not Another Jumbotron</h1>
      <p class="lead">Cras justo odio, dapibus ac facilisis in, egestas eget quam. Fusce dapibus, tellus ac cursus commodo, tortor mauris condimentum nibh, ut fermentum massa justo sit amet risus.</p>
      <p><a class="btn btn-lg btn-success" href="#" role="button">Sign up today</a></p>
    </div>
    <div class="jumbotron page" id="page3">
      <h1 class="display-3">Data. Data. Data.</h1>
      <p>This example is a quick exercise to illustrate how the default responsive navbar works. It's placed within a <code>.container</code> to limit its width and will scroll with the rest of the page's content. </p>
      <p> <a class="btn btn-lg btn-primary" href="../../components/navbar" role="button">View navbar docs »</a> </p>
    </div>
    <div class="jumbotron page" id="page4">
      <h1 style="-webkit-user-select: auto;">Buy Now!</h1>
      <p class="lead" style="-webkit-user-select: auto;">Cras justo odio, dapibus ac facilisis in, egestas eget quam. Fusce dapibus, tellus ac cursus commodo, tortor mauris condimentum nibh, ut fermentum massa justo sit amet.</p>
      <p style="-webkit-user-select: auto;"><a class="btn btn-lg btn-success" href="#" role="button" style="-webkit-user-select: auto;">Get started today</a></p>
    </div>
    <div class="jumbotron page" id="page5">
      <h1 class="cover-heading">Cover your page.</h1>
      <p class="lead">Cover is a one-page template for building simple and beautiful home pages. Download, edit the text, and add your own fullscreen background photo to make it your own.</p>
      <p class="lead"> <a href="#" class="btn btn-lg btn-primary">Learn more</a> </p>
    </div>
    <ul id="pagination-demo" class="pagination-lg pull-right"></ul>
  </div>
  <script>
    // tell the embed parent frame the height of the content
    if (window.parent && window.parent.parent) {
      window.parent.parent.postMessage(["resultsFrame", {
        height: document.body.getBoundingClientRect().height,
        slug: "yw7y4wez"
      }], "*")
    }
  </script>
</body>
</html>

Here is a sample working version of the code:

const puppeteer = require('puppeteer');

async function runScraper() {
  let browser = {};
  let page = {};
  const url = 'http://localhost:8080';

  // open the page and wait
  async function navigate() {
    browser = await puppeteer.launch({ headless: false });
    page = await browser.newPage();
    await page.goto(url);
  }

  async function scrapeData() {
    const headerSel = 'h1';
    // wait for the element (page.waitFor was deprecated and later removed from Puppeteer)
    await page.waitForSelector(headerSel);
    return page.evaluate((selector) => {
      const target = document.querySelector(selector);

      // get the data
      const text = target.innerText;

      // remove element so the waiting function works
      target.remove();
      return text;
    }, headerSel);
  }

  // this is a sample pagination concept
  // it will vary from site to site because not all sites use the same type of pagination

  async function paginate() {
    // manually check if the next button is available or not
    const nextBtnDisabled = !!(await page.$('.next.disabled'));
    if (!nextBtnDisabled) {
      // since it's not disabled, click it
      await page.evaluate(() => document.querySelector('.next').click());

      // short fixed pause so the page can update (originally page.waitFor(100))
      await new Promise((resolve) => setTimeout(resolve, 100));
      return true;
    }
    console.log({ nextBtnDisabled });
    return false;
  }

  /**
   * Scraping Logic
   */
  await navigate();

  // Scrape 5 pages
  for (const pageNum of [...Array(5).keys()]) {
    const title = await scrapeData();
    console.log(pageNum + 1, title);
    await paginate();
  }

  // close the browser so the Node process can exit
  await browser.close();
}

runScraper();

Result:

Server running at 8080
1 'Adding Pagination to your Website'
2 'Not Another Jumbotron'
3 'Data. Data. Data.'
4 'Buy Now!'
5 'Cover your page.'
{ nextBtnDisabled: true }

I did not share the server code; it basically just serves the HTML snippet above.
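
On the question of how many pages there are in total: one possible approach, sketched here under assumptions, is to read the labels of the pagination items and take the largest number. The names highestPageLabel and estimateTotalPages and the '.pagination li' selector are hypothetical and must be adapted to the real markup. If the widget only shows a window of page numbers, the value is just the highest visible page, not necessarily the true total.

```javascript
// Pure helper: given the text of each pagination item,
// return the highest numeric label, or 0 if there is none.
function highestPageLabel(labels) {
  const numbers = labels
    .map((label) => label.trim())
    .filter((label) => /^\d+$/.test(label))
    .map(Number);
  return numbers.length > 0 ? Math.max(...numbers) : 0;
}

// Puppeteer wrapper; `page` is a Puppeteer Page and the selector is an assumption.
async function estimateTotalPages(page, itemSelector = '.pagination li') {
  const labels = await page.$$eval(itemSelector, (items) =>
    items.map((item) => item.textContent || '')
  );
  return highestPageLabel(labels);
}
```

For a pager rendered as First, Previous, 1 to 5, Next, Last, `highestPageLabel` returns 5; non-numeric labels are ignored.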

Answer 2

Use the footerTemplate option together with displayHeaderFooter to show page numbers natively through the Puppeteer API:

await page.pdf({
  path: 'hacks.pdf',
  format: 'A4',
  displayHeaderFooter: true,
  footerTemplate: '<div><div class="pageNumber"></div> <div>/</div><div class="totalPages"></div></div>'
});

https://github.com/puppeteer/puppeteer/blob/master/docs/api.md#pagepdfoptions

footerTemplate: HTML template for the print footer.

// Should be valid HTML markup with the following CSS classes used to inject printing values into them:

// - date: formatted print date

// - title: document title

// - url: document location

// - pageNumber: current page number

// - totalPages: total pages in the document
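
As a small sketch of how these classes might be combined, the helper below builds an options object for page.pdf(). The name buildPdfOptions is hypothetical, and the explicit font size and page margins are assumptions based on common practice (footers are often reported as invisible without them), not something the documentation excerpt above requires.

```javascript
// Sketch: build page.pdf() options with a "current / total" footer.
// The inline style and margin values are assumptions; adjust them as needed.
function buildPdfOptions(path) {
  const footerTemplate =
    '<div style="font-size: 10px; width: 100%; text-align: center;">' +
    '<span class="pageNumber"></span> / <span class="totalPages"></span>' +
    '</div>';

  return {
    path,
    format: 'A4',
    displayHeaderFooter: true,
    footerTemplate,
    // leave room at the top and bottom of each sheet for header and footer
    margin: { top: '20mm', bottom: '20mm' },
  };
}
```

It would be used as `await page.pdf(buildPdfOptions('hacks.pdf'));`. Note that this numbers the pages of the generated PDF; it does not count the pages of an on-screen pagination widget.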