How to scrape JavaScript hash link content?

Hi, I'm fairly new to web scraping with Puppeteer, and I'm currently facing the following problem:

On the site I'm trying to extract information from, there is a Bootstrap table with typical JS pagination, like the examples at https://getbootstrap.com/docs/4.1/components/pagination/

When I inspect the page HTML with the Chrome inspector, all I can see is 2, and when I check the link location I see

https://webpage.com/works#

How can I find out how many pages there are in total, and how can I click them? I don't understand how to visit every page with this type of pagination.

Thanks!

There is no foolproof way, but I handle pagination in this order:

  • Wait for the target element to appear
  • Collect the data from the target
  • Remove the target element
  • Click the next button
  • Loop until there is no next button, or until the content fails to load even after waiting
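
The steps above can be sketched as one reusable loop that stops on its own instead of relying on a known page count. This is only a rough sketch, not a definitive implementation: the function name scrapeAllPages, its option names, and every selector passed to it are hypothetical, and it assumes `page` is a Puppeteer Page (it only uses page.waitForSelector, page.$eval and page.$).

```javascript
// Sketch only: `page` is assumed to be a Puppeteer Page. The function name,
// option names and selectors are hypothetical and must be adapted per site.
async function scrapeAllPages(page, options) {
  const {
    itemSelector,          // element that holds the data on each page
    nextSelector,          // the "next" button
    disabledNextSelector,  // matches the "next" button once it is disabled
    maxPages = 100,        // safety cap so a broken site cannot loop forever
    timeout = 5000,        // how long to wait for content, in milliseconds
  } = options;

  const results = [];
  for (let i = 0; i < maxPages; i += 1) {
    // 1. Wait for the target element; treat any failure (normally a
    //    timeout) as "no more content" and stop.
    try {
      await page.waitForSelector(itemSelector, { timeout });
    } catch (err) {
      break;
    }

    // 2 + 3. Collect the data, then remove the element so that the next
    // wait is not satisfied by the element that was already scraped.
    const text = await page.$eval(itemSelector, (el) => {
      const value = el.innerText;
      el.remove();
      return value;
    });
    results.push(text);

    // 4. Click "next" unless the button is missing or disabled.
    const nextMissing = (await page.$(nextSelector)) === null;
    const nextDisabled = (await page.$(disabledNextSelector)) !== null;
    if (nextMissing || nextDisabled) break;
    await page.$eval(nextSelector, (el) => el.click());
  }
  return results;
}
```

For a page like the proof of concept that follows, a call could look like scrapeAllPages(page, { itemSelector: 'h1', nextSelector: '.next', disabledNextSelector: '.next.disabled' }). The maxPages cap is a design choice of this sketch: it keeps a site whose next button never becomes disabled from looping forever.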

Proof of concept:

Target HTML code:

<!-- Copied from: https://jsfiddle.net/solodev/yw7y4wez -->
<!DOCTYPE html>
<html>
<head>
  <meta http-equiv="content-type" content="text/html; charset=UTF-8">
  <title>Pagination Example</title>
  <meta http-equiv="content-type" content="text/html; charset=UTF-8">
  <meta name="robots" content="noindex, nofollow">
  <meta name="googlebot" content="noindex, nofollow">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js"></script>
  <link rel="stylesheet" type="text/css" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css">
  <script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js"></script>
  <script type="text/javascript" src="https://www.solodev.com/assets/pagination/jquery.twbsPagination.js"></script>
  <style type="text/css">
    .container { margin-top: 20px; }
    .page { display: none; }
    .page-active { display: block; }
  </style>
  <script type="text/javascript">
    window.onload = function() {
      $('#pagination-demo').twbsPagination({
        totalPages: 5,
        // the current page that show on start
        startPage: 1,
        // maximum visible pages
        visiblePages: 5,
        initiateStartPageClick: true,
        // template for pagination links
        href: false,
        // variable name in href template for page number
        hrefVariable: '{{number}}',
        // Text labels
        first: 'First',
        prev: 'Previous',
        next: 'Next',
        last: 'Last',
        // carousel-style pagination
        loop: false,
        // callback function
        onPageClick: function(event, page) {
          $('.page-active').removeClass('page-active');
          $('#page' + page).addClass('page-active');
        },
        // pagination Classes
        paginationClass: 'pagination',
        nextClass: 'next',
        prevClass: 'prev',
        lastClass: 'last',
        firstClass: 'first',
        pageClass: 'page',
        activeClass: 'active',
        disabledClass: 'disabled'
      });
    }
  </script>
</head>
<body>
  <div class="container">
    <div class="jumbotron page" id="page1">
      <div class="container">
        <h1 class="display-3">Adding Pagination to your Website</h1>
        <p class="lead">In this article we teach you how to add pagination, an excellent way to navigate large amounts of content, to your website using a jQuery Bootstrap Plugin.</p>
        <p><a class="btn btn-lg btn-success" href="https://www.solodev.com/blog/web-design/adding-pagination-to-your-website.stml" role="button">Learn More</a></p>
      </div>
    </div>
    <div class="jumbotron page" id="page2">
      <h1 class="display-3">Not Another Jumbotron</h1>
      <p class="lead">Cras justo odio, dapibus ac facilisis in, egestas eget quam. Fusce dapibus, tellus ac cursus commodo, tortor mauris condimentum nibh, ut fermentum massa justo sit amet risus.</p>
      <p><a class="btn btn-lg btn-success" href="#" role="button">Sign up today</a></p>
    </div>
    <div class="jumbotron page" id="page3">
      <h1 class="display-3">Data. Data. Data.</h1>
      <p>This example is a quick exercise to illustrate how the default responsive navbar works. It's placed within a <code>.container</code> to limit its width and will scroll with the rest of the page's content.</p>
      <p><a class="btn btn-lg btn-primary" href="../../components/navbar" role="button">View navbar docs »</a></p>
    </div>
    <div class="jumbotron page" id="page4">
      <h1 style="-webkit-user-select: auto;">Buy Now!</h1>
      <p class="lead" style="-webkit-user-select: auto;">Cras justo odio, dapibus ac facilisis in, egestas eget quam. Fusce dapibus, tellus ac cursus commodo, tortor mauris condimentum nibh, ut fermentum massa justo sit amet.</p>
      <p style="-webkit-user-select: auto;"><a class="btn btn-lg btn-success" href="#" role="button" style="-webkit-user-select: auto;">Get started today</a></p>
    </div>
    <div class="jumbotron page" id="page5">
      <h1 class="cover-heading">Cover your page.</h1>
      <p class="lead">Cover is a one-page template for building simple and beautiful home pages. Download, edit the text, and add your own fullscreen background photo to make it your own.</p>
      <p class="lead"><a href="#" class="btn btn-lg btn-primary">Learn more</a></p>
    </div>
    <ul id="pagination-demo" class="pagination-lg pull-right"></ul>
  </div>
  <script>
    // tell the embed parent frame the height of the content
    if (window.parent && window.parent.parent) {
      window.parent.parent.postMessage(["resultsFrame", {
        height: document.body.getBoundingClientRect().height,
        slug: "yw7y4wez"
      }], "*")
    }
  </script>
</body>
</html>

Here is a sample working version of the code:

const puppeteer = require('puppeteer');

async function runScraper() {
  let browser = {};
  let page = {};
  const url = 'http://localhost:8080';

  // open the page and wait
  async function navigate() {
    browser = await puppeteer.launch({ headless: false });
    page = await browser.newPage();
    await page.goto(url);
  }

  async function scrapeData() {
    const headerSel = 'h1';
    // wait for the element (page.waitFor is deprecated in newer Puppeteer releases)
    await page.waitForSelector(headerSel);
    return page.evaluate((selector) => {
      const target = document.querySelector(selector);

      // get the data
      const text = target.innerText;

      // remove element so the waiting function works
      target.remove();
      return text;
    }, headerSel);
  }

  // this is a sample pagination concept
  // it will vary from site to site because not all sites use the same type of pagination

  async function paginate() {
    // manually check if the next button is available or not
    const nextBtnDisabled = !!(await page.$('.next.disabled'));
    if (!nextBtnDisabled) {
      // since it's not disabled, click it
      await page.evaluate(() => document.querySelector('.next').click());

      // short fixed delay so the page can update before the next scrape
      await new Promise((resolve) => setTimeout(resolve, 100));
      return true;
    }
    console.log({ nextBtnDisabled });
    return false;
  }

  /**
   * Scraping Logic
   */
  await navigate();

  // Scrape 5 pages
  for (const pageNum of [...Array(5).keys()]) {
    const title = await scrapeData();
    console.log(pageNum + 1, title);
    await paginate();
  }

  // close the browser so the Node process can exit
  await browser.close();
}

runScraper();

Result:

Server running at 8080
1 'Adding Pagination to your Website'
2 'Not Another Jumbotron'
3 'Data. Data. Data.'
4 'Buy Now!'
5 'Cover your page.'
{ nextBtnDisabled: true }

I did not share the server code; it basically serves the HTML snippet above.
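
On the "how many pages are there in total" part of the question: when the pagination widget renders one link per page, the count can be read from the widget itself. Here is a hedged sketch for the demo markup above. The function name highestVisiblePageNumber is made up for illustration, and the default selector assumes the plugin puts the configured pageClass ('page') on each li inside #pagination-demo; other sites will need a different selector.

```javascript
// Sketch only: `page` is assumed to be a Puppeteer Page. The default
// selector is an assumption based on the demo config (pageClass: 'page').
async function highestVisiblePageNumber(
  page,
  linkSelector = '#pagination-demo li.page',
) {
  // $$eval runs the callback in the browser with all matching elements.
  const labels = await page.$$eval(linkSelector, (items) =>
    items.map((item) => item.textContent.trim()));

  // Keep only positive integer labels; "Next", "Last" etc. become NaN.
  const numbers = labels
    .map((label) => Number(label))
    .filter((n) => Number.isInteger(n) && n > 0);

  // Largest page number currently rendered, or 0 if none was found.
  return numbers.length > 0 ? Math.max(...numbers) : 0;
}
```

This only sees the links that are currently rendered. If a widget shows a sliding window of pages (fewer visible pages than total pages), the number is a lower bound, and looping until the next button is disabled remains the safer approach.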

Use the footerTemplate option together with displayHeaderFooter to show page numbers natively through the Puppeteer API:

await page.pdf({
  path: 'hacks.pdf',
  format: 'A4',
  displayHeaderFooter: true,
  footerTemplate: '<div><div class="pageNumber"></div> <div>/</div><div class="totalPages"></div></div>'
});

https://github.com/puppeteer/puppeteer/blob/master/docs/api.md#pagepdfoptions

footerTemplate: HTML template for the print footer.

// Should be valid HTML markup with the following CSS classes used to inject printing values into them:
// - date: formatted print date
// - title: document title
// - url: document location
// - pageNumber: current page number
// - totalPages: total pages in the document
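
To make the class list above concrete, here is a small sketch that builds a "current / total" footer and the matching options object for page.pdf(). The helper names buildFooterTemplate and buildPdfOptions are hypothetical, and the explicit font size and bottom margin are assumptions on my part: footer text can otherwise come out too small to read or be covered by the page content.

```javascript
// Sketch only: helper names, font size and margin are illustrative choices.
function buildFooterTemplate(fontSizePx = 10) {
  return [
    `<div style="font-size:${fontSizePx}px; width:100%; text-align:center;">`,
    // Chrome fills these two spans in based on their class names.
    '  <span class="pageNumber"></span> / <span class="totalPages"></span>',
    '</div>',
  ].join('\n');
}

// Builds the options object to pass to page.pdf().
function buildPdfOptions(path) {
  return {
    path,
    format: 'A4',
    displayHeaderFooter: true,
    footerTemplate: buildFooterTemplate(),
    // Leave room at the bottom of each sheet for the footer.
    margin: { bottom: '20mm' },
  };
}

// Usage inside an async function, with `page` being a Puppeteer Page:
//   await page.pdf(buildPdfOptions('hacks.pdf'));
```

Note that this numbers the sheets of the generated PDF; it does not count the pages of a JavaScript pagination widget.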
