How do you get all the links from a page with Node Puppeteer?

Question

I'm trying to build a web crawler with Node and came across the puppeteer package, which looks perfect for what I want. My end goal is to gather all the links from a page, all of its text content, and then a screenshot of the page itself.

I ran the following and it appears to gather a large number of links; however, on actual inspection of the site there are links that it is not gathering.

const puppeteer = require('puppeteer');

module.exports = () => {
  (async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://pixabay.com/en/columbine-columbines-aquilegia-3379045/');
    await page.screenshot({ path: 'myscreenshot.png', fullPage: true });
    let text = await page.$eval('*', el => el.innerText.split(' '));
    text = text.map(string => {
      return string.replace(/[^\w\s]/gi, '');
    });

    let hrefs = await page.evaluate(() => {
      const links = Array.from(document.querySelectorAll('a'));
      return links.map(link => link.href);
    });
    console.log('done');

    await browser.close();
  })();
};

For example, this link: /go/?t=image-details-shutterstock&id=699165328 is nowhere in the array of hrefs. What's worse, these are links that lead out of the site, which is exactly the type of link I want to follow; otherwise I'm stuck crawling only the one site.

Is there a reason my script is only showing some of the links? Is the selector too narrow, or is it rejecting certain links?

Answer 1

Those links are generated by an onclick handler; the target is stored in a data-go attribute, for example:

<a data-go="image-details-shutterstock&amp;id=458320033">

You only need to prepend /go/?t= to that value to get the link. To pick up the attribute when there is no href:

return links.map(link => link.href || link.getAttribute('data-go'));

There are also empty links with no href at all, used for menu items, such as:

<a><i class="icon icon_menu_user"></i></a>
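
The pieces above can be put together roughly as follows. This is only a sketch: it assumes the data-go value needs nothing more than the /go/?t= prefix, the helper names resolveLink and collectLinks are hypothetical, and anchors with neither an href nor a data-go attribute (like the menu icon above) are simply dropped.

```javascript
// Hypothetical helper: turn one anchor's attributes into an absolute URL.
// Returns null for anchors that carry neither href nor data-go.
function resolveLink({ href, dataGo }, origin) {
  if (href) return href;
  // Assumption: the site builds the outbound link as /go/?t=<data-go value>.
  if (dataGo) return new URL('/go/?t=' + dataGo, origin).href;
  return null;
}

// Hypothetical wrapper: `page` is assumed to be a Puppeteer Page that has
// already navigated to the target URL.
async function collectLinks(page) {
  // Read both attributes inside the browser context.
  const anchors = await page.$$eval('a', els =>
    els.map(a => ({ href: a.href, dataGo: a.getAttribute('data-go') }))
  );
  const origin = new URL(page.url()).origin;
  // Resolve each anchor and discard the ones that produced no URL.
  return anchors.map(a => resolveLink(a, origin)).filter(Boolean);
}
```

In the script from the question, this could replace the page.evaluate call with something like `const hrefs = await collectLinks(page);`.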
