How do you get all the links from a page with node puppeteer?

I'm trying to build a web crawler with node and came across the puppeteer package, which looks perfect for what I want. My end goal is to gather all the links from a page, all of its text content, and a screenshot of the page itself.

I ran the following, and it appears to gather a large number of links; however, on inspecting the site, there are links that it does not gather.

const puppeteer = require('puppeteer');

module.exports = () => {
  (async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://pixabay.com/en/columbine-columbines-aquilegia-3379045/');
    await page.screenshot({ path: 'myscreenshot.png', fullPage: true });

    // '*' matches the root <html> element first, so this reads the whole page's text.
    let text = await page.$eval('*', el => el.innerText.split(' '));
    // Strip everything except word characters and whitespace from each token.
    text = text.map(string => string.replace(/[^\w\s]/gi, ''));

    // Collect the href of every <a> element on the page.
    const hrefs = await page.evaluate(() => {
      const links = Array.from(document.querySelectorAll('a'));
      return links.map(link => link.href);
    });
    console.log('done');

    await browser.close();
  })();
};

For example, the link /go/?t=image-details-shutterstock&id=699165328 is nowhere in the array of hrefs. What's worse, these are links that lead out of the site, which is exactly the kind of link I want to follow; otherwise I'm stuck crawling only the one site.

Is there a reason my script is only showing some of the links? Is the querySelector too narrow, or is it rejecting certain links?

Those links are generated by an onclick event; the target is stored in the data-go attribute, for example:

<a data-go="image-details-shutterstock&amp;id=458320033">

You only need to prepend /go/?t= to that value to get the link. To read the attribute, fall back to data-go when href is empty:

return links.map(link => link.href || link.getAttribute('data-go'));
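The one-liner above returns the raw data-go value, so the /go/?t= prefix still has to be added afterwards. A minimal sketch of the whole step follows, assuming the prefix is exactly as described in this answer. The helper names `resolveLinks`, `toAbsolute` and `collectLinks` are hypothetical and are not part of Puppeteer; only `page.$$eval` and `page.url()` are Puppeteer calls.

```javascript
// Prefix taken from the answer above; the helpers below are illustrative only.
const GO_PREFIX = '/go/?t=';

// Resolve a possibly relative URL against the page URL; return null if it cannot be parsed.
function toAbsolute(raw, baseUrl) {
  try {
    return new URL(raw, baseUrl).href;
  } catch (err) {
    return null;
  }
}

// anchors: array of { href, dataGo } holding the raw attribute values (null when absent).
function resolveLinks(anchors, baseUrl) {
  const urls = [];
  for (const { href, dataGo } of anchors) {
    let url = null;
    if (href) {
      url = toAbsolute(href, baseUrl);
    } else if (dataGo) {
      // Links built by the onclick handler: prepend the prefix to the data-go value.
      url = toAbsolute(GO_PREFIX + dataGo, baseUrl);
    }
    // Anchors with neither attribute (for example icon-only menu links) are skipped.
    if (url) urls.push(url);
  }
  // Drop duplicates while keeping the original order.
  return [...new Set(urls)];
}

// Assumed usage with a Puppeteer `page` that has already navigated to the target URL.
async function collectLinks(page) {
  const anchors = await page.$$eval('a', els =>
    els.map(el => ({
      href: el.getAttribute('href'),
      dataGo: el.getAttribute('data-go'),
    }))
  );
  return resolveLinks(anchors, page.url());
}
```

Reading the raw attributes in the browser and resolving them in Node keeps the URL logic in a plain function that can be checked without launching a browser. Whether every data-go link on the site uses the same prefix is an assumption worth verifying against the page's own onclick script.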

There are also empty links for menu items, such as:

<a><i class="icon icon_menu_user"></i></a>

