
Web scraping company data with Puppeteer

I am trying to get company data from SimilarWeb, but after my script makes a lot of requests the site recognizes it as a bot. Is there any way to get past this check? Alternatively, can you suggest another website that is easier to scrape for this kind of data? We can't use LinkedIn, by the way.

const puppeteer = require("puppeteer");

const searchCompany = "zoominfo.com";
const Link = `https://www.similarweb.com/website/${searchCompany}/#overview`;

(async function () {
  let browserOpen;
  try {
    browserOpen = await puppeteer.launch({
      headless: false,
      // dumpio: true,
      // args: ["--start-maximized"],
      defaultViewport: null,
    });
    const newTab = await browserOpen.newPage();
    await newTab.goto(Link);
    await newTab.screenshot({ path: "sc.png" });

    // Wait until the company info rows are rendered before reading them.
    await newTab.waitForSelector(".data-company-info__row");

    const ans = await newTab.evaluate(() => {
      // Query the rows once, then pick the fields by position.
      const rows = document.querySelectorAll(".data-company-info__row");
      const name = rows[0].textContent;
      const location = rows[3].textContent;
      const industry = rows[5].textContent;
      return { name, location, industry };
    });
    console.log(ans);
  } catch (err) {
    console.log(err);
  } finally {
    // Close the browser even when a step above throws.
    if (browserOpen) await browserOpen.close();
  }
})();

Just out of curiosity - what do you use the SimilarWeb data for?

You can try https://github.com/bda-research/node-crawler, which has parameters for delays and maximum connections.
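To make the idea of delays and a connection cap concrete, here is a minimal sketch in plain JavaScript with no dependencies. It is not node-crawler's API: `runThrottled`, `sleep`, `maxConnections` and `delayMs` are hypothetical names chosen for illustration, and the defaults are arbitrary. It may not be enough on its own to avoid the bot check.

```javascript
// Hypothetical helper, not part of puppeteer or node-crawler.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Runs an array of async task functions with at most `maxConnections`
// in flight at once, pausing `delayMs` after each task a worker finishes.
async function runThrottled(tasks, { maxConnections = 2, delayMs = 1000 } = {}) {
  const results = new Array(tasks.length);
  let next = 0; // index of the next unclaimed task

  // Each worker claims tasks one at a time until none are left.
  async function worker() {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
      await sleep(delayMs); // pause before this worker takes another task
    }
  }

  const workerCount = Math.min(maxConnections, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, worker));
  return results; // same order as `tasks`
}

// Usage sketch (scrapeCompany is a hypothetical function that wraps the
// puppeteer steps from the question for one domain):
// const domains = ["zoominfo.com"];
// const tasks = domains.map((d) => () => scrapeCompany(d));
// runThrottled(tasks, { maxConnections: 1, delayMs: 3000 }).then(console.log);
```

Each task is passed as a function rather than a promise so that no request starts until a worker is free to run it.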


 