How to make Puppeteer run headless with the Apify SDK?
I am trying to scrape the content of a page using the Apify SDK, and the code below works nicely. But how can I force headless mode through the Apify SDK, as I would with puppeteer.launch({headless: true})?
Code for your reference:
async function scrape(number) {
    let output = { links: [], title: [], content: [] };
    const URL = "https://somepage/";
    process.env.APIFY_LOCAL_STORAGE_DIR = '/someappfolder/apify_storage/run_' + number;
    const requestQueue = await Apify.openRequestQueue(number);
    await requestQueue.addRequest({ url: URL });
    const pseudoUrls = [new Apify.PseudoUrl(URL + "[.*]")];
    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        handlePageFunction: async ({ request, page }) => {
            output.links.push(request.url);
            output.title.push(await page.title());
            output.content.push((await page.content()).length);
            var save = { url: request.url, title: await page.title(), content: (await page.content()).length };
            //sendToAirtable(save);
            console.log(`URL: ${request.url}`);
            await Apify.utils.enqueueLinks({
                page,
                selector: 'a',
                pseudoUrls,
                requestQueue,
            });
        },
        maxRequestsPerCrawl: 10,
        maxConcurrency: 10,
        minConcurrency: 2,
    });
    await crawler.run();
    return output;
};
Add launchPuppeteerOptions: { headless: true } at the same level as requestQueue.
See https://sdk.apify.com/docs/typedefs/launch-puppeteer-options#docsNav
process.env.APIFY_HEADLESS = 1;
After hours of searching, I stumbled upon the answer: https://sdk.apify.com/docs/guides/environment-variables#apify_headless
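A minimal sketch of the environment-variable approach, in case it helps to see it in isolation. The helper name enableApifyHeadless is hypothetical, and the sketch assumes the variable has to be set before the crawler launches its browser. Node.js stores environment variables as strings, so assigning the number 1 ends up as the string '1' anyway; the sketch just makes that explicit.

```javascript
// Hypothetical helper: switches on headless mode through the
// APIFY_HEADLESS environment variable described in the linked guide.
function enableApifyHeadless(env = process.env) {
    // Environment variables are always strings in Node.js.
    env.APIFY_HEADLESS = '1';
    return env.APIFY_HEADLESS;
}

// Assumption: call this before constructing and running the crawler.
enableApifyHeadless();
console.log(`APIFY_HEADLESS=${process.env.APIFY_HEADLESS}`);
```

The same effect can be had without touching the code by exporting APIFY_HEADLESS=1 in the shell that starts the script.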
Neither of the in-code answers here worked for me. After some searching, I found that the following seems to work:
const Apify = require('apify');

Apify.main(async () => {
    const baseurl = 'https://thedomain.youwanna.check.com/somepage';
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: baseurl });
    const options = {
        requestQueue,
        // Puppeteer launch options are nested under launchContext.launchOptions.
        launchContext: {
            launchOptions: {
                headless: true,
                slowMo: 1000, // slow each Puppeteer operation down by 1000 ms
            },
        },
        handlePageFunction: async ({ request, page }) => {
            const title = await page.title();
            console.log(`Title of ${request.url}: ${title}`);
            // Enqueue every link on the page that starts with the base URL.
            await Apify.utils.enqueueLinks({
                requestQueue,
                page,
                pseudoUrls: [baseurl + '[.*]'],
            });
        },
    };
    const crawler = new Apify.PuppeteerCrawler(options);
    await crawler.run();
});
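The disagreement between the answers most likely comes down to SDK versions. As far as I know, the 0.x releases take launchPuppeteerOptions directly on the crawler, while 1.x and later nest the same options under launchContext.launchOptions. The sketch below builds the right option fragment for either case. The helper headlessOptions and its sdkMajorVersion parameter are hypothetical names used for illustration only, not part of the Apify SDK.

```javascript
// Hypothetical helper: returns the crawler option fragment that carries
// the Puppeteer launch options for a given Apify SDK major version.
function headlessOptions(sdkMajorVersion, launchOptions = {}) {
    // headless defaults to true; the caller can still override it.
    const merged = { headless: true, ...launchOptions };
    // Assumption: SDK 1.x and later use launchContext.launchOptions,
    // SDK 0.x uses launchPuppeteerOptions.
    return sdkMajorVersion >= 1
        ? { launchContext: { launchOptions: merged } }
        : { launchPuppeteerOptions: merged };
}

console.log(JSON.stringify(headlessOptions(1, { slowMo: 1000 })));
console.log(JSON.stringify(headlessOptions(0)));
```

The fragment can then be spread into the crawler options, for example new Apify.PuppeteerCrawler({ requestQueue, handlePageFunction, ...headlessOptions(1) }).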
You can add the headless option to launchPuppeteerOptions like this:
const crawler = new Apify.PuppeteerCrawler({
    requestQueue,
    launchPuppeteerOptions: {
        headless: true,
        ignoreHTTPSErrors: true,
        // slowMo: 500,
    },
    maxRequestsPerCrawl: settings.maxurls,
    maxConcurrency: settings.maxcrawlers,
    // ...
});