How do I get the whole HTML from an Apify Cheerio crawler?

Question

I want to get the whole HTML, not just the text.

const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({
        url: '<address>', // page URL, left out of the original post
        uniqueKey: makeid(100), // makeid is my own helper (not shown)
    });

    const handlePageFunction = async ({ request, $ }) => {
        // content_to is a Cheerio selection object, not a string
        var content_to = $('.class');
    };

    // Set up the crawler, passing a single options object as an argument.
    const crawler = new Apify.CheerioCrawler({
        requestQueue,
        handlePageFunction,
    });

    await crawler.run();
});

When I try this, the crawler returns a complex object. I know I can extract the text from the content_to variable using .text(), but I need the whole HTML, including the tags. What should I do?

Answer 1

If I understand you correctly, you can just use .html() instead of .text(). That gives you the inner HTML of the element instead of its inner text.
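To make the difference concrete, here is a minimal sketch of a page handler that reads the text, the inner HTML, and the outer HTML of the first match. The helper name extractMarkup is made up for illustration, and '.class' is just the placeholder selector from the question. It assumes the standard Cheerio calls .first(), .text(), .html(), and the static $.html(element).

```javascript
// Hypothetical helper (not part of Apify or Cheerio): gathers the three
// representations of the first element matching `selector`.
// `$` is the Cheerio handle that CheerioCrawler passes to handlePageFunction.
function extractMarkup($, selector) {
    const el = $(selector).first();
    return {
        text: el.text(),       // text only, tags stripped
        innerHtml: el.html(),  // markup inside the element
        outerHtml: $.html(el), // markup including the element's own tag
    };
}

const handlePageFunction = async ({ request, $ }) => {
    // '.class' is the placeholder selector from the question.
    const { innerHtml, outerHtml } = extractMarkup($, '.class');
    console.log(request.url, innerHtml, outerHtml);
};
```

Note that .html() on a selection only returns the markup of the first matched element, so iterate with .each() if the selector matches several.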

Another thing to mention: you can also add body to the handlePageFunction argument object: const handlePageFunction = async ({ request, body, $ }) => {

body will contain the whole raw HTML of the page.
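As a rough sketch of the body approach, the handler below keeps the raw response next to Cheerio's own serialization of the parsed document. The pages object and its field names are assumptions made only for this example. As far as I know, body can arrive as a string or a Buffer depending on the content type, so the sketch converts it defensively.

```javascript
// Illustrative in-memory store keyed by URL (an assumption for this sketch;
// a real actor would more likely push results to a dataset).
const pages = {};

const handlePageFunction = async ({ request, body, $ }) => {
    // Convert to a string in case body was delivered as a Buffer.
    const rawHtml = typeof body === 'string' ? body : body.toString();
    pages[request.url] = {
        rawHtml,               // the page exactly as the server sent it
        parsedHtml: $.html(),  // the document as re-serialized by Cheerio
    };
};
```

The two strings can differ slightly, because Cheerio normalizes the markup when it parses it.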

Answer 1
1 Accepted 2020-12-25 14:40:47
