How Do I Select RSS Feed Item Element Using Puppeteer?

I am trying to consume an RSS feed using Puppeteer and output a new feed, but so far the value of every element in the new feed is "undefined". At first I thought this was because the elements have no attributes, but querySelectorAll also seems to yield undefined when it should be grabbing the content of every item element in the feed. This is how I am querying the original feed:

await page.goto(url, {waitUntil: 'networkidle2'});
let rssitems = await page.evaluate(() => {
    let results;
    let items = document.querySelectorAll('item');
    items.forEach((item) => {

        results += '<title>' + item.querySelector('title').innerText + '</title>';
        results += '<description>' + item.querySelector('description').innerText + '</description>';
        results += '<link>' + item.querySelector('link').innerText + '</link>';
        results += '<guid>' + item.querySelector('guid').innerText + '</guid>';
        results += '<pubDate>' + item.querySelector('pubDate').innerText + '</pubDate>';
    });
    return results;
});

If I understand correctly, you are trying to interact with an RSS document. RSS is XML, not HTML, so you need the API of the Node classes, not the HTMLElement classes. Instead of HTMLElement.innerText, which is undefined on XML elements, you can try Node.textContent.
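To illustrate the suggestion, here is a minimal sketch of the same loop using textContent. The function name `collectItems` and its `doc` parameter are hypothetical, introduced only so the logic can be tried outside the browser. The sketch also starts `results` as an empty string, since concatenating onto an uninitialized variable prefixes the output with "undefined". It assumes the browser exposes the feed's item elements to querySelectorAll.

```javascript
// Hypothetical helper: builds the item markup from a parsed RSS document.
// `doc` defaults to the page's `document` when run inside page.evaluate().
function collectItems(doc = document) {
    const tags = ['title', 'description', 'link', 'guid', 'pubDate'];
    let results = ''; // empty string, so nothing is prefixed with "undefined"
    doc.querySelectorAll('item').forEach((item) => {
        for (const tag of tags) {
            const node = item.querySelector(tag);
            // textContent is defined on every Node, including XML elements;
            // a missing child element yields an empty string here.
            const text = node ? node.textContent : '';
            results += `<${tag}>${text}</${tag}>`;
        }
    });
    return results;
}
```

With Puppeteer this could be called as `const rssitems = await page.evaluate(collectItems);`, because the function is serialized and run in the page, where `doc` falls back to the page's `document`.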

I ended up going with an RSS parser. This code uses Puppeteer to take a screenshot of each image specified by the url attribute of an enclosure element. This is useful for scraping sites that do not allow images to be downloaded using curl requests.

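const puppeteer = require('puppeteer');
const Parser = require('rss-parser');
const parser = new Parser();
// Assumptions: `url` (the feed URL) and `host` (the base URL for saved images) are defined elsewhere,
// and this snippet runs inside an async function so that `await` is allowed.
let rssitems = await (async () => {
        let results = ''; // start from an empty string so the output is not prefixed with "undefined"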
        let feed = await parser.parseURL(url);
        feed.items.forEach(item => {
            let enclurl = item.enclosure.url;
            const filepath = './images/' + unique() + '.jpg';
            takeScreenshot(enclurl, filepath)
                .then(() => {
                    console.log("Screenshot taken");
                })
                .catch((err) => {
                    console.log("Error occurred!");
                    console.dir(err);
                });
            results += '<title><![CDATA[' + item.title.trim()  + ']]></title>';
            results += '<description><![CDATA[<img src="' + host + filepath.slice(1) + '">' + item.content.trim() + ']]></description>';
            results += '<link>' + item.link.trim() + '</link>';
            results += '<guid>' + item.guid.trim() + '</guid>';
            results += '<pubDate>' + item.pubDate.trim() + '</pubDate>';
        });
        return results;
    })();
// Opens the image URL in a headless browser and saves a screenshot to filepath.
async function takeScreenshot(enclurl, filepath) {
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();
        await page.goto(enclurl, {waitUntil: 'networkidle2'});
        await page.screenshot({path: filepath});
        await page.close();
    } finally {
        // Close the browser even if navigation or the screenshot fails.
        await browser.close();
    }
}

// Returns a random hexadecimal identifier used as the image file name.
function unique() {
    return 'xxxxxxxx-xxxx-xxxx-xxxxxxxxxxxx'.replace(/[x]/g, function() {
        return (Math.random() * 16 | 0).toString(16);
    });
}
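One thing to be aware of: forEach does not wait for the promises returned by takeScreenshot, so the feed string can be returned before the image files exist. If that matters, one option is to collect the promises and await them all. This is a minimal sketch, not part of the original answer. `buildItems` and its injected `takeScreenshot` and `makePath` parameters are hypothetical names used for illustration, and only the title and link elements are built to keep it short.

```javascript
// Hypothetical helper: builds item markup and waits for every screenshot.
// `takeScreenshot(url, filepath)` must return a promise; `makePath()` returns
// the file path to save to (for example './images/' + unique() + '.jpg').
async function buildItems(items, takeScreenshot, makePath) {
    const parts = await Promise.all(items.map(async (item) => {
        const filepath = makePath();
        try {
            await takeScreenshot(item.enclosure.url, filepath);
        } catch (err) {
            // A failed screenshot is logged but does not drop the item.
            console.error('Screenshot failed for', item.enclosure.url, err);
        }
        return '<title><![CDATA[' + item.title.trim() + ']]></title>' +
               '<link>' + item.link.trim() + '</link>';
    }));
    // Promise.all preserves the order of the input items.
    return parts.join('');
}
```

Like the original loop, this still starts all screenshots at once; a plain for...of loop with await would run them one at a time instead.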