
Node Js & Puppeteer - How to select text wrapped inside an Anchor tag

I'm working on a project at the moment, have run into an error, and need your help!

Basically, I am trying to select the text wrapped inside the following anchor tag:

<a href="..." class="productDetailsLink js-productName">Product Name</a>

This is my current code:

 await page.waitForSelector('div > div > div > div > div > a[class = "productDetailsLink js-productName"')
        .then(() => page.evaluate(() => {
            const itemArray = [];
            const itemNodeList = document.querySelectorAll('div > div > div > div > div > a[class = "productDetailsLink js-productName"');
            

            itemNodeList.forEach(item => {
                const itemTitle = item.querySelectorAll('div > div > div > div > div > a[class = "productDetailsLink js-productName"').innerText;
                console.log(itemTitle);
            })
        } ))

However, I'm not having any luck. I've run out of ideas on how to scrape such text.

I'm not sure how Puppeteer works, but I've had great success using cheerio (https://www.npmjs.com/package/cheerio) to parse HTML scraped with phantom.

I think you can use Puppeteer, like phantom, for the scraping and then run cheerio on the scraped HTML content, like this:

const cheerio = require('cheerio');
const $ = cheerio.load(content); // content is the scraped HTML, e.g. from await page.content()
const result = $('.productDetailsLink').text();

If those class attributes are unique to that particular anchor, <a href="..." class="productDetailsLink js-productName">Product Name</a>, the following method could be used:

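await page.evaluate(() => {
 let anchorText = document.querySelector('a.productDetailsLink.js-productName').innerHTML;
 // Runs in the page, so this logs to the browser console, not the Node terminal
 console.info("anchorText::", anchorText);
});

/* OR another way: $eval returns the value to Node */
const anchorText = await page.$eval('a.productDetailsLink.js-productName', e => e.innerHTML);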

If there is a list of anchors:

await page.evaluate(() => {
 let anchorList = document.querySelectorAll('a.productDetailsLink.js-productName');
 anchorList.forEach(e => {
  let anchorText = e.innerHTML;
  console.info("anchorText::", anchorText);
 });
});
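
The callbacks above run inside the page, so their console output appears in the browser console rather than in the Node terminal. As a minimal sketch of one way to bring the text back into Node, `page.$$eval` can map over every matching anchor and return plain strings. The helper name `getAnchorTexts` is hypothetical, and `page` is assumed to be a Puppeteer Page that has already navigated to the listing.

```javascript
// Hypothetical helper, not part of Puppeteer: collects the visible text of
// every anchor that carries both classes.
// Assumes `page` is a Puppeteer Page that has already loaded the target URL.
const ANCHOR_SELECTOR = 'a.productDetailsLink.js-productName';

async function getAnchorTexts(page, selector = ANCHOR_SELECTOR) {
  // Wait until at least one matching anchor exists in the DOM.
  await page.waitForSelector(selector);
  // The callback runs in the browser; its return value is serialized back to Node.
  return page.$$eval(selector, anchors => anchors.map(a => a.innerText.trim()));
}
```

Called as `const names = await getAnchorTexts(page);`, this should give an array of anchor texts that can be logged from Node with `console.log(names)`.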

.innerText worked for me (not .text or .innerHTML).

Credit: I saw it here: https://learnscraping.com/nodejs-web-scraping-with-puppeteer/

For the selector: choose Inspect, then Copy -> JS path.

Below, I copied the JS path of the "Advanced help" link here:

document.querySelector("#mdhelp-tabs > li.float-right > a")

Yes, it comes with "document.querySelector" and is all ready to paste into the Puppeteer Node.js code.
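
One caveat worth sketching: the copied JS path refers to `document`, which exists only in the browser, so in Puppeteer it has to run inside `page.evaluate`, or the bare selector string has to be passed to something like `page.$eval`. The hypothetical helper below, `selectorFromJsPath`, simply strips the `document.querySelector(...)` wrapper. It assumes the path was copied in exactly that single-call form.

```javascript
// Hypothetical helper: extracts the CSS selector from a copied "JS path".
// Assumes the input is a single document.querySelector("...") call.
function selectorFromJsPath(jsPath) {
  const match = /^document\.querySelector\((["'`])([\s\S]*)\1\)$/.exec(jsPath.trim());
  if (!match) {
    throw new Error('Not a plain document.querySelector(...) path: ' + jsPath);
  }
  return match[2]; // the text between the quotes
}

const selector = selectorFromJsPath('document.querySelector("#mdhelp-tabs > li.float-right > a")');
console.log(selector); // prints: #mdhelp-tabs > li.float-right > a
```

The resulting string could then be used as `await page.$eval(selector, e => e.innerText)`.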
