How to get text inside <a href> tag without the link in href, with Puppeteer

I am trying to scrape some data inside an <a> tag, but I do not want to get the link that is inside it.

I am not really sure how to approach the problem, since the tags do not have IDs or classes.

<div id="list-section">
    <ul>
        <li data-store-id="1234">
            <div class="item">
                <p>
                    <strong>
                    <a target="_blank" href="www.somelink.com"> NAME ONE </a>
                    </strong>
                </p>
            </div>
        </li>
        <li data-store-id="1234">
            <div class="item">
                <p>
                    <strong>
                    <a target="_blank" href="www.somelink.com"> NAME TWO </a>
                    </strong>
                </p>
            </div>
        </li>
    </ul>
</div>

I am trying to have every name in an array at the end [NAME ONE, NAME TWO] etc.

Edit: using node with puppeteer

There is a very useful way to find elements when web scraping, called XPath. I have never worked with Puppeteer, but I've worked a lot with Selenium recently and used XPath a lot.

After a quick look at the Puppeteer docs, I found something that could be useful for you:

https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#pagexexpression

Since I don't have the full HTML page, I could only write a simple XPath to demonstrate its power:

//div[@class='item']//a
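
As a rough sketch of how that expression could be used from Node, the helper below evaluates it inside the page with the standard DOM document.evaluate call, so it does not rely on any particular Puppeteer XPath method (the linked page.$x may not be available in every Puppeteer version). The name textsByXPath is made up for this example, and page is assumed to be a Puppeteer page that has already been opened on the target site.

```javascript
// Hypothetical helper: returns the trimmed text of every node matched by an XPath.
// `page` is assumed to be a Puppeteer Page that has already navigated to the target URL.
async function textsByXPath(page, xpath) {
  // The callback runs inside the browser, so it may only use its own arguments
  // and browser globals such as `document` and `XPathResult`.
  return page.evaluate((expr) => {
    const result = document.evaluate(
      expr,
      document,
      null,
      XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
      null
    );
    const texts = [];
    for (let i = 0; i < result.snapshotLength; i++) {
      // textContent is the text between the tags; the href attribute is not part of it.
      texts.push(result.snapshotItem(i).textContent.trim());
    }
    return texts;
  }, xpath);
}
```

Calling await textsByXPath(page, "//div[@class='item']//a") on the markup from the question should give the two names without their links.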

You can also test an XPath by opening the "Elements" tab in Google Chrome DevTools and pressing CTRL+F.

It's a nice tool to have when web scraping.

You can have the names in an array in two steps:

  • Select the anchor tags <a>...</a>
  • Get their inner HTMLs

As Douglas mentioned before, you can use XPath, but in this case simple CSS selectors will do the job just fine. Many selector combinations can get you the anchor tags: #list-section a, ul a, ...

Choose the one that suits you best and is least likely to break later. I recommend using the first one:

const anchorTags = await page.$$("#list-section a")

As to getting the inner HTML of an element, this SO question will definitely help you. My preferred approach is to have a separate asynchronous function defined as follows:

async function getInnerHtml(page, target){
  const innerHTML = await page.evaluate(el => el.innerHTML, target)
  return innerHTML
}

This way you can loop over your array and call it on each of your anchor tags.
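
Putting the two steps together might look like the sketch below. It reads textContent rather than innerHTML and trims it, on the assumption that only the plain names are wanted. The names extractNames, getNamesByLooping and getNames are invented for this example, and page is assumed to be a Puppeteer page already pointed at the target site.

```javascript
// Runs in the browser: maps anchor elements to their trimmed text.
// It must stay self-contained, because Puppeteer serializes it before sending it to the page.
function extractNames(anchors) {
  return anchors.map((a) => a.textContent.trim());
}

// Variant 1: select the element handles, then loop over them as described above.
async function getNamesByLooping(page) {
  const anchorTags = await page.$$("#list-section a");
  const names = [];
  for (const tag of anchorTags) {
    const text = await page.evaluate((el) => el.textContent, tag);
    names.push(text.trim());
  }
  return names;
}

// Variant 2: select and extract in a single call with page.$$eval.
async function getNames(page) {
  return page.$$eval("#list-section a", extractNames);
}
```

Both variants should return the same array of names; the second one simply avoids one browser round trip per anchor.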

Don't forget that there are always many ways to build a scraper. It seems to me that you focused too much on the element itself and wanted to select it precisely. It also helps to have a good grasp of CSS selectors, especially CSS combinators.

Cheers
