
Get Original URL instead of base64 url from src attribute

I am trying to get data from this site: https://balkangreenenergynews.com/country/romania/

The problem is that when I try to extract the image URL (image_link) via the "src" attribute, it returns a base64 data URL instead of the original URL.

The output is shown below:

[{link:
 'https://balkangreenenergynews.com/...nsson/',
image_link:
 'data:image/svg+xml;base64,PHN2ZyB4...3ZnPg==',
lead_text:
 'Distribution ... farm.',
time: '29 July 2021',
author: '' }, ...]

Code:

const scraperObject = {
  url: 'https://balkangreenenergynews.com/country/romania/',
  async scraper(browser){
  let page = await browser.newPage();
  await page.goto(this.url)
  .catch(error => console.error(error));
  try {
    await page.waitForSelector("div.four-boxes.multi-boxes", { visible: true });
    //console.info("Country News Page loaded");
    
    page.on("console", msg =>
      msg.type() === "error"
        ? console.error(msg.text())
        : console.info(msg.text())
    );
    let data = await page.evaluate(() => {
      const articles = document.querySelectorAll("div.bn-box");
      const textContent = elem => (elem ? elem.textContent.trim() : ""); // helper function
      const articleArray = [];
      //let element = await page.$('your selector')
      //await element.evaluate(el => el.textContent)
      articles.forEach(article => {
        
        //console.log(article.querySelector("div.bn-box-img > a img").getAttribute("src"))
        articleArray.push({
          title:
            textContent(article.querySelector("div.bn-box > a > h3")) || "",
          link: article.querySelector("div.bn-box > a")
            ? article.querySelector("div.bn-box > a").getAttribute("href")
            : "",
          image_link: article.querySelector("div.bn-box-img > a > img")
            ? article.querySelector("div.bn-box-img > a > img").getAttribute("src")
            : "",
          lead_text:
            textContent(article.querySelector("div.bn-box > p")).split(' ').slice(4).join(' ') ||
            "",
          time: textContent(article.querySelector("p > strong")) ||
          "",
          author: ""
            //textContent(article.querySelector(".entry-author a")) || ""
        });
      });
      //console.log(articles);
      //return;
      return articleArray;
    });
    console.log(data)

  } catch (error) {
    console.log(":(");
    //console.error("No articles found for " + country.slug + error);
  }
}}

How do I get the original image URL? I am saving these URLs directly to a database.

Puppeteer is bothering us

I was able to reproduce your issue.

From what I found, it seems that the HTML Puppeteer sees differs from the HTML a regular browser sees.

I ran your selector directly in the browser console and got this HTML for the first article:

const articles = document.querySelectorAll("div.bn-box");
articles[0].innerHTML

[screenshot]

But when I ran your Puppeteer code and printed the first article ( console.log(articles[0].innerHTML); ), the HTML for the same article was different:

[screenshot]

I could not find anything online about this Puppeteer behavior.

Just to check

If you click on an article and inspect the page after it loads, you see this:

[screenshot]

I am not sure, but the origin page could be changing the response HTML depending on the client:

  • a real browser used by a human
  • a headless or in-memory browser used for automation (Puppeteer)
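
Before switching tools, it may be worth checking whether the original URL is still present on the img element under another attribute. This is only a guess: a base64 SVG in "src" is often a lazy-loading placeholder, with the real URL kept in a data attribute until the image scrolls into view. The sketch below assumes that situation. The attribute names in CANDIDATE_ATTRS and the helpers pickImageUrl and attrsOf are hypothetical and not confirmed for this site; inspect the img element to see which attributes it actually has.

```javascript
// Hypothetical attribute names commonly used by lazy-loading scripts.
// They are NOT confirmed for balkangreenenergynews.com; adjust after inspecting the page.
const CANDIDATE_ATTRS = ["data-src", "data-lazy-src", "data-original", "src"];

// Collect an element's attributes into a plain object: { name: value, ... }.
function attrsOf(el) {
  return Object.fromEntries(Array.from(el.attributes, a => [a.name, a.value]));
}

// Return the first candidate attribute that is not an inline "data:" URL,
// or "" when only a placeholder (or nothing) is available.
function pickImageUrl(attrs) {
  for (const name of CANDIDATE_ATTRS) {
    const value = attrs[name];
    if (typeof value === "string" && value !== "" && !value.startsWith("data:")) {
      return value;
    }
  }
  return "";
}
```

To use this in the question's code, both helpers would have to be defined inside the page.evaluate callback (functions from the Node side are not visible there), and image_link would become something like pickImageUrl(attrsOf(img)) when img is not null. Logging attrsOf(img) for the first article is also a quick way to see what Puppeteer actually receives.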

Try Selenium instead of Puppeteer

You could use this starter to try Selenium instead of Puppeteer.


 