Get Original URL instead of base64 url from src attribute

Question

I am trying to get data from this site: https://balkangreenenergynews.com/country/romania/

The problem is that when I extract the image URL (image_link) from the "src" attribute, it returns a base64 data URI instead of the real URL.

The output is shown below:

[{link:
 'https://balkangreenenergynews.com/...nsson/',
image_link:
 'data:image/svg+xml;base64,PHN2ZyB4...3ZnPg==',
lead_text:
 'Distribution ... farm.',
time: '29 July 2021',
author: '' }, ...]

Code:

const scraperObject = {
  url: 'https://balkangreenenergynews.com/country/romania/',
  async scraper(browser){
  let page = await browser.newPage();
  await page.goto(this.url)
  .catch(error => console.error(error));
  try {
    await page.waitForSelector("div.four-boxes.multi-boxes", { visible: true });
    //console.info("Country News Page loaded");
    
    page.on("console", msg =>
      msg.type() === "error"
        ? console.error(msg.text())
        : console.info(msg.text())
    );
    let data = await page.evaluate(() => {
      const articles = document.querySelectorAll("div.bn-box");
      const textContent = elem => (elem ? elem.textContent.trim() : ""); // helper function
      const articleArray = [];
      //let element = await page.$('your selector')
      //await element.evaluate(el => el.textContent)
      articles.forEach(article => {
        
        //console.log(article.querySelector("div.bn-box-img > a img").getAttribute("src"))
        articleArray.push({
          title:
            textContent(article.querySelector("div.bn-box > a > h3")) || "",
          link: article.querySelector("div.bn-box > a")
            ? article.querySelector("div.bn-box > a").getAttribute("href")
            : "",
          image_link: article.querySelector("div.bn-box-img > a > img")
            ? article.querySelector("div.bn-box-img > a > img").getAttribute("src")
            : "",
          lead_text:
            textContent(article.querySelector("div.bn-box > p")).split(' ').slice(4).join(' ') ||
            "",
          time: textContent(article.querySelector("p > strong")) ||
          "",
          author: ""
            //textContent(article.querySelector(".entry-author a")) || ""
        });
      });
      //console.log(articles);
      //return;
      return articleArray;
    });
    console.log(data)

  } catch (error) {
    console.log(":(");
    //console.error("No articles found for " + country.slug + error);
  }
}}

How can I get the original image URL? I am saving these URLs directly to a database.

Answer 1

Puppeteer is bothering us

I was able to replicate your code:

https://github.com/jrichardsz/dokku-puppeteer-example/blob/rare-puppeter-behavior/app.js

Based on the following investigation, it seems that the HTML seen by Puppeteer differs from the HTML seen in a regular browser.

I ran your selector directly in the browser console and got this HTML for the first article:

const articles = document.querySelectorAll("div.bn-box");
articles[0].innerHTML

But when I ran your Puppeteer code and printed the first article ( console.log(articles[0].innerHTML); ), the HTML for the same article was different:

I could not find anything online about this Puppeteer behavior.

Just to check

If you click on an article and inspect it after it loads, you see this:

I am not sure, but the origin server could be returning different HTML depending on the client:

a real browser used by a human
a headless or in-memory browser used for automation (Puppeteer)
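
Another possibility worth checking, and this is only an assumption about this site, is that the images are lazy-loaded: the page puts a placeholder data URI in "src" and keeps the real URL in another attribute until the image scrolls into view. The sketch below shows a hypothetical helper, pickImageUrl, that returns the first attribute value that is not a data URI. The attribute names in LAZY_ATTRS are common lazy-loading conventions, not names confirmed for this page, so inspect the img element first and adjust the list.

```javascript
// Hypothetical helper: return the first attribute value that is a real URL.
// ASSUMPTION: the attribute names below are common lazy-loading conventions;
// they have not been confirmed for balkangreenenergynews.com.
const LAZY_ATTRS = ["data-src", "data-lazy-src", "data-original", "src"];

function pickImageUrl(attrs) {
  for (const name of LAZY_ATTRS) {
    const value = attrs[name];
    // Skip missing values and inline placeholders such as data:image/svg+xml;base64,...
    if (value && !value.startsWith("data:")) {
      return value;
    }
  }
  return ""; // same empty-string fallback as the original scraper
}

// Usage inside the page.evaluate callback (the helper and LAZY_ATTRS must be
// defined inside that callback, because it runs in the page context):
//
//   const img = article.querySelector("div.bn-box-img > a > img");
//   const attrs = img
//     ? Object.fromEntries([...img.attributes].map(a => [a.name, a.value]))
//     : {};
//   articleArray.push({ /* ...other fields... */ image_link: pickImageUrl(attrs) });
```

If none of the attributes holds a real URL, printing img.outerHTML from inside page.evaluate shows exactly which attributes the page sets, which also helps compare the Puppeteer HTML with the browser HTML.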

Try Selenium instead of Puppeteer

You could use this starter to try Selenium instead of Puppeteer.
