
Get the original URL instead of a base64 URL from the src attribute

I am trying to get data from this site: https://balkangreenenergynews.com/country/romania/

The problem is that when I try to extract the image URL (image_link) via the "src" attribute, it returns a URL in base64 format.

The output is shown below:

[{link:
 'https://balkangreenenergynews.com/...nsson/',
image_link:
 '...3ZnPg==',
lead_text:
 'Distribution ... farm.',
time: '29 July 2021',
author: '' }, ...]
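
A quick way to see what such a value actually contains is to decode it. The following is a minimal Node.js sketch, assuming the value is a `data:` URI; the helper name `describeDataUri` and the sample value are hypothetical and are not taken from the site.

```javascript
// Split a data: URI into its media type and decoded payload (Node.js).
function describeDataUri(value) {
  const match = /^data:([^;,]*)((?:;[^;,]*)*),(.*)$/s.exec(value);
  if (!match) return null; // not a data: URI, so probably a normal URL
  const [, mime, params, payload] = match;
  const isBase64 = params.split(";").includes("base64");
  const decoded = isBase64
    ? Buffer.from(payload, "base64").toString("utf8")
    : decodeURIComponent(payload);
  return { mime: mime || "text/plain", isBase64, decoded };
}

// Hypothetical sample value, built here rather than copied from the site.
const sample =
  "data:image/svg+xml;base64," + Buffer.from("<svg></svg>").toString("base64");
console.log(describeDataUri(sample));
console.log(describeDataUri("https://example.com/photo.jpg")); // null
```

If the decoded payload turns out to be a tiny inline image rather than the article photo, the src value is likely a placeholder and the real URL has to come from somewhere else in the markup.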

Code:

const scraperObject = {
  url: 'https://balkangreenenergynews.com/country/romania/',
  async scraper(browser){
  let page = await browser.newPage();
  await page.goto(this.url)
  .catch(error => console.error(error));
  try {
    await page.waitForSelector("div.four-boxes.multi-boxes", { visible: true });
    //console.info("Country News Page loaded");
    
    page.on("console", msg =>
      msg.type() === "error"
        ? console.error(msg.text())
        : console.info(msg.text())
    );
    let data = await page.evaluate(() => {
      const articles = document.querySelectorAll("div.bn-box");
      const textContent = elem => (elem ? elem.textContent.trim() : ""); // helper function
      const articleArray = [];
      //let element = await page.$('your selector')
      //await element.evaluate(el => el.textContent)
      articles.forEach(article => {
        
        //console.log(article.querySelector("div.bn-box-img > a img").getAttribute("src"))
        articleArray.push({
          title:
            textContent(article.querySelector("div.bn-box > a > h3")) || "",
          link: article.querySelector("div.bn-box > a")
            ? article.querySelector("div.bn-box > a").getAttribute("href")
            : "",
          image_link: article.querySelector("div.bn-box-img > a > img")
            ? article.querySelector("div.bn-box-img > a > img").getAttribute("src")
            : "",
          lead_text:
            textContent(article.querySelector("div.bn-box > p")).split(' ').slice(4).join(' ') ||
            "",
          time: textContent(article.querySelector("p > strong")) ||
          "",
          author: ""
            //textContent(article.querySelector(".entry-author a")) || ""
        });
      });
      //console.log(articles);
      //return;
      return articleArray;
    });
    console.log(data)

  } catch (error) {
    // Log the actual error instead of swallowing it, so failures can be diagnosed
    console.error("Failed to scrape articles:", error);
  }
}}

How do I get the actual URL, since I am saving these URLs directly to a database?

Puppeteer is bothering us

I was able to reproduce the behavior with your code.

According to the following investigation, it seems that Puppeteer is changing the real HTML.

I tried your code directly in the browser console and got this HTML for the first article:

const articles = document.querySelectorAll("div.bn-box");
articles[0].innerHTML

[image]

But when I ran your Puppeteer code and printed the first article ( console.log(articles[0].innerHTML); ), the HTML for the same article was different:

[image]

I couldn't find anything online about this Puppeteer behavior.
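
One possibility worth checking, which is an assumption and not something confirmed for this site, is that the page lazy-loads its images: the src attribute holds an inline `data:` placeholder until the image scrolls into view, and the real URL sits in another attribute. A minimal sketch of reading such an attribute follows; the attribute names and the helper `pickImageUrl` are hypothetical, so inspect the actual markup first.

```javascript
// Candidate attributes, in priority order. These names are assumptions based on
// common lazy-loading setups; inspect the real markup to see which one applies.
const CANDIDATE_ATTRS = ["data-lazy-src", "data-src", "data-original", "src"];

// Works on any object with getAttribute(name), such as a DOM <img> element.
function pickImageUrl(img, attrs = CANDIDATE_ATTRS) {
  if (!img) return "";
  for (const name of attrs) {
    const value = (img.getAttribute(name) || "").trim();
    // Skip empty values and inline data: placeholders.
    if (value && !value.toLowerCase().startsWith("data:")) return value;
  }
  return "";
}

// Stand-in for a lazy-loaded <img>, so the sketch runs without a browser.
const fakeImg = {
  attrs: {
    src: "data:image/svg+xml;base64,AAAA",
    "data-src": "https://example.com/a.jpg",
  },
  getAttribute(name) {
    return this.attrs[name] ?? null;
  },
};
console.log(pickImageUrl(fakeImg)); // https://example.com/a.jpg
```

To use this in the scraper, the helper and its attribute list would have to be defined inside the page.evaluate callback, because Puppeteer serializes that function and runs it in the page, where outer variables are not available. The image_link field could then be pickImageUrl(article.querySelector("div.bn-box-img > a > img")).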

Just to check

If you open one of the articles and inspect it after it has loaded, you see this:

[image]

I'm not sure, but the origin server could be changing the HTML it returns depending on the client:

  • a real browser used by a human
  • a headless or in-memory browser used for automation (Puppeteer)
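
One way to probe this hypothesis without leaving Puppeteer is to load the page twice, once with the default headless user agent and once with a regular desktop one, and compare the markup. This is only a sketch: the function names, the default selector, and the example user agent string are assumptions, and a server may key on signals other than the user agent.

```javascript
// Example desktop Chrome user agent string; any current browser's string would do.
const DESKTOP_UA =
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

// `page` is a Puppeteer Page; the function name and selector default are hypothetical.
async function firstArticleHtml(page, url, { userAgent, selector = "div.bn-box" } = {}) {
  if (userAgent) await page.setUserAgent(userAgent);
  await page.goto(url, { waitUntil: "networkidle2" });
  // innerHTML of the first element matching the selector
  return page.$eval(selector, el => el.innerHTML);
}

// Compare the markup served with and without the user agent override.
async function compareMarkup(browser, url) {
  const plain = await firstArticleHtml(await browser.newPage(), url);
  const spoofed = await firstArticleHtml(await browser.newPage(), url, {
    userAgent: DESKTOP_UA,
  });
  return { same: plain === spoofed, plain, spoofed };
}
```

Calling compareMarkup(browser, this.url) from the scraper and logging the result would show whether the two responses differ. If they are identical, the difference seen earlier more likely comes from scripts running in the page than from the server.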

Try Selenium instead of Puppeteer

You could use this starter to use Selenium instead of Puppeteer.
