Web scraping returns a URI, not the URL of the image (JavaScript, Cheerio)

I'm using Cheerio and request to scrape image URLs. I keep getting a URI when I want the URL. What can I change to fix this?

const request = require('request-promise');
const cheerio = require('cheerio');

(async () => {
    const webUrl = 'https://www.redbubble.com/lists/9747201/favorites';

    // Fetch the raw HTML; no page scripts run at this point
    const response = await request(webUrl);
    const $ = cheerio.load(response);

    // .attr('src') returns the src of the first matching <img>
    const sticker = $('img[class="styles__image--2CwxX styles__rounded--1lyoH styles__fluid--3dxe-"]').attr('src');

    console.log(sticker);
})();

It keeps returning

""

when it should return

https://ih1.redbubble.net/image.479946364.2928/st,medium,507x507-pad,600x600,f8f8f8.u7.jpg

This is because the images in the page source really do carry that value. It seems the site serves each src as a base64-encoded value (an encoding rather than true encryption) and swaps in the real URL with JavaScript once the page has loaded. Cheerio only parses the HTML it is given and never runs that JavaScript, so it sees the placeholder.
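
To confirm which kind of value a scraped src holds, a small check along these lines may help. This is only a minimal sketch: the function name classifySrc is hypothetical and is not part of Cheerio or request.

```javascript
// Hypothetical helper: classify an image src value as an inline data URI,
// a fetchable http(s) URL, an empty value, or something else.
function classifySrc(src) {
  if (typeof src !== 'string' || src.trim() === '') return 'empty';
  if (/^data:/i.test(src)) return 'data-uri';
  try {
    // new URL() throws on relative paths and malformed values
    const { protocol } = new URL(src);
    return protocol === 'http:' || protocol === 'https:' ? 'url' : 'other';
  } catch {
    return 'other';
  }
}
```

If the value scraped by Cheerio is classified as 'data-uri' or 'empty', the real URL is most likely filled in by client-side script.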

You have a better chance of scraping the contents with puppeteer, which provides a high-level API to control a browser (headless or not). You can simply wait for the browser to finish loading the page, and then scrape the contents you want.
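
A rough sketch of the puppeteer approach could look like the following. It assumes the puppeteer package is installed; the function names keepHttpUrls and scrapeImageUrls are hypothetical, and the broad 'img' selector is an assumption used in place of the hashed class names from the question, which may change between deployments.

```javascript
// Hypothetical helper: keep only absolute http(s) URLs,
// dropping data URIs and empty values.
function keepHttpUrls(srcs) {
  return srcs.filter(
    (src) => typeof src === 'string' && /^https?:\/\//i.test(src)
  );
}

// Hypothetical function: load the page in a headless browser so that the
// page's own scripts run, then collect the src of every <img>.
async function scrapeImageUrls(pageUrl) {
  // Required lazily so keepHttpUrls can be used without puppeteer installed
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // 'networkidle2' waits until there are at most 2 open network
    // connections for at least 500 ms
    await page.goto(pageUrl, { waitUntil: 'networkidle2' });
    // Runs in the browser context and returns the resolved src values
    const srcs = await page.$$eval('img', (imgs) => imgs.map((img) => img.src));
    return keepHttpUrls(srcs);
  } finally {
    await browser.close();
  }
}
```

Calling scrapeImageUrls('https://www.redbubble.com/lists/9747201/favorites') should then resolve to an array of image URLs, although images that load only on scroll may still be missing.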

Another alternative is to read through the uglified JS source code of the page you're trying to scrape, and look for the portion where it performs the decoding.

UPDATE:

You may not need cheerio or puppeteer at all. Upon checking the XHR requests made by the page itself, I found that it uses a GraphQL API to get all those images and contents. Inspect that request in your browser's network tab and reproduce it to get the results you need.
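
As a rough sketch of replaying such a request from Node, something like the following might work. Everything specific here is an assumption: the endpoint, the query text, the variables, and any required headers are not given above and must be copied from the request shown in the browser's network tab. The function names are hypothetical, and the global fetch requires Node 18 or later.

```javascript
// Hypothetical helper: build fetch options for a GraphQL POST request.
// GraphQL over HTTP conventionally sends a JSON body with `query` and `variables`.
function buildGraphqlRequest(query, variables = {}) {
  return {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query, variables }),
  };
}

// Hypothetical function: send the query and return the `data` part of the response.
// `endpoint` is whatever URL the page's own XHR request targets.
async function runGraphqlQuery(endpoint, query, variables) {
  const response = await fetch(endpoint, buildGraphqlRequest(query, variables));
  if (!response.ok) {
    throw new Error(`GraphQL request failed: HTTP ${response.status}`);
  }
  const payload = await response.json();
  // GraphQL servers report query-level failures in an `errors` array
  if (payload.errors) {
    throw new Error(JSON.stringify(payload.errors));
  }
  return payload.data;
}
```

The site may also expect cookies or extra headers that the browser sends automatically, so the first attempt may be rejected until those are copied over as well.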
