
How to scrape Instagram post URLs using Puppeteer (Node.js application)

With all the changes to the current Instagram API, I was trying to build a scraper. After some looking around I found Puppeteer. Although it seems really straightforward, I am running into a problem I can't seem to wrap my head around.

The problem is the following: I know what the div class of a post is (.v1Nh3.kIKUG._bz0w) and how to query for it (elements = await page.$$('.v1Nh3.kIKUG._bz0w');).

If I understand the $$ function correctly, this should return a promise containing an array of all the posts on 'page'.

My first question is whether this assumption is correct, and my second is how I can get the array out of it. (And, if that all works, how do I get the redirect URL contained in the child href?)

To get elements with a certain class and return data from them, you can use the page.evaluate method. This is an asynchronous call that returns a promise.

So, in your use case, it should look like this:

const result = await page.evaluate(() => {
    let elements = document.querySelectorAll('.v1Nh3.kIKUG._bz0w');

    let elementsArr = [];
    //Loop over elements in the array and create objects from each element 
    //with the data relevant to your logic
    for (let element of elements) {
        elementsArr.push({
           //your logic
        });
    }
    return elementsArr;
});

First things first: since Instagram is a heavy JavaScript-powered React application, the selectors you are after may not be available right after the page is loaded. So we should wait for them to appear in the DOM:

await page.waitForSelector('.v1Nh3.kIKUG._bz0w');
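
If the selector never shows up (for example because the class names have changed or a login wall is displayed), waitForSelector rejects once its timeout expires. A minimal sketch of guarding against that follows; the helper name waitForPosts and the 10-second timeout are assumptions for illustration, not part of the original answer:

```javascript
// Hypothetical helper: waits for the post selector and reports whether it appeared.
// `page` is assumed to be a Puppeteer Page (any object with a compatible
// waitForSelector method works, which keeps the helper easy to test).
async function waitForPosts(page, selector = '.v1Nh3.kIKUG._bz0w', timeoutMs = 10000) {
    try {
        // Rejects if the selector does not appear within `timeoutMs` milliseconds.
        await page.waitForSelector(selector, { timeout: timeoutMs });
        return true;
    } catch (err) {
        console.warn(`Selector "${selector}" not found: ${err.message}`);
        return false;
    }
}
```

Returning a boolean instead of throwing lets the calling code decide whether to retry, log in, or give up.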

Now with page.evaluate we could get the posts, but since you only want the links inside those posts, let's grab them right away in the query:

const result = await page.evaluate(() => {
    // Get elements into a NodeList
    const elements = document.querySelectorAll('.v1Nh3.kIKUG._bz0w a');
    // ...
});

But we can't convert the elements from a NodeList to an Array and just return them, because they're still DOM nodes: complex, unserializable objects. Values returned from page.evaluate need to be serializable. So instead of returning the complete nodes, we'll just get what we need, the URLs from the href attribute:

const result = await page.evaluate(() => {
    // Get elements into a NodeList
    const elements = document.querySelectorAll('.v1Nh3.kIKUG._bz0w a');

    // Convert elements to an array, 
    // then for each item of that array only return the href attribute
    const linksArr = Array.from(elements).map(link => link.href);

    return linksArr;
});
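
Putting the steps above together, the whole flow might be sketched as follows. The helper name collectPostLinks, the constant POST_LINK_SELECTOR, and the 'networkidle2' wait condition are assumptions for illustration; the selector is the one from the question and may no longer match Instagram's markup:

```javascript
// Selector taken from the question; Instagram's class names may differ today.
const POST_LINK_SELECTOR = '.v1Nh3.kIKUG._bz0w a';

// Hypothetical helper: opens `profileUrl` in a new tab of an already-launched
// Puppeteer browser and resolves to an array of href strings.
async function collectPostLinks(browser, profileUrl) {
    const page = await browser.newPage();
    try {
        await page.goto(profileUrl, { waitUntil: 'networkidle2' });
        await page.waitForSelector(POST_LINK_SELECTOR);
        // The callback runs inside the browser, so only serializable values
        // (here: plain strings) can be returned. The selector is passed as an
        // argument because the callback cannot see Node.js variables.
        return await page.evaluate(
            (selector) => Array.from(document.querySelectorAll(selector)).map((a) => a.href),
            POST_LINK_SELECTOR
        );
    } finally {
        // Close the tab even if navigation or the wait failed.
        await page.close();
    }
}

// Usage (assumes `npm install puppeteer` and a publicly visible profile):
// const puppeteer = require('puppeteer');
// const browser = await puppeteer.launch();
// console.log(await collectPostLinks(browser, 'https://www.instagram.com/<profile>/'));
// await browser.close();
```

Passing the browser in, rather than launching it inside the helper, keeps one browser instance reusable across several profiles.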

Other ways to do it

In your question you mentioned the page.$$ method. It is indeed applicable here to get handles of the objects we seek. But the code to iterate over them is not pretty:

const results = await page.$$('.v1Nh3.kIKUG._bz0w a');
for (const handle of results) {
    // getProperty resolves to a JSHandle; jsonValue turns it into a plain string
    console.log(await (await handle.getProperty('href')).jsonValue());
}
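
To answer the first part of the question directly: page.$$ returns a promise that resolves to an array of element handles, so awaiting it is all that is needed to "get the array out". A sketch of collecting the links from those handles into an array, rather than logging them, could look like this; the helper name hrefsFromHandles is hypothetical:

```javascript
// Hypothetical helper: reads the href property from every matching element handle.
// `page` is assumed to be a Puppeteer Page.
async function hrefsFromHandles(page, selector) {
    // Resolves to an array of ElementHandle objects (empty if nothing matches).
    const handles = await page.$$(selector);
    const hrefs = await Promise.all(
        handles.map(async (handle) => {
            const property = await handle.getProperty('href'); // a JSHandle
            return property.jsonValue();                       // a plain value
        })
    );
    // Release the handles so the browser can garbage-collect the nodes.
    await Promise.all(handles.map((handle) => handle.dispose()));
    return hrefs;
}
```

Each handle needs its own round trip to the browser, which is why the single-call approaches are usually preferable for plain data extraction.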

My favourite way to get those links, though, would be to use the page.$$eval method:

const results = await page.$$eval('.v1Nh3.kIKUG._bz0w a', links => links.map(link => link.href))

It does exactly the same as what we did in the page.evaluate solution, but in a much more concise way.
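
Whichever method produces the array, the result is a plain list of strings that can be cleaned up on the Node.js side. The sketch below is one possible post-processing step; the helper name uniquePostUrls is hypothetical, and it assumes that post permalinks have a path starting with /p/:

```javascript
// Hypothetical helper: keeps unique links whose path starts with /p/
// (assumed here to be the pattern of Instagram post permalinks) and
// strips query strings and fragments so duplicates collapse.
function uniquePostUrls(hrefs) {
    const seen = new Set();
    for (const href of hrefs) {
        let url;
        try {
            url = new URL(href);
        } catch {
            continue; // skip values that are not absolute URLs
        }
        if (url.pathname.startsWith('/p/')) {
            url.search = '';
            url.hash = '';
            seen.add(url.toString());
        }
    }
    return [...seen];
}

// Usage with the $$eval call from the answer:
// const results = await page.$$eval('.v1Nh3.kIKUG._bz0w a', links => links.map(link => link.href));
// console.log(uniquePostUrls(results));
```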


 