Instagram scraping

The following code works on my computer to scrape data from an Instagram account. When I run it on a VPS, I'm redirected to the Instagram login page, so the script doesn't work.

Why doesn't Instagram react the same way on a computer and on a server?

It's the same with wget: on a computer I get the profile page, on a server I'm redirected to the login page.

import requests
import re


class InstagramScraper:
    """
    Scraper of Instagram profiles infos.
    """

    def __init__(self, session: requests.Session, instagram_account_name: str):
        self.session = session
        self._account_name = self.clean_account_name(instagram_account_name)
        self.load_data()

    def load_data(self):
        url = "https://www.instagram.com/{account_name}/".format(account_name=self._account_name)
        response = self.session.get(url)
        # A blocked or anonymous client is redirected to the login page,
        # where none of the patterns below can match.
        if "/accounts/login" in response.url:
            raise RuntimeError("Redirected to the Instagram login page: " + response.url)

        def extract(pattern: str) -> str:
            """Return the first captured group, or fail with a readable error."""
            match = re.search(pattern, response.text)
            if match is None:
                raise ValueError("Pattern not found in profile page: " + pattern)
            return match.group(1)

        # Counts are converted to int so the properties match their annotations.
        self._publications = int(extract(r'"edge_owner_to_timeline_media":{"count":(\d+)'))
        self._followers = int(extract(r'"edge_followed_by":{"count":(\d+)'))
        self._following = int(extract(r'"edge_follow":{"count":(\d+)'))
        # Stop at the first double quote instead of matching greedily and splitting.
        self._title = extract(r'"full_name":"([^"]*)"')

    def clean_account_name(self, value) -> str:
        """
        Return the account name without the url address.
        """
        # re.search returns a Match object or None, not a str.
        found = re.search(r"https://www\.instagram\.com/([^/]+)/", value)
        if found:
            return found.group(1)
        return value

    @property
    def publications(self) -> int:
        """
        Number of publications by this account.
        """
        return self._publications

    @property
    def followers(self) -> int:
        """
        Number of followers of this account.
        """
        return self._followers

    @property
    def title(self) -> str:
        """
        Name of the Instagram profile.
        """
        return self._title

    @property
    def account(self) -> str:
        """
        Account name used on Instagram.
        """
        return self._account_name

    @property
    def following(self) -> int:
        """
        Number of accounts this profile is following.
        """
        return self._following

    def __str__(self) -> str:
        return str({
            'Account': self.account,
            'Followers': self.followers,
            'Publications': self.publications,
            'Following': self.following,
            'Title': self.title,
        })


if __name__ == "__main__":
    with requests.Session() as session:
        scraper = InstagramScraper(session, "https://www.instagram.com/ksc_lokeren/")
        print(scraper)

You see the login prompt from Instagram because you are being blocked. Instagram has detected that the request does not come from someone browsing the website manually.
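To confirm from a script that this is what is happening, one option is to request the profile without following redirects and inspect where Instagram is sending you. The sketch below is only an illustration: `is_login_redirect` and `check_profile` are hypothetical helper names, and it assumes the login page lives under `/accounts/login`, as in the URL mentioned later in this thread.

```python
from urllib.parse import urlparse

# Assumption: the login page path, taken from the URL quoted in this thread.
LOGIN_PATH = "/accounts/login"
REDIRECT_CODES = (301, 302, 303, 307, 308)


def is_login_redirect(status_code, location):
    """Return True if the response is a redirect to the Instagram login page."""
    if status_code not in REDIRECT_CODES or not location:
        return False
    return urlparse(location).path.startswith(LOGIN_PATH)


def check_profile(session, account_name):
    """Request a profile without following redirects.

    `session` is expected to be a requests.Session. Returns True when
    Instagram answers with a redirect to the login page.
    """
    url = "https://www.instagram.com/{}/".format(account_name)
    response = session.get(url, allow_redirects=False)
    return is_login_redirect(response.status_code,
                             response.headers.get("Location"))
```

Running `check_profile(requests.Session(), "ksc_lokeren")` on both machines should show whether only the server is being redirected.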

If you want to extract Instagram profile data, you will have to rely on a scraping API, because Instagram blocks direct requests very quickly.

Here is a good tutorial on scraping user profile data and posts, including pagination, through a scraping API: https://scrapingfish.com/blog/scraping-instagram

It might be because you are logged in with your own credentials on your computer. furas mentioned a blacklist, but if you have never run the script on this server before, I doubt that is the cause.

What I did to avoid the redirect was to use a headless browser, which simulates a normal browser and lets you navigate websites. You simulate a login with your credentials, retrieve the csrftoken and sessionid from the cookies, and close the browser.

I wrote mine in JavaScript, so I can't really show it here, but the logic is this:

  1. Create your headless browser

  2. Set the 'accept-language' header of your request to 'en-US'

  3. Navigate to https://www.instagram.com/accounts/login/ and wait until the page is idle

  4. Emulate the sign-in with your credentials. Look for these selectors:

    'input[name="password"]' for the password

    'input[name="username"]' for the username

    'button[type="submit"]' for the login button

  5. Wait until idle

  6. Get all cookies and retrieve the csrftoken and sessionid

  7. Close the headless browser
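The steps above could be sketched in Python roughly as follows. This assumes Playwright for Python as the headless browser, which is my choice for illustration and not what the original JavaScript version used. The function names are hypothetical, and the selectors are the ones listed in step 4, so they may no longer match the current login page.

```python
def pick_session_cookies(cookies):
    """Keep only csrftoken and sessionid from a list of cookie dicts.

    Each cookie is expected to look like {"name": ..., "value": ...}.
    """
    wanted = ("csrftoken", "sessionid")
    return {c["name"]: c["value"] for c in cookies if c["name"] in wanted}


def login_and_get_cookies(username, password):
    """Log in through a headless browser and return the two session cookies."""
    # Assumption: Playwright is installed. It is imported here so that
    # pick_session_cookies stays usable without it.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)               # step 1
        context = browser.new_context(
            extra_http_headers={"Accept-Language": "en-US"})     # step 2
        page = context.new_page()
        page.goto("https://www.instagram.com/accounts/login/",
                  wait_until="networkidle")                      # step 3
        page.fill('input[name="username"]', username)            # step 4
        page.fill('input[name="password"]', password)
        page.click('button[type="submit"]')
        page.wait_for_load_state("networkidle")                  # step 5
        cookies = context.cookies()                              # step 6
        browser.close()                                          # step 7
    return pick_session_cookies(cookies)
```

A real login flow may also show a cookie-consent dialog or a verification challenge, which this sketch does not handle.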

Then, when making any request to https://www.instagram.com/{account_name}/, don't forget to send the csrftoken and sessionid cookies with the request. After a while they will expire, and you will need to log in again to get new ones.
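With requests, one way to do this is to put both cookies in the session's cookie jar, so every later call sends them automatically. This is a minimal sketch: `apply_login_cookies` is a hypothetical helper, and the `.instagram.com` cookie domain is my assumption.

```python
def apply_login_cookies(session, cookies):
    """Copy csrftoken and sessionid onto a requests.Session's cookie jar.

    `cookies` is a dict such as {"csrftoken": "...", "sessionid": "..."}.
    """
    for name in ("csrftoken", "sessionid"):
        # Assumption: scoping the cookies to .instagram.com is sufficient.
        session.cookies.set(name, cookies[name], domain=".instagram.com")
    return session


# Usage, with the scraper class from the question:
#   session = apply_login_cookies(requests.Session(), cookies)
#   print(InstagramScraper(session, "ksc_lokeren"))
```

When the cookies expire, requests will be redirected to the login page again, which is the signal to repeat the login and call the helper with the new values.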
