I can't log in to Instagram with Requests

I've been trying to log in to Instagram using the Requests library, but I can't get it to work. The connection always gets refused.

import requests

#Creating URL, usr/pass and user agent variables

BASE_URL = 'https://www.instagram.com/'
LOGIN_URL = BASE_URL + 'accounts/login/ajax/'
USERNAME = '******'
PASSWD = '******'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)\
 Chrome/59.0.3071.115 Safari/537.36'

#Setting some headers and refers
session = requests.Session()
session.headers = {'user-agent': USER_AGENT}
session.headers.update({'Referer': BASE_URL})


try:
    #Requesting the base url. Grabbing and inserting the csrftoken

    req = session.get(BASE_URL)
    session.headers.update({'X-CSRFToken': req.cookies['csrftoken']})
    login_data = {'username': USERNAME, 'password': PASSWD}

    #Finally login in
    login = session.post(LOGIN_URL, data=login_data, allow_redirects=True)
    session.headers.update({'X-CSRFToken': login.cookies['csrftoken']})

    cookies = login.cookies

    #Print the html results after I've logged in
    print(login.text)

#In case of refused connection
except requests.exceptions.ConnectionError:
    print("Connection refused")

I don't know what I'm doing wrong. I would really appreciate it if anyone could post a solution. Please do not suggest the API or Selenium (they're not an option for me at the moment).

Since Requests doesn't execute JavaScript, you don't have the csrftoken in your cookies.

If you have a look at the response content, you can find the csrf_token inside the HTML.

Using bs4 and json you can extract it and use it in your POST request.

from bs4 import BeautifulSoup
import json, random, re, requests

BASE_URL = 'https://www.instagram.com/accounts/login/'
LOGIN_URL = BASE_URL + 'ajax/'

headers_list = [
        "Mozilla/5.0 (Windows NT 5.1; rv:41.0) Gecko/20100101"\
        " Firefox/41.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2)"\
        " AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2"\
        " Safari/601.3.9",
        "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0)"\
        " Gecko/20100101 Firefox/15.0.1",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"\
        " (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36"\
        " Edge/12.246"
        ]


USERNAME = '****'
PASSWD = '*****'
USER_AGENT = random.choice(headers_list)

session = requests.Session()
session.headers = {'user-agent': USER_AGENT}
session.headers.update({'Referer': BASE_URL})    
req = session.get(BASE_URL)    
soup = BeautifulSoup(req.content, 'html.parser')    
body = soup.find('body')

pattern = re.compile('window._sharedData')
script = body.find("script", text=pattern)

script = script.get_text().replace('window._sharedData = ', '')[:-1]
data = json.loads(script)

csrf = data['config'].get('csrf_token')
login_data = {'username': USERNAME, 'password': PASSWD}
session.headers.update({'X-CSRFToken': csrf})
login = session.post(LOGIN_URL, data=login_data, allow_redirects=True)
login.content
# b'{"authenticated": true, "user": true, "userId": "*******", "oneTapPrompt": false, "status": "ok"}'
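
To act on that response rather than just eyeball it, you could parse the body and check the fields shown above. This is only a minimal sketch: the helper name login_succeeded is made up for illustration, the "userId" value is a placeholder, and the shape of a failed-login body is an assumption, since only the successful response appears above.

```python
import json


def login_succeeded(body: bytes) -> bool:
    """Return True only if the body is JSON reporting a successful login."""
    try:
        payload = json.loads(body)
    except ValueError:
        # The body is not JSON at all (for example an HTML error page).
        return False
    if not isinstance(payload, dict):
        return False
    return payload.get("authenticated") is True and payload.get("status") == "ok"


# Same shape as the response shown above; "userId" is a placeholder value.
sample = (
    b'{"authenticated": true, "user": true, "userId": "0", '
    b'"oneTapPrompt": false, "status": "ok"}'
)
print(login_succeeded(sample))

# Assumed shape of a rejected login; the real body may differ.
print(login_succeeded(b'{"authenticated": false, "user": true, "status": "ok"}'))
```

With the session from the code above, the call would be login_succeeded(login.content).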

Keep in mind that most of the data on Instagram is loaded with JavaScript, so you may run into more trouble in the future.

You can refer to this post on how to recover data: https://stackoverflow.com/a/49831347

Or you can use a different library such as dryscrape or spynner.

Based on my research, it is not the script that comes back empty, but .get_text().

Regarding script returning empty in 2Pacho's answer: it's not Instagram that changed since his post, but rather the behavior of the get_text() method, which was altered in April 2020. From the bs4 documentation:

As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of <script>, <style>, and <template> tags are not considered to be 'text', since those tags are not part of the human-visible content of the page.

Using .contents[0] instead will do the job as intended:

script = script.contents[0].replace('window._sharedData = ', '')[:-1]
data = json.loads(script)
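
If you would rather not depend on how a particular bs4 version treats script tags, the same JSON can be pulled out with the standard library alone. The sketch below is not from either answer: it assumes the page still embeds a `window._sharedData = {...};` assignment as described above, and the names extract_shared_data and sample_html, as well as the token value, are hypothetical.

```python
import json
import re

# Matches the assignment prefix; the JSON object starts right after it.
_MARKER = re.compile(r"window\._sharedData\s*=\s*")


def extract_shared_data(html: str):
    """Return the decoded _sharedData object, or None if the marker is absent."""
    match = _MARKER.search(html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (the trailing ';' and closing tags).
    data, _end = json.JSONDecoder().raw_decode(html, match.end())
    return data


# Hypothetical page fragment, used only to exercise the function offline.
sample_html = (
    '<html><body><script type="text/javascript">'
    'window._sharedData = {"config": {"csrf_token": "abc123"}};'
    "</script></body></html>"
)

shared = extract_shared_data(sample_html)
csrf = shared["config"].get("csrf_token") if shared else None
print(csrf)  # abc123
```

Because raw_decode stops at the end of the JSON object, the [:-1] slice used to drop the trailing semicolon is not needed here.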
