Scrapy Web Scraping and Facebook

Any thoughts on why I can't log in? I've been trying to log in to Facebook and LinkedIn using the same method, with no success. I'm using the most recent version of Scrapy. I am trying to reach 'Messages' as a test, but I know it doesn't work because I get redirected back to the login page. The same thing happens on LinkedIn.

import logging

import scrapy
from scrapy.http import FormRequest, Request

from linkedIn.items import LinkedinItem
#from spider.settings import JsonWriterPipeline


# scrapy.Spider is used here because CrawlSpider reserves parse() for its own rule handling.
class MySpider(scrapy.Spider):
    name = 'fb'
    allowed_domains = ['facebook.com']
    start_urls = ['https://login.facebook.com/login.php']

    def parse(self, response):
        # Fill in the login form found on the page and submit it.
        return [FormRequest.from_response(response,
                    formname='login_form',
                    formdata={'email': 'my_email@example.com',
                              'pass': 'test!'},
                    callback=self.after_login)]

    def after_login(self, response):
        # check login succeed before going on
        # response.text is the decoded body; response.body is bytes.
        if "the password you entered is incorrect" in response.text:
            self.log("\n\n\n\nLogin failed\n\n\n\n", level=logging.ERROR)
            return
        else:
            self.log("\n\n\n Login was successful!!!\n\n\n")
            self.log(response.text)
            return Request(url="https://facebook.com/messages",
                           callback=self.parse_items)

    def parse_items(self, response):
        titles = response.xpath("//title")
        items = []
        for title in titles:
            item = LinkedinItem()
            # Read the text of the current node instead of re-querying the whole document.
            item['friendName'] = title.xpath("text()").extract()
            #item['numberOffriends']= titles.select("some path here").extract().pop()
            items.append(item)
        return items

Both Facebook and LinkedIn use CSRF tokens. You first have to GET the page with the login form, then parse the HTML to extract the CSRF token, and finally make a POST request with the username, password, and CSRF token.
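
The parse-and-POST steps described above could be sketched roughly as follows, using only the Python standard library and no network access. The sample HTML, the hidden field name `csrf_token`, and the helper names `HiddenInputParser` and `build_login_payload` are assumptions made for illustration; the real login pages use their own field names and may include several hidden fields.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenInputParser(HTMLParser):
    """Collects the name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.hidden[attr["name"]] = attr.get("value") or ""


def build_login_payload(login_html, username, password):
    """Merge the hidden form fields (CSRF token included) with the credentials."""
    parser = HiddenInputParser()
    parser.feed(login_html)
    payload = dict(parser.hidden)
    payload.update({"email": username, "pass": password})
    return payload


# Hypothetical body returned by the GET request for the login page.
SAMPLE_LOGIN_HTML = """
<form id="login_form" method="post" action="/login.php">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="email">
  <input type="password" name="pass">
</form>
"""

payload = build_login_payload(SAMPLE_LOGIN_HTML, "my_email@example.com", "test!")
body = urlencode(payload)  # form-encoded body for the POST request
print(body)
```

The POST also has to carry the cookies set by the initial GET, otherwise the token is unlikely to validate. In Scrapy, `FormRequest.from_response` is documented as pre-populating fields found in the page's form, so it may already cover the token step; the sketch only makes the flow explicit.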
