
Web Scraping with Python requests

I want to scrape https://sparrow.eoffice.gov.in/IPRSTATUS/IPRFiledSearch and download every PDF file that appears in the search results as of a given date (say 01-01-2016). The employee fields are optional; clicking Search with them left empty returns a list of all the employees. I cannot get the POST request to work with Python requests and keep getting a 405 error. My code is below.

from bs4 import BeautifulSoup
import requests

url = "https://sparrow.eoffice.gov.in/IPRSTATUS/IPRFiledSearch"

data = {
    'assessmentYearId':'vH4pgBbZ8y8rhOFBoM0g7w',
    'empName':'',
    'allotmentYear':'',
    'cadreId':'',
    'iprReportType':'cqZvyXc--mpmnRNfPp2k7w',
    'userType':'JgPOADxEXU1jGi53Xa2vGQ',
    '_csrf':'7819ec72-eedf-4290-ba70-6f2b14cc4b79'
}

headers = {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding':'gzip, deflate',
    'Accept-Language':'en-US,en;q=0.8',
    'Cache-Control':'max-age=0',
    'Connection':'keep-alive',
    'Content-Length':'184',
    'Content-Type':'application/x-www-form-urlencoded',
    'Upgrade-Insecure-Requests':'1',
    'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

response = requests.post(url,data=data,headers=headers)

I'm not familiar with the website but I strongly suggest reading their policy before trying to scrape the content.

In similar scenarios, when a simple POST does not return the expected result, using requests.Session usually helps, because a session keeps cookies across requests.
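As a rough illustration of the suggestion above, the sketch below builds a requests.Session with a default User-Agent header. The helper name make_session and the header value are made up for this example; nothing here is specific to the site in the question.

```python
import requests


def make_session(user_agent="Mozilla/5.0"):
    """Return a Session that keeps cookies and sends a default User-Agent.

    make_session is a hypothetical helper name used only for this sketch.
    """
    session = requests.Session()
    # Headers set on the session are sent with every request made through it.
    session.headers.update({"User-Agent": user_agent})
    return session
```

Any cookie the server sets on a session.get(...) call is then sent automatically on a later session.post(...) made through the same object, which a bare requests.post does not do.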

The problem lay in my reusing the same CSRF token. It needs to be changed with every request.
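One possible way to pick up a fresh token before each POST is sketched below. It assumes the search page embeds the token in a hidden input named _csrf (the field name comes from the form data in the question; its location in the HTML is an assumption), and the function names are hypothetical. The session argument is expected to be a requests.Session.

```python
from bs4 import BeautifulSoup


def extract_csrf_token(html, field_name="_csrf"):
    """Return the value of the input named field_name in the page HTML.

    Assumes the token is delivered as a form input; raises ValueError if not found.
    """
    soup = BeautifulSoup(html, "html.parser")
    field = soup.find("input", attrs={"name": field_name})
    if field is None or not field.get("value"):
        raise ValueError("no %r input with a value found in the page" % field_name)
    return field["value"]


def search_with_fresh_token(session, url, form_data):
    """GET the form page, read the current token, then POST the search form."""
    page = session.get(url)
    page.raise_for_status()
    payload = dict(form_data)  # copy so the caller's dict is not modified
    payload["_csrf"] = extract_csrf_token(page.text)
    return session.post(url, data=payload)
```

Doing the GET and the POST through the same session matters here, since the token is usually tied to the session cookie issued with the form page.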

