
Using Python to scrape a site (no form, session cookies)

I'm trying to scrape a website for which I have a legitimate login. When you try to visit this site, you get redirected to Verify.aspx until you enter a legitimate value for access-code. Using the requests library for Python, I have tried the following:

import requests

url1 = "<url>/Verify.aspx"
payload = {"access-code": "xxxxxxxx" }
ses = requests.Session()
r = requests.get(url1, data=payload)

When I then look at the value of r.cookies, I see that I've grabbed a bunch of cookies, stored in a cookie jar:

<RequestsCookieJar[Cookie(version=0, name='ASP.NET_SessionId', value='...)]>

At this point, I'd like to retain the session information and include it in future requests. For example, below I'd like to browse a normal page (i.e. home), so I try to visit the page and send the cookies along.

test = ses.post('<url>/home', cookies=r.cookies)

However, at this stage, when I look at test.text in Python, I can see from the returned HTML that I've just been redirected back to the original Verify.aspx page.
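For reference, one way to confirm that kind of redirect is to inspect the response's redirect chain, which requests exposes through the standard Response.history and Response.url attributes:

print(test.url)   # final URL after any HTTP redirects
for resp in test.history:
    # each hop records its status code, its URL, and where it redirected to
    print(resp.status_code, resp.url, resp.headers.get('Location'))

(If history is empty, the bounce back to Verify.aspx is happening some other way, e.g. via a meta refresh or JavaScript rather than an HTTP redirect.)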

I've done a fair amount of googling with no success. I have some familiarity with Python but none with scraping. I'd actually prefer an R solution since I'm better with it, but it seems to me that the Python scraping libraries are better than the R packages. I don't want to use something like Selenium, unless it's via Python or R, since I want to pull and process data without any user interaction.

I feel stuck - I know that I'm passing a legitimate code, and am getting valid session cookies back since I can log in just fine via the normal webpage. I just don't know how to capture, save and pass the session cookie information back to the page during the ensuing URL calls.

Everything I've read indicates that the requests library should handle everything about the cookies, but I think I'm just passing them incorrectly.

Can someone suggest what I might try next?


EDIT: Thank you for looking, @Faboor. I think I'm on a better track, since I now have a different error message. Now when I look at the content of print(test.text), it says "Your browser sent a request that this server could not understand." Is it OK that the second URL I try to access is not a discrete page (i.e. it seems like a folder, rather than something like index.html)?
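For reference, the status code and the request that was actually sent can be inspected on the response object, which may help narrow down an error like this:

print(test.status_code)                       # e.g. 400 for a malformed request
print(test.request.method, test.request.url)  # what was actually sent
print(test.request.headers)                   # headers requests added on your behalf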

In your example, you create a session but don't use it to log in. Assuming this is just a cookie-handling issue, using ses.get and ses.post instead of requests.get should solve your problem.

import requests

url1 = "<url>/Verify.aspx"
payload = {"access-code": "xxxxxxxx"}
ses = requests.Session()
# Log in through the session so the cookies it receives are stored on it
r = ses.get(url1, data=payload)
# The session sends its stored cookies automatically on later requests
test = ses.post('<url>/home')

You can check which cookies are stored on the session using ses.cookies. For better readability (although you lose some information about cookie origins, such as domain and path) you can use dict(ses.cookies).
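For example, with the session from the snippet above:

print(ses.cookies)        # full RequestsCookieJar, including domain and path
print(dict(ses.cookies))  # plain name -> value mapping, easier to read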

For more information about requests sessions, check out the advanced usage docs.
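Those docs also show the session being used as a context manager, which closes it automatically when you are done. A short sketch of that pattern, reusing the placeholder URL and payload from above:

import requests

url1 = "<url>/Verify.aspx"
payload = {"access-code": "xxxxxxxx"}

with requests.Session() as ses:
    ses.get(url1, data=payload)    # login cookies are stored on the session
    test = ses.post('<url>/home')  # and sent automatically on this request
    print(test.status_code)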
