Crawl data from a website using python

I would like to crawl some data from a website. To access the target data manually, I need to log in and then click some buttons to finally reach the target HTML page. Currently, I am using the Python requests library to simulate this process, like so:

import requests

ss = requests.Session()
# log in
resp = ss.post(url, data={'username': 'xxx', 'password': 'xxx'})
# then send a request to the target url
result = ss.get(target_url)

However, the final request does not return what I want.
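As a rough sketch using the variables above, one quick sanity check is to inspect the login response before requesting the target page; many sites return 200 even when the login fails, so the final URL or the page content is usually more telling:

print(resp.status_code)               # 200 alone does not prove the login worked
print(resp.url)                       # a redirect to a dashboard/home page is a good sign
print('logout' in resp.text.lower())  # heuristic: a logout link usually means we are logged in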

So I changed the method. I captured all the network traffic and looked at the headers and cookies of the last request. I found that some values differ between login sessions, such as the sessionid and a few other variables. So I traced back to where these variables are returned in the responses and fetched the values again by sending the corresponding requests. After that, I constructed the correct headers and cookies and sent the request like this:

resp = ss.get(target_url, headers = myheader, cookies = mycookie)

But still, it does not return anything useful. Can anyone help?
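If the changing values come from the login page itself (a hidden CSRF field, for example), a sketch of how that reconstruction is often automated looks like the following; the field name csrftoken is a placeholder rather than the site's real one, and BeautifulSoup is an extra dependency (pip install beautifulsoup4):

from bs4 import BeautifulSoup
import requests

ss = requests.Session()

# fetch the login page first and pull any per-session values out of the form
login_page = ss.get(url)
soup = BeautifulSoup(login_page.text, 'html.parser')
token = soup.find('input', {'name': 'csrftoken'})['value']  # hypothetical hidden field

# include the extracted value in the login POST; the session object then keeps
# cookies such as sessionid automatically for the follow-up request
resp = ss.post(url, data={'username': 'xxx', 'password': 'xxx', 'csrftoken': token})
result = ss.get(target_url)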

I was in the same boat some time ago, and I eventually switched from trying to get requests to work to using Selenium instead, which made life much easier (pip install selenium). You can then log into a website and navigate to the desired page like this:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

website_with_logins = "https://website.com"
website_to_access_after_login = "https://website.com/page"

# start a browser session (any installed WebDriver works, e.g. Chrome or Firefox)
driver = webdriver.Chrome()

# open the login page and fill in the credentials
driver.get(website_with_logins)
username = driver.find_element_by_name("username")
username.send_keys("your_username")
password = driver.find_element_by_name("password")
password.send_keys("your_password")
password.send_keys(Keys.RETURN)

# once logged in, the session cookies live in the browser, so just navigate on
driver.get(website_to_access_after_login)

Once you have website_to_access_after_login loaded (you'll see it appear), you can grab the HTML and have a field day with it using just:

html = driver.page_source
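From there, a minimal sketch of parsing that HTML, assuming BeautifulSoup is installed and using placeholder tag/class names for whatever the target page actually contains:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
for cell in soup.find_all('td', class_='data-cell'):  # hypothetical selector
    print(cell.get_text(strip=True))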

Hope this helps.
