How to scrape information from a website that requires login
I'm working on a Python web-scraping project. The site I'm trying to scrape contains information about all the drugs sold in India, and it requires users to log in before this information can be accessed.
I want to visit every link on https://mims.com/india/browse/alphabet/a?cat=drug&tab=brand and store them in an array.
Here is the code I use to log in to the site:
##################################### Method 1
import mechanize
import http.cookiejar as cookielib
from bs4 import BeautifulSoup
import html2text
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Chrome')]
br.open('https://sso.mims.com/Account/SignIn')
# View available forms
for f in br.forms():
    print(f)
br.select_form(nr=0)
# User credentials
br.form['EmailAddress'] = "<USERNAME>"
br.form['Password'] = "<PASSWORD>"
# Login
br.submit()
print(br.open('https://mims.com/india/browse/alphabet/a?cat=drug&tab=brand').read())
The problem is that when the credentials are submitted, an intermediate page pops up with the following message:
You will be redirected to your destination shortly.
This page submits a hidden form before the desired end page is shown. I want to access the end page, but br.open('https://mims.com/india/browse/alphabet/a?cat=drug&tab=brand').read() fetches the intermediate page and prints its contents.
How can I wait for the intermediate page to submit its hidden form, and then access the content of the end page?
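For context, such an intermediate page usually contains an auto-submitting hidden form; with JavaScript disabled it can instead be parsed and re-posted manually. A minimal sketch of extracting the form's action URL and fields with BeautifulSoup (the HTML below is a made-up stand-in for the real intermediate page, though the id openid_message does appear on this site):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the intermediate page's HTML
intermediate_html = """
<form id="openid_message" action="https://mims.com/Account/SignInCallback" method="post">
    <input type="hidden" name="openid.mode" value="id_res">
    <input type="hidden" name="openid.identity" value="user123">
    <input type="submit" value="Continue">
</form>
"""

soup = BeautifulSoup(intermediate_html, "html.parser")
form = soup.find("form", {"id": "openid_message"})
action = form["action"]  # URL the browser would auto-submit to
# Collect every named input so the form can be re-posted with requests
fields = {inp["name"]: inp.get("value", "")
          for inp in form.find_all("input") if inp.get("name")}
print(action)
print(fields)
```

Posting `fields` to `action` with the same session reproduces what the browser's auto-submit would do.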
I've posted a selenium solution below, which works, but after learning more about the login process it turns out you can log in using only BeautifulSoup and requests. Please read the comments in the code.
import requests
from bs4 import BeautifulSoup
d = {
    "EmailAddress": "your@email.tld",
    "Password": "password",
    "RememberMe": True,
    "SubscriberId": "",
    "LicenseNumber": "",
    "CountryCode": "SG"
}
req = requests.Session()
login_u = "https://sso.mims.com/"
html = req.post(login_u, data=d)
products_url = "https://mims.com/india/browse/alphabet/a?cat=drug"
html = req.get(products_url) # The cookies generated on the previous request will be used on this one automatically because we use a Session
# Here's the tricky part. The site uses 2 intermediary "relogin" pages that (theoretically) are only available with JavaScript enabled, but we can bypass that, i.e.:
soup = BeautifulSoup(html.text, "html.parser")
form = soup.find('form', {"id": "openid_message"})
form_url = form['action'] # used on the next post request
inputs = form.find_all('input')
form_dict = {}
for inp in inputs:  # "inp" avoids shadowing the built-in input()
    if inp.get('name'):
        form_dict[inp.get('name')] = inp.get('value')
form_dict['submit_button'] = "Continue"
relogin = req.post(form_url, data=form_dict)
soup = BeautifulSoup(relogin.text, "html.parser")
form = soup.find('form', {"id": "openid_message"})
form_url = form['action'] # used on the next post request
inputs = form.find_all('input')
form_dict = {}
for inp in inputs:
    if inp.get('name'):
        form_dict[inp.get('name')] = inp.get('value')
products_a = req.post(form_url, data=form_dict)
print(products_a.text)
# You can now request any url normally because the necessary cookies are already present on the current Session()
products_url = "https://mims.com/india/browse/alphabet/c?cat=drug"
products_c = req.get(products_url)
print(products_c.text)
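The original question asks for all the links on the page stored in an array. A hedged sketch of that last step, using a small helper (extract_links is my own name, not part of the answer above) that can be applied to products_a.text or products_c.text:

```python
from bs4 import BeautifulSoup

def extract_links(html):
    """Return all href values found in anchor tags, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]

# Stand-in HTML for illustration; in practice pass products_c.text
sample = '<a href="/india/drug/info/a1">A1</a><a href="/india/drug/info/a2">A2</a>'
print(extract_links(sample))  # → ['/india/drug/info/a1', '/india/drug/info/a2']
```

On the real page you would likely filter the hrefs (e.g. keep only those containing "/india/") before storing them, since navigation links are also anchors.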
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
from time import sleep
driver = webdriver.Firefox()
wait = WebDriverWait(driver, 10)
driver.maximize_window()
driver.get("https://sso.mims.com/")
el = wait.until(EC.element_to_be_clickable((By.ID, "EmailAddress")))
el.send_keys("your@email.com")
el = wait.until(EC.element_to_be_clickable((By.ID, "Password")))
el.send_keys("password")
el = wait.until(EC.element_to_be_clickable((By.ID, "btnSubmit")))
el.click()
wait.until(EC.element_to_be_clickable((By.CLASS_NAME, "profile-section-header"))) # we logged in successfully
driver.get("http://mims.com/india/browse/alphabet/a?cat=drug")
wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "searchicon")))
print(driver.page_source)
# do what you need with the source code