
How to scrape a website that has a username and password?

I am trying to scrape a website and have my program find all the buttons and links inside it. My problem is that to get to the first page I need to enter a username and a password, and I want to scrape the page that shows after that, but every time it ends up scraping the login page with the username and password fields instead. Does anyone know how to do that? This is the code that I tried:

import requests
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

PATH = r"C:\Program Files (x86)\chromedriver.exe"  # raw string so the backslashes are not treated as escapes
driver = webdriver.Chrome(PATH)
driver.get("https://www.ronitnisan.co.il/admin/UnPermissionPage.asp?isiframe=")
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.NAME, "FirstName"))
    )
except:
    driver.quit()
userName = driver.find_element_by_name("FirstName")
userName.clear()
userName.send_keys("username")
password = driver.find_element_by_name("UserIDNumber")
password.clear()
password.send_keys("username")
time.sleep(0.5)
login = driver.find_element_by_name("submit")
login.click()
URL = 'https://www.ronitnisan.co.il/admin/UnPermissionPage.asp?isiframe='
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup)

You are starting a Chrome 'session' (I am not sure that is the correct word for it) up to and including the try: block. You use that session to enter the username and password; so far so good.

Then you abandon that session altogether and just use a requests.get() call to fetch the URL. That request does not carry any login information (no cookies or session from your browser), because the login was done via the driver variable.

The human equivalent of this is logging into a website with Firefox and then trying to visit the same website with Edge. They don't share the same session, so you would have to log in again in Edge.

What you might want to try is something like this (after login.click()):

soup = BeautifulSoup(driver.page_source, 'lxml')
print(soup)

Replace

URL = 'https://www.ronitnisan.co.il/admin/UnPermissionPage.asp?isiframe='
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup)

with

driver.get(URL)

and then use find_element to track down the parts of the page you are interested in, for example:
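A minimal sketch of that idea, using the same old-style Selenium API as the question; the tag names are only illustrative, since the question asks about buttons and links:

links = driver.find_elements_by_tag_name("a")         # every link on the logged-in page
buttons = driver.find_elements_by_tag_name("button")  # every button on the logged-in page

for link in links:
    print(link.get_attribute("href"), link.text)
for button in buttons:
    print(button.get_attribute("name"), button.text)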

Otherwise, you would want to capture the cookies from the driver and reuse them with requests, along these lines:
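A hedged sketch of that approach, assuming login.click() has already succeeded; session is a name introduced here, not something from the original code:

import requests

session = requests.Session()
for cookie in driver.get_cookies():              # cookies set during the Selenium login
    session.cookies.set(cookie['name'], cookie['value'])

page = session.get(URL)                          # this request now carries the login cookies
soup = BeautifulSoup(page.content, 'html.parser')
print(soup)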

How about this?

import time

from bs4 import BeautifulSoup
from selenium import webdriver

profile = webdriver.FirefoxProfile()
profile.accept_untrusted_certs = True  # accept self-signed/untrusted TLS certificates

wd = webdriver.Firefox(executable_path="C:/Utility/geckodriver.exe", firefox_profile=profile)
url = "https://the_url"
wd.get(url)

# set username
time.sleep(5)
username = wd.find_element_by_id("FirstName")
username.send_keys("your_id")
#wd.find_element_by_id("identifierNext").click()

# set password
#time.sleep(2)
password = wd.find_element_by_id("Password1")
password.send_keys("your_password")
elements = wd.find_elements_by_class_name("submit")
for e in elements:
    e.click()

# fixed 10-second pause to give the logged-in page time to load
# (an explicit wait would be more robust; see the sketch after this block)
time.sleep(10)
content = wd.page_source
print(BeautifulSoup(content, 'html.parser'))


file = open('C:\\your_path_here\\test.txt', 'w', encoding='utf-8')
file.write(content)
file.close()
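As noted in the comment above the pause, a more robust alternative to the fixed time.sleep(10) is an explicit wait. A sketch under the assumption that some element with id "theID" appears only once you are logged in (the id is a placeholder, as in the original comment):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

WebDriverWait(wd, 10).until(
    EC.visibility_of_element_located((By.ID, "theID"))
)
content = wd.page_source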
