Webscraping with Selenium and Python

I'm a beginner in coding and am trying to learn web scraping with Selenium. I have been working on a project that uses a dictionary to check how long it would take to crack a password made of each single word. My code reads a .txt file that has one word per line, types each word into the input bar, and is supposed to copy how long it would take to crack it. The problem is that I cannot capture one part of the page's HTML, and I need help.

This is my code:

# This program runs through a Spanish dictionary and checks how secure each word is as a password

import random
import time
from selenium import webdriver

# Paste the Chromedriver path here
CHROMEDRIVERPATH = r"C:\Program Files (x86)\chromedriver.exe"
# Paste the dictionary path here (.txt format)
# readFile and writeFile are helper functions that are not shown in this post
dictionary = readFile("spanish_dictionary.txt")
date = str(time.strftime("%Y-%m-%dT%H-%M-%S"))

# Start the browser
driver = webdriver.Chrome(CHROMEDRIVERPATH)

# Target webpage
driver.get("https://www.security.org/how-secure-is-my-password/")
time.sleep(2)

# Header row
writeFile("results_" + date + ".txt","word,time \n")
# One row per word
for word in dictionary:
    bar = driver.find_element_by_id('password')
    bar.send_keys(word)
    bar.clear()
    timeToCrack = driver.find_element_by_xpath('//*[@id="hsimp"]/div[1]/div[3]/p[2]').get_attribute("class")
    result = word + "," + timeToCrack + "\n"
    writeFile("results_" + date + ".txt",result)
    time.sleep(random.uniform(0.4,1.0))

This is the HTML of the relevant element on the page:

<p class="result__text result__time">2 hundred microseconds</p>

I get this in the output file:

word,time 
a,result__text result__time
aba,result__text result__time
abaá,result__text result__time

I want this:

word,time 
a,6 hundred picoseconds
aba,4 hundred nanoseconds
abaá,5 milliseconds

You want the element's text rather than its class attribute:

timeToCrack = driver.find_element_by_xpath('//*[@id="hsimp"]/div[1]/div[3]/p[2]').text

The Java equivalent is:

driver.findElement(By.xpath("//*[@id=\"hsimp\"]/div[1]/div[3]/p[2]")).getText();
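The difference between the two calls can be seen without a browser. Below is a minimal sketch that parses the <p> snippet from the question with Python's standard-library html.parser. The class name TextAndClass is hypothetical and this is not how Selenium works internally; it only illustrates that the class attribute and the text content are two separate pieces of the same element.

```python
from html.parser import HTMLParser

# The element from the question, as a plain string.
SNIPPET = '<p class="result__text result__time">2 hundred microseconds</p>'


class TextAndClass(HTMLParser):
    """Collects the class attribute and the text content of a <p> element."""

    def __init__(self):
        super().__init__()
        self.class_attr = None
        self.text = ""

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs from the opening tag.
        if tag == "p":
            self.class_attr = dict(attrs).get("class")

    def handle_data(self, data):
        # Text found between the opening and closing tags.
        self.text += data


parser = TextAndClass()
parser.feed(SNIPPET)
parser.close()
print(parser.class_attr)  # roughly what get_attribute("class") returns
print(parser.text)        # roughly what .text returns
```

The first value is the string that ended up in the question's output file; the second is the value the asker wanted.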

To extract and print the result, you need to induce WebDriverWait for visibility_of_element_located(). You can use either of the following locator strategies:

  • Using CSS_SELECTOR and get_attribute():

     driver.get('https://www.security.org/how-secure-is-my-password/')
     WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input#password"))).send_keys("lordkoda")
     print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.result p.result__text.result__time"))).get_attribute("innerHTML"))
  • Using XPATH and the text attribute:

     driver.get('https://www.security.org/how-secure-is-my-password/')
     WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//input[@id='password']"))).send_keys("lordkoda")
     print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='result']//p[@class='result__text result__time']"))).text)
  • Console output:

     5 seconds
  • Note: You have to add the following imports:

     from selenium.webdriver.support.ui import WebDriverWait
     from selenium.webdriver.common.by import By
     from selenium.webdriver.support import expected_conditions as EC
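Putting the answers together, the question's loop might look like the sketch below. It is untested against the live page and rests on several assumptions: the helper names read_words, write_rows, crack_times and main are hypothetical stand-ins for the asker's readFile and writeFile, the selectors are the ones quoted in the answers above, and chromedriver is assumed to be discoverable by Selenium. It uses find_element-style locators through By, since the find_element_by_* helpers from the question were removed in newer Selenium releases.

```python
import csv
import time


def read_words(path):
    """Return the non-empty lines of a UTF-8 text file, one word per line."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def write_rows(path, rows):
    """Write (word, time) pairs as CSV, preceded by a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["word", "time"])
        writer.writerows(rows)


def crack_times(words, timeout=20):
    """Yield (word, displayed crack time) pairs read from the page."""
    # Imported here so the file helpers above work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: chromedriver can be located
    try:
        driver.get("https://www.security.org/how-secure-is-my-password/")
        wait = WebDriverWait(driver, timeout)
        for word in words:
            bar = wait.until(EC.element_to_be_clickable((By.ID, "password")))
            bar.clear()          # clear the previous word first
            bar.send_keys(word)  # then type the new one
            result = wait.until(EC.visibility_of_element_located(
                (By.CSS_SELECTOR, "div.result p.result__text.result__time")))
            yield word, result.text  # the text, not the class attribute
    finally:
        driver.quit()


def main():
    stamp = time.strftime("%Y-%m-%dT%H-%M-%S")
    words = read_words("spanish_dictionary.txt")
    write_rows("results_" + stamp + ".txt", crack_times(words))
```

Calling main() runs the whole pipeline. Note that the field is cleared before typing rather than after, so the result is read while the word is still in the input. The page may update its result asynchronously, so a short pause or an extra check before reading the text could still be needed.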
