How to extract text from container elements while iterating over them when scraping with Selenium WebDriver (Python)

Question

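I am trying to scrape http://quotes.toscrape.com/. The page contains several boxes, each holding a quote, the name of the person quoted, and the tags for that quote. This is what I have so far with Selenium WebDriver in Python: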

from time import sleep
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://quotes.toscrape.com/")
sleep(2)
all_boxes = driver.find_elements_by_xpath(r"//div[@class='quote']")
for each in all_boxes:
    print(each.find_element_by_xpath('//span').text)  # to print the quote

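What I am doing here is simple. I select all the boxes on the page, iterate over them, and for each box try to print the quote it contains, using the XPath I worked out from the HTML structure. The output is not what I expected: even though I iterate over every box, each iteration prints the quote from the first box.

The output is: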

 “The world as we have created it is a process of our thinking.It cannot be changed without changing our thinking.” 
 “The world as we have created it is a process of our thinking.It cannot be changed without changing our thinking.” 
 “The world as we have created it is a process of our thinking.It cannot be changed without changing our thinking.” 
 “The world as we have created it is a process of our thinking.It cannot be changed without changing our thinking.” 
 “The world as we have created it is a process of our thinking.It cannot be changed without changing our thinking.” 
 “The world as we have created it is a process of our thinking.It cannot be changed without changing our thinking.” 
 “The world as we have created it is a process of our thinking.It cannot be changed without changing our thinking.” 
 “The world as we have created it is a process of our thinking.It cannot be changed without changing our thinking.” 
 “The world as we have created it is a process of our thinking.It cannot be changed without changing our thinking.” 
 “The world as we have created it is a process of our thinking.It cannot be changed without changing our thinking.”

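I cannot work out what is wrong with this specific approach. Please explain only what is wrong with it; I am already familiar with other ways of scraping with Selenium or BeautifulSoup. I just want to know why the code above does not work.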

Answer 1

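To scrape http://quotes.toscrape.com/ and extract the quotes, build a locator strategy that identifies every quote on the page, then use WebDriverWait to wait until all of those elements are visible and store them in a list. Finally, read the text attribute of each element, as in the following solution:

Code block: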

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\\Utility\\BrowserDrivers\\chromedriver.exe')
driver.get("http://quotes.toscrape.com/")
all_boxes = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='quote']/span[@class='text']")))
for each in all_boxes:
    print(each.text)

Console output:

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”
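A note for readers on a current Selenium release: the find_element_by_* helpers used in this thread belong to the Selenium 3 API and were removed in later Selenium 4 versions, where find_element(by, value) and find_elements(by, value) are used instead. The following is only a sketch of the same extraction in that style, keeping the question's loop over the boxes. The function name quote_texts is made up for illustration, and the plain string "xpath" is used because it is the value of By.XPATH, which keeps the snippet free of imports.

```python
# Sketch only: `quote_texts` is a hypothetical helper, not part of Selenium.
# "xpath" is the string value of selenium.webdriver.common.by.By.XPATH.
XPATH = "xpath"


def quote_texts(driver):
    """Return the quote text of every quote box on the currently loaded page."""
    boxes = driver.find_elements(XPATH, "//div[@class='quote']")
    # The leading "./" keeps each lookup inside the current box.
    return [box.find_element(XPATH, "./span[@class='text']").text for box in boxes]


# Assumed usage with a real browser session (not executed here):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get("http://quotes.toscrape.com/")
#   for text in quote_texts(driver):
#       print(text)
```

Passing the driver in as a parameter is a design choice for this sketch: it lets the function be exercised with any object that offers find_elements, without starting a browser.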

Answer 2

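The problem is the XPath inside your loop. An expression that starts with // is evaluated from the document root even when you call it on an element, so every iteration finds the first span on the page. What you need is a path relative to the element you are currently iterating over, not to the whole document. So instead of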

each.find_element_by_xpath('//span').text

use this:

each.find_element_by_xpath('./span').text
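The difference between the two paths can be reproduced without a browser. The sketch below uses Python's standard xml.etree.ElementTree on a small, made-up stand-in for the page markup (not the real HTML of the site). ElementTree does not accept an absolute //span on an element, so a search from the document root with .//span is used here to imitate what //span does in Selenium; that substitution is an assumption of this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page markup (not the real HTML).
PAGE = """
<html><body>
  <div class="quote"><span class="text">first quote</span><span>by Author A</span></div>
  <div class="quote"><span class="text">second quote</span><span>by Author B</span></div>
  <div class="quote"><span class="text">third quote</span><span>by Author C</span></div>
</body></html>
"""

root = ET.fromstring(PAGE)
boxes = root.findall(".//div[@class='quote']")

# Imitates '//span': the search starts at the document root and ignores
# which box the loop is currently on, so the first span wins every time.
from_document = [root.find(".//span").text for _box in boxes]

# Imitates './span': the search is limited to the children of the current box.
from_box = [box.find("./span").text for box in boxes]

print(from_document)
print(from_box)
```

The first list repeats the same quote once per box, which matches the output in the question; the second list contains one quote per box.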

Solution 1
0 2018-05-18 15:37:07

Solution 2
0 2019-11-26 14:22:20
