Get text out of tags with Python and Selenium

I have been trying to scrape a webpage with Python and Selenium and ran into this problem. The webpage shows its information in a paginated table, and I want to get the information from all pages. This is the HTML for the pagination system when I'm on a page that is not the last one (page 2 in this case):

<span class="pagelinks">
   " ["
   <a href="?page=1">First</a>
   "/"
   <a href="?page=2">Previous</a>
   "] "
   <a href="?page=1" title="Go to page 1">1</a>
   ", "
   <strong>2</strong>
   ", "
   <a href="?page=3" title="Go to page 3">3</a>
   " ["
   <a href="?page=3">Next</a>
   "/"
   <a href="?page=3">Last</a>
   "] "
</span>

And this is the HTML I get when I reach the last page (page 3 in this case):

<span class="pagelinks">
   " ["
   <a href="?page=1">First</a>
   "/"
   <a href="?page=2">Previous</a>
   "] "
   <a href="?page=1" title="Go to page 1">1</a>
   ", "
   <a href="?page=2" title="Go to page 2">2</a>
   ", "
   <strong>3</strong>
   " [Next/Last]"
</span>

In this case, page 3 is selected and appears as <strong>, but this changes depending on the current page.

To check whether I'm on the last page, I want to test whether the text "[Next/Last]" is the next text after the <strong> tag, so that I can stop the while loop that retrieves the information. But since this text is outside any tag, I haven't found a way to check it. How can I check it?

We can look for an <a> element with an href attribute and "Next" as its text content. On the last page "Next" is plain text rather than a link, so no such element exists there. The same can be done for the "Last" text.

With Selenium and Python you can simply use this check:

from selenium.webdriver.common.by import By

if driver.find_elements(By.XPATH, "//span[@class='pagelinks']//a[@href][contains(text(),'Next')]"):
    # Do what you need to do while still not on the last
    # page. Otherwise, this block will be skipped.
    pass
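
The check above can be folded into the loop the question describes. The following is only a sketch under a few assumptions: `scrape_all_pages` and `scrape_page` are hypothetical names (the latter stands for whatever code reads the table on the current page), clicking the "Next" link is assumed to load the next page before the loop continues, and the literal "xpath" is used in place of `By.XPATH` (it is the value of that constant) so the snippet does not need a Selenium import.

```python
# XPath from the answer: a "Next" link inside the pagination span.
NEXT_LINK_XPATH = "//span[@class='pagelinks']//a[@href][contains(text(),'Next')]"


def scrape_all_pages(driver, scrape_page):
    """Call scrape_page(driver) on each page until "Next" is no longer a link.

    scrape_page is a hypothetical callback supplied by the caller; it is
    not part of Selenium.
    """
    results = []
    while True:
        results.append(scrape_page(driver))
        # find_elements returns an empty list instead of raising when
        # nothing matches, so it is safe to use as a presence check.
        next_links = driver.find_elements("xpath", NEXT_LINK_XPATH)
        if not next_links:
            break  # last page: "Next" is plain text, not an <a>
        next_links[0].click()
    return results
```

On a slow or JavaScript-driven site an explicit wait after the click may be needed before scraping the next page; that depends on the page and is not shown here.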
