Clicking multiple links with Selenium in Python

Question

I am trying to scrape data from a structure like the following:

<div class = "tables">
        <div class = "table1">
            <div class = "row">
                <div class = 'data'>Useful Data</div>
                <a href = "url1">
            </div>
            <div class = "row">
                <div class = 'data'>Useful Data</div>
                <a href = "url1">
            </div>
        </div>
        
        <div class = "table2">
            <div class = "row">
                <div class = 'data'>Useful Data</div>
                <a href = "url3">
            </div>
            <div class = "row">
                <div class = 'data'>Useful Data</div>
                <a href = "url4">
            </div>
        </div>
     </div>

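The data I want is in the "data" divs and on other pages that are reachable by clicking the URLs. I iterate over the "tables" div with BeautifulSoup and try to click the links with Selenium, like this: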

tables = soup.find_all('div', class_ = 'tables')
for line in tables:
    row = line.find_all('div', class_ = "row")
    for element in row:
        link = driver.find_element_by_xpath('//a[contains(@href,"href")]')
        #some code

In my script, this line

link = driver.find_element_by_xpath('//a[contains(@href,"href")]')

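always returns the first URL, whereas I would like it to "follow" BeautifulSoup and return each subsequent href. Is there a way to change the href in the XPath based on the URL found in the page source? I should add that all my URLs are very similar except for the last part (for example: url1 = questions/ask/1000, url2 = questions/ask/1001).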

I also tried to find every href on the page so that I could iterate over them:

links = self.driver.find_element_by_xpath('//a[@href]')

But that did not work either. Since the page contains many links that are useless to me, I am not sure this is the best way to go.

Answer 1

This seems a bit complicated. Why not extract the href directly with BeautifulSoup?

for a in soup.select('.tables a[href]'):
    link = a['href']
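
If clicking through Selenium is still required, one possible approach is to build a separate XPath for each extracted href instead of using the literal string "href", which is why the original locator always matched the first link. The following is only a minimal sketch: `xpath_for_href` is a hypothetical helper, not part of Selenium or BeautifulSoup, and it assumes the href values contain no double quotes.

```python
def xpath_for_href(href: str) -> str:
    """Build an XPath matching <a> elements whose href attribute equals `href`."""
    # Assumption: hrefs never contain a double quote, so simple quoting is enough.
    if '"' in href:
        raise ValueError("href contains a double quote; this sketch cannot quote it")
    return f'//a[@href="{href}"]'


# Example hrefs taken from the question's description of its URLs.
hrefs = ["questions/ask/1000", "questions/ask/1001"]
xpaths = [xpath_for_href(h) for h in hrefs]

for xpath in xpaths:
    print(xpath)
    # With Selenium 4 each XPath could then be used as:
    # driver.find_element(By.XPATH, xpath).click()
```

Note that clicking a link navigates away from the page, so elements located earlier go stale; each link would have to be located again after returning to the listing page.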

You can also modify it to prepend baseUrl to each href and store the results in a list to iterate over:

urls = [baseUrl+a['href'] for a in soup.select('.tables a[href]')]
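
Plain string concatenation works when every href starts with a slash. If the page mixes relative and absolute links, `urllib.parse.urljoin` from the standard library may be a safer way to build the list. This is a sketch with made-up href values, not the asker's real data.

```python
from urllib.parse import urljoin

# Hypothetical inputs: a base URL plus root-relative, relative and absolute hrefs.
base_url = "http://www.example.com/"
hrefs = ["/url1", "url3", "http://other.example.com/url4"]

# urljoin resolves each href against the base; absolute URLs are kept unchanged.
urls = [urljoin(base_url, href) for href in hrefs]

for url in urls:
    print(url)
```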

Example

from bs4 import BeautifulSoup

baseUrl = 'http://www.example.com'

html='''
<div class = "tables">
        <div class = "table1">
            <div class = "row">
                <div class = 'data'>Useful Data</div>
                <a href = "/url1">
            </div>
            <div class = "row">
                <div class = 'data'>Useful Data</div>
                <a href = "/url1">
            </div>
        </div>

        <div class = "table2">
            <div class = "row">
                <div class = 'data'>Useful Data</div>
                <a href = "/url3">
            </div>
            <div class = "row">
                <div class = 'data'>Useful Data</div>
                <a href = "/url4">
            </div>
        </div>
     </div>'''
soup = BeautifulSoup(html,'lxml')

urls = [baseUrl+a['href'] for a in soup.select('.tables a[href]')]

for url in urls:
    print(url)  # or request the page, etc.

Output

http://www.example.com/url1
http://www.example.com/url1
http://www.example.com/url3
http://www.example.com/url4
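
Once the URLs are in a list, one way to reach the linked pages is to load each URL directly rather than clicking its link, which avoids having to locate the links again after every navigation. The sketch below assumes `driver` is a Selenium WebDriver (or any object with a `get` method and a `page_source` attribute); `collect_pages` is a hypothetical helper name.

```python
def collect_pages(driver, urls):
    """Load each URL in turn and return a list of (url, page_source) pairs."""
    pages = []
    for url in urls:
        driver.get(url)  # navigate straight to the page, no click needed
        pages.append((url, driver.page_source))
    return pages


# Possible usage with a real browser (assumes Selenium and a driver are installed):
# from selenium import webdriver
# driver = webdriver.Chrome()
# for url, source in collect_pages(driver, urls):
#     soup = BeautifulSoup(source, 'lxml')
#     ...  # extract the extra data from each page
# driver.quit()
```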

Answer 1
0 2022-01-09 18:38:42
