How can I scrape elements in Selenium/Python by using several CSS selectors at the same time?

Question

I am trying to select the titles of posts loaded on a web page by combining multiple CSS selectors. Here is my process:

Load the relevant libraries

import time
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

Then load the content I want to analyse

options = Options()
options.set_preference("dom.push.enabled", False)
browser = webdriver.Firefox(options=options)

browser.get("https://medium.com/search")
browser.find_element_by_xpath("//input[@type='search']").send_keys("international development",Keys.ENTER)
time.sleep(5)

scrolls = 2
while True:
    scrolls -= 1
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(5)
    if scrolls < 0:
        break

Then get the content of each selector separately, calling css_selector

titles=browser.find_elements_by_css_selector("h3[class^='graf']")
TitlesList = []
for names in titles:
    names.text
    TitlesList.append(names.text) 

times=browser.find_elements_by_css_selector("time[datetime^='2016']")
Times = []
for names in times:
    names.text
    Times.append(names.text)

So far everything works. Now I try to combine them, with the goal of selecting only the posts from 2016

choices = browser.find_elements_by_css_selector("time[datetime^='2016'] and h3[class^='graf']")    
browser.quit()

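With this last snippet I always get an empty list.

So I would like to know 1) how I can select multiple elements by treating different css_selector expressions as simultaneous selection criteria, 2) whether the syntax for searching on multiple conditions is the same across different methods, such as css_selector or XPath, and 3) whether there is a way to get the text of the elements identified by several CSS selectors, along these lines: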

[pair.text for pair in browser.find_elements_by_css_selector("h3[class^='graf']") if pair.text]

Thanks

Answer 1

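First, I think what you want is every title that was published in 2016, right?

You are using the CSS selector "time[datetime^='2016'] and h3[class^='graf']", but this does not work because its syntax is invalid (and is not valid in a CSS selector). Also, these are two different elements, and a CSS selector can only target one element. In your case, to add a condition that depends on another element, go through a common element such as the parent.

I checked the site, and this is the HTML you need to look at (if you are after the titles published in 2016). It is the minimal piece of HTML that helps you identify what you need to get.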

<div class="postArticle postArticle--short js-postArticle js-trackPostPresentation" data-post-id="d17220aecaa8"
    data-source="search_post---------2">
    <div class="u-clearfix u-marginBottom15 u-paddingTop5">
        <div class="postMetaInline u-floatLeft u-sm-maxWidthFullWidth">
            <div class="u-flexCenter">
                <div class="postMetaInline postMetaInline-authorLockup ui-captionStrong u-flex1 u-noWrapWithEllipsis">
                    <div
                        class="ui-caption u-fontSize12 u-baseColor--textNormal u-textColorNormal js-postMetaInlineSupplemental">
                        <a class="link link--darken"
                            href="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
                            data-action="open-post"
                            data-action-value="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
                            data-action-source="preview-listing">
                            <time datetime="2016-09-05T13:55:05.811Z">Sep 5, 2016</time>
                        </a>
                    </div>
                </div>
            </div>
        </div>
    </div>
    <div class="postArticle-content">
        <a href="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
            data-action="open-post" data-action-source="search_post---------2"
            data-action-value="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
            data-action-index="2" data-post-id="d17220aecaa8">
            <section class="section section--body section--first section--last">
                <div class="section-divider">
                    <hr class="section-divider">
                </div>
                <div class="section-content">
                    <div class="section-inner sectionLayout--insetColumn">
                        <h3 name="5910" id="5910" class="graf graf--h3 graf--leading graf--title">Reimagining
                            International Development for the 21st&nbsp;Century.</h3>
                    </div>
                </div>
            </section>
        </a>
    </div>
</div>

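Both the time and the h3 sit inside one big div whose class is postArticle. The article contains both the publication time and the title, so does it make sense to get the whole article div for posts published in 2016?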
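If you prefer to stay with CSS selectors, one way to apply this "common parent" idea is to select the article containers first and then check each one for a matching time element. This is only a sketch that assumes the markup shown above: `titles_for_year` is a hypothetical helper, and the string `"css selector"` is the value behind `By.CSS_SELECTOR`, used here so the snippet does not need Selenium imported.

```python
def titles_for_year(browser, year):
    """Return titles of articles whose <time datetime> starts with `year`.

    `browser` is assumed to be a Selenium WebDriver (or anything exposing
    find_elements(by, value)); the class names come from the HTML above.
    """
    css = "css selector"  # same string as selenium's By.CSS_SELECTOR
    titles = []
    for article in browser.find_elements(css, "div.postArticle"):
        # Condition on one element: is there a matching <time> in this article?
        if not article.find_elements(css, f"time[datetime^='{year}']"):
            continue
        # Result from another element: the <h3> title in the same article.
        for h3 in article.find_elements(css, "h3[class^='graf']"):
            if h3.text:
                titles.append(h3.text)
    return titles
```

Calling `titles_for_year(browser, "2016")` after the page has been scrolled would return a list of title strings. The filtering happens in Python rather than in the selector, which costs one extra lookup per article.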

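XPath is more powerful and easier to write for this:

This gets all article divs whose class contains postArticle--short: article_xpath = '//div[contains(@class, "postArticle--short")]'
This gets all time tags whose datetime attribute contains 2016: //time[contains(@datetime, "2016")]

Let's combine them. I want the article divs that contain a time tag whose datetime attribute contains 2016: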

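article_2016_xpath = '//div[contains(@class, "postArticle--short")][.//time[contains(@datetime, "2016")]]'
# find_elements(By.XPATH, ...) works in Selenium 3 and 4; the find_elements_by_* helpers were removed in Selenium 4.3
article_element_list = browser.find_elements(By.XPATH, article_2016_xpath)

# now let's get the title of each article
titles_2016 = []
for article in article_element_list:
    title = article.find_element(By.TAG_NAME, "h3").text
    titles_2016.append(title)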

I have not tested the code, only the XPath. You may need to adjust the code to get it working on your side.
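For the third part of the question (getting the text of both elements), the same XPath idea can be reused and each article read for its title and its date. This is a minimal, untested-against-the-live-site sketch under the same assumptions about the markup. `article_xpath` and `title_date_pairs` are hypothetical names, `starts-with()` is swapped in for `contains()` as the closer XPath 1.0 counterpart of the CSS `^=` operator, and `"xpath"` and `"tag name"` are the string values of `By.XPATH` and `By.TAG_NAME`.

```python
def article_xpath(year):
    """XPath for article divs containing a <time> whose datetime starts with `year`."""
    return (
        '//div[contains(@class, "postArticle--short")]'
        f'[.//time[starts-with(@datetime, "{year}")]]'
    )


def title_date_pairs(browser, year):
    """Return (title, date text) tuples for articles published in `year`."""
    pairs = []
    for article in browser.find_elements("xpath", article_xpath(year)):
        # Both lookups are scoped to this article, so the title and the
        # date are guaranteed to belong to the same post.
        title = article.find_element("tag name", "h3").text
        date = article.find_element("tag name", "time").text
        if title:
            pairs.append((title, date))
    return pairs
```

Because the condition lives in the XPath predicate, the browser does the filtering and the Python loop only reads text from articles that already match.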

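By the way, relying on find_element... alone is not a good idea; try using explicit waits: https://selenium-python.readthedocs.io/waits.html

This lets you avoid clumsy time.sleep waits, improves your application's performance, and lets you handle errors properly.

Use find_element... only when you have already found an element and need to find a child element inside it. In this example, to find the articles I would locate them with an explicit wait, and once they are found I would use find_element... to find the child h3.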
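As a sketch of that pattern, using the explicit-wait API from the Selenium documentation linked above: wait for the article divs, then use plain lookups inside each one. `wait_for_titles` and the 10-second timeout are illustrative assumptions, and the Selenium imports sit inside the function only so that the snippet loads on machines where Selenium is not installed.

```python
ARTICLE_2016_XPATH = (
    '//div[contains(@class, "postArticle--short")]'
    '[.//time[contains(@datetime, "2016")]]'
)


def wait_for_titles(browser, timeout=10):
    """Wait until matching articles are present, then return their titles."""
    # Imported here so this module can be loaded without Selenium installed.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    try:
        # Polls until at least one matching element exists or the timeout elapses.
        articles = WebDriverWait(browser, timeout).until(
            EC.presence_of_all_elements_located((By.XPATH, ARTICLE_2016_XPATH))
        )
    except TimeoutException:
        return []  # nothing matched in time; handle as fits your application

    # Child lookups on elements that have already been found.
    return [a.find_element(By.TAG_NAME, "h3").text for a in articles]
```

The wait returns as soon as the first matching articles appear, so it usually finishes sooner than a fixed sleep, and a page with no 2016 posts ends in a handled timeout rather than a silent empty result.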

Solution 1
2 2021-04-08 08:19:36
