How to scrape elements in Selenium/Python by calling different CSS selectors at the same time?

Question

I am trying to select the titles of posts loaded in a webpage by combining multiple CSS selectors. My process is below:

Load relevant libraries

import time
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

Then load the content I wish to analyse

options = Options()
options.set_preference("dom.push.enabled", False)
browser = webdriver.Firefox(options=options)

browser.get("https://medium.com/search")
browser.find_element_by_xpath("//input[@type='search']").send_keys("international development",Keys.ENTER)
time.sleep(5)

scrolls = 2
while True:
    scrolls -= 1
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(5)
    if scrolls < 0:
        break

Then, to get the content for each selector separately, call find_elements_by_css_selector

titles=browser.find_elements_by_css_selector("h3[class^='graf']")
TitlesList = []
for names in titles:
    names.text
    TitlesList.append(names.text) 

times=browser.find_elements_by_css_selector("time[datetime^='2016']")
Times = []
for names in times:
    names.text
    Times.append(names.text)

It all works so far. Now I am trying to bring them together, with the aim of identifying only the titles from 2016

choices = browser.find_elements_by_css_selector("time[datetime^='2016'] and h3[class^='graf']")    
browser.quit()

On this last snippet, I always get an empty list.

So I wonder:

1) How can I select multiple elements by using different CSS selectors as conditions for the selection at the same time?
2) Would the syntax for finding elements under multiple conditions be the same when linking elements through different approaches, such as CSS selectors and XPaths?
3) Is there a way to get the text of elements identified by multiple CSS selectors, along the lines of the following?

[pair.text for pair in browser.find_elements_by_css_selector("h3[class^='graf']") if pair.text]

Thanks

Answer 1

Firstly, I think what you're trying to do is get every title that was posted in 2016, right?

You're using the CSS selector "time[datetime^='2016'] and h3[class^='graf']", but this will not work because CSS has no and operator. The spaces are read as descendant combinators, so the selector looks for an h3 inside an <and> element inside a time element. Nothing on the page matches that, which is why you get an empty list rather than an error. Plus, these are 2 different elements, and every match of a CSS selector is a single element that has to satisfy the whole selector. In your case, to add a condition from another element, go through a common element such as a shared parent.
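As a rough illustration of what CSS does offer in place of and, the sketch below builds three kinds of selector as plain strings: several conditions on one element, a union of selectors, and a descendant relationship. The helper names (css_all_of, css_any_of, css_inside) are made up for this example and are not part of Selenium; the resulting strings are what you would pass to a CSS selector lookup.

```python
def css_all_of(tag, *attribute_tests):
    """AND on ONE element: chain attribute tests with nothing in between."""
    return tag + "".join(f"[{test}]" for test in attribute_tests)


def css_any_of(*selectors):
    """OR: a comma-separated list matches elements that satisfy any selector."""
    return ", ".join(selectors)


def css_inside(ancestor, descendant):
    """Descendant combinator: a space means 'somewhere inside'."""
    return f"{ancestor} {descendant}"


# An h3 whose class starts with "graf" AND which has a name attribute.
both_conditions = css_all_of("h3", "class^='graf'", "name")

# Every 2016 <time> element plus every title, mixed in document order.
either_element = css_any_of("time[datetime^='2016']", "h3[class^='graf']")

# Titles that sit somewhere inside an article container.
title_in_article = css_inside("div.postArticle", "h3[class^='graf']")

print(both_conditions)
print(either_element)
print(title_in_article)
```

The union returns the 2016 time elements and every title, not only the titles from 2016, so a condition that spans two elements still has to go through a shared container.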

I've checked the site; here's the HTML you need to look at (if you're trying to get the titles published in 2016). This is the minimal part of the HTML that identifies what you need to get.

<div class="postArticle postArticle--short js-postArticle js-trackPostPresentation" data-post-id="d17220aecaa8"
    data-source="search_post---------2">
    <div class="u-clearfix u-marginBottom15 u-paddingTop5">
        <div class="postMetaInline u-floatLeft u-sm-maxWidthFullWidth">
            <div class="u-flexCenter">
                <div class="postMetaInline postMetaInline-authorLockup ui-captionStrong u-flex1 u-noWrapWithEllipsis">
                    <div
                        class="ui-caption u-fontSize12 u-baseColor--textNormal u-textColorNormal js-postMetaInlineSupplemental">
                        <a class="link link--darken"
                            href="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
                            data-action="open-post"
                            data-action-value="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
                            data-action-source="preview-listing">
                            <time datetime="2016-09-05T13:55:05.811Z">Sep 5, 2016</time>
                        </a>
                    </div>
                </div>
            </div>
        </div>
    </div>
    <div class="postArticle-content">
        <a href="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
            data-action="open-post" data-action-source="search_post---------2"
            data-action-value="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
            data-action-index="2" data-post-id="d17220aecaa8">
            <section class="section section--body section--first section--last">
                <div class="section-divider">
                    <hr class="section-divider">
                </div>
                <div class="section-content">
                    <div class="section-inner sectionLayout--insetColumn">
                        <h3 name="5910" id="5910" class="graf graf--h3 graf--leading graf--title">Reimagining
                            International Development for the 21st&nbsp;Century.</h3>
                    </div>
                </div>
            </section>
        </a>
    </div>
</div>

Both time and h3 sit inside a big div with the class postArticle. The article contains the publication time and the title, so it makes sense to get the whole article div that was published in 2016, right?

XPath is more powerful here, because it can select an element based on what it contains, and it is easier to write:

This will get all article divs whose class attribute contains postArticle--short: article_xpath = '//div[contains(@class, "postArticle--short")]'
This will get all time tags whose datetime attribute contains 2016: //time[contains(@datetime, "2016")]

Let's combine both of them. I want the article divs that contain a time tag whose datetime attribute contains 2016:

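article_2016_xpath = '//div[contains(@class, "postArticle--short")][.//time[contains(@datetime, "2016")]]'
# `browser` is the webdriver created in the question; `By` is imported there too
article_element_list = browser.find_elements(By.XPATH, article_2016_xpath)

# now let's get the title of each matching article
titles_2016 = []
for article in article_element_list:
    title = article.find_element(By.TAG_NAME, "h3").text
    titles_2016.append(title)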

I haven't tested the code yet, only the XPath. You might need to adapt the code to work on your side.
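On the question of mixing approaches: a lookup made on an element only searches inside that element, so the container can be located with XPath and its children with CSS selectors. Below is a minimal sketch of that idea, assuming the page structure shown in the HTML above. The function name titles_published_in is hypothetical, and the locator strategies are written as the plain strings that Selenium's By.XPATH and By.CSS_SELECTOR stand for, so the function accepts any driver-like object.

```python
XPATH = "xpath"                # same value as selenium's By.XPATH
CSS_SELECTOR = "css selector"  # same value as selenium's By.CSS_SELECTOR


def titles_published_in(driver, year):
    """Return (date_text, title_text) pairs for articles dated in `year`."""
    # starts-with is stricter than contains: it only accepts datetime values
    # that begin with the year, not ones that merely include those digits.
    article_xpath = (
        '//div[contains(@class, "postArticle--short")]'
        f'[.//time[starts-with(@datetime, "{year}")]]'
    )
    pairs = []
    for article in driver.find_elements(XPATH, article_xpath):
        # These two lookups are scoped to the current article only.
        times = article.find_elements(CSS_SELECTOR, f"time[datetime^='{year}']")
        titles = article.find_elements(CSS_SELECTOR, "h3[class^='graf']")
        if times and titles:  # skip articles missing either part
            pairs.append((times[0].text, titles[0].text))
    return pairs


# Usage with the question's driver (assumption: the search page is loaded):
# for date_text, title in titles_published_in(browser, 2016):
#     print(date_text, "-", title)
```

Using find_elements (plural) for the children avoids an exception when an article has no title or date; such articles are simply skipped.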

By the way, calling find_element... directly on the page is not a good idea; try using an explicit wait: https://selenium-python.readthedocs.io/waits.html

This lets you avoid arbitrary time.sleep waits, which improves your script's performance, and it lets you handle errors such as a timeout cleanly.

Only use find_element... when you have already located an element and need to find a child element inside it. For example, in this case, to find the articles I would use an explicit wait; then, once they are located, I would use find_element... to find the child element h3.
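The wait-then-search pattern described above might look roughly like the sketch below. It assumes `wait` is a Selenium WebDriverWait (for example WebDriverWait(browser, 10)), whose until() keeps calling the function it is given with the driver until the result is truthy and raises TimeoutException otherwise. The function names wait_for_2016_articles and titles_of are hypothetical, and "xpath" and "tag name" are the strings behind By.XPATH and By.TAG_NAME.

```python
ARTICLE_2016_XPATH = (
    '//div[contains(@class, "postArticle--short")]'
    '[.//time[contains(@datetime, "2016")]]'
)


def wait_for_2016_articles(wait):
    """Block until at least one matching article exists, then return them all."""
    # find_elements returns an empty (falsy) list while nothing matches,
    # so until() keeps polling; the first non-empty list is returned.
    return wait.until(lambda drv: drv.find_elements("xpath", ARTICLE_2016_XPATH))


def titles_of(articles):
    """Search inside each located article for its h3 title text."""
    return [
        h3.text
        for article in articles
        for h3 in article.find_elements("tag name", "h3")
    ]


# Usage with the question's driver (assumption: the search page is loaded):
# from selenium.webdriver.support.ui import WebDriverWait
# articles = wait_for_2016_articles(WebDriverWait(browser, 10))
# print(titles_of(articles))
```

Passing the wait object in, rather than creating it inside the function, keeps the timeout a decision of the caller.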

Answer 1, posted 2021-04-08 08:19:36
