
How to scrape elements in Selenium/Python by calling different css selectors at the same time?

I am trying to select the titles of posts loaded in a webpage by combining multiple CSS selectors. My process is below:

Load the relevant libraries:

import time
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

Then load the content I wish to analyse:

options = Options()
options.set_preference("dom.push.enabled", False)
browser = webdriver.Firefox(options=options)

browser.get("https://medium.com/search")
browser.find_element_by_xpath("//input[@type='search']").send_keys("international development",Keys.ENTER)
time.sleep(5)

scrolls = 2
while True:
    scrolls -= 1
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(5)
    if scrolls < 0:
        break

Then, to get the content for each selector separately, I call find_elements_by_css_selector:

titles=browser.find_elements_by_css_selector("h3[class^='graf']")
TitlesList = []
for names in titles:
    names.text
    TitlesList.append(names.text) 

times=browser.find_elements_by_css_selector("time[datetime^='2016']")
Times = []
for names in times:
    names.text
    Times.append(names.text) 

It all works so far. Now I am trying to bring them together, with the aim of identifying only the posts from 2016:

choices = browser.find_elements_by_css_selector("time[datetime^='2016'] and h3[class^='graf']")    
browser.quit()

With this last snippet, I always get an empty list.

So I wonder: 1) how can I select multiple elements by using different CSS selectors as simultaneous conditions; 2) whether the syntax for finding elements under multiple conditions is the same when combining different approaches, such as CSS selectors and XPaths; and 3) whether there is a way to get the text of elements identified through multiple CSS selectors, along the lines of the following:

[pair.text for pair in browser.find_elements_by_css_selector("h3[class^='graf']") if pair.text]

Thanks

Firstly, I think what you're trying to do is get every title whose post was published in 2016, right?

You're using the CSS selector "time[datetime^='2016'] and h3[class^='graf']" , but this will not work because the syntax is invalid ( and is not a CSS keyword). Plus, these are two different elements, and a selector like this targets one element at a time, so it cannot require that a time and an h3 both satisfy a condition. In your case, to add a condition from another element, go through a common element such as a shared parent.
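The "common parent" idea can be sketched with CSS selectors alone, doing the combination in Python rather than in the selector. This is a minimal, untested-against-the-live-site sketch: the function name titles_for_year is hypothetical, the class names are taken from the HTML excerpt in this answer, and the locator strategy is passed as the plain string "css selector" (the value of selenium's By.CSS_SELECTOR) so the snippet needs no imports.

```python
CSS = "css selector"  # string value of selenium's By.CSS_SELECTOR


def titles_for_year(driver, year):
    """Return the title text of every article whose <time> datetime starts with `year`.

    `driver` is assumed to be a Selenium WebDriver (or anything exposing
    find_elements(by, value) in the same way).
    """
    titles = []
    # 1. Locate the shared parent of <time> and <h3>.
    for article in driver.find_elements(CSS, "div.postArticle--short"):
        # 2. Condition on one child: is there a <time> from the requested year?
        if not article.find_elements(CSS, f"time[datetime^='{year}']"):
            continue
        # 3. Read the other child: the title heading(s), skipping empty ones.
        for heading in article.find_elements(CSS, "h3[class^='graf']"):
            if heading.text:
                titles.append(heading.text)
    return titles
```

With the question's variable names this would be called as titles_for_year(browser, 2016). It costs one extra lookup per article, which is usually acceptable for a page of search results.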

I've checked the site. Here's the HTML you need to look at if you're trying to get the titles published in 2016. This is the minimal part of the HTML that helps identify what you need.

<div class="postArticle postArticle--short js-postArticle js-trackPostPresentation" data-post-id="d17220aecaa8"
    data-source="search_post---------2">
    <div class="u-clearfix u-marginBottom15 u-paddingTop5">
        <div class="postMetaInline u-floatLeft u-sm-maxWidthFullWidth">
            <div class="u-flexCenter">
                <div class="postMetaInline postMetaInline-authorLockup ui-captionStrong u-flex1 u-noWrapWithEllipsis">
                    <div
                        class="ui-caption u-fontSize12 u-baseColor--textNormal u-textColorNormal js-postMetaInlineSupplemental">
                        <a class="link link--darken"
                            href="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
                            data-action="open-post"
                            data-action-value="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
                            data-action-source="preview-listing">
                            <time datetime="2016-09-05T13:55:05.811Z">Sep 5, 2016</time>
                        </a>
                    </div>
                </div>
            </div>
        </div>
    </div>
    <div class="postArticle-content">
        <a href="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
            data-action="open-post" data-action-source="search_post---------2"
            data-action-value="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
            data-action-index="2" data-post-id="d17220aecaa8">
            <section class="section section--body section--first section--last">
                <div class="section-divider">
                    <hr class="section-divider">
                </div>
                <div class="section-content">
                    <div class="section-inner sectionLayout--insetColumn">
                        <h3 name="5910" id="5910" class="graf graf--h3 graf--leading graf--title">Reimagining
                            International Development for the 21st&nbsp;Century.</h3>
                    </div>
                </div>
            </section>
        </a>
    </div>
</div>

Both time and h3 sit inside one large div with the class postArticle . That div contains both the publication time and the title, so it makes sense to select the whole article div for posts published in 2016, right?

XPath is much more powerful and easier to write here:

  • This gets every article div whose class contains postArticle--short : article_xpath = '//div[contains(@class, "postArticle--short")]'
  • This gets every time tag whose datetime attribute contains 2016 : //time[contains(@datetime, "2016")]

Let's combine the two. I want every article div that contains a time tag whose datetime attribute contains 2016 :

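# By is already imported in the question: from selenium.webdriver.common.by import By
article_2016_xpath = '//div[contains(@class, "postArticle--short")][.//time[contains(@datetime, "2016")]]'
article_element_list = browser.find_elements(By.XPATH, article_2016_xpath)

# now let's get the title of each matching article
titles_2016 = []
for article in article_element_list:
    title = article.find_element(By.TAG_NAME, "h3").text
    titles_2016.append(title)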

I haven't tested the code yet, only the XPath. You might need to adapt the code to work on your side.
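As a possible variation on the combined XPath, the year can be made a parameter, and starts-with can replace contains so that the test mirrors the ^= prefix match in the question's CSS selector. This is only a sketch: article_xpath and article_titles are hypothetical helper names, and the strings "xpath" and "tag name" are the values of selenium's By.XPATH and By.TAG_NAME, used directly so the snippet has no imports.

```python
XPATH = "xpath"        # string value of selenium's By.XPATH
TAG_NAME = "tag name"  # string value of selenium's By.TAG_NAME


def article_xpath(year):
    """XPath for article divs containing a <time> whose datetime starts with `year`."""
    return (
        '//div[contains(@class, "postArticle--short")]'
        f'[.//time[starts-with(@datetime, "{year}")]]'
    )


def article_titles(driver, year):
    """Collect the first <h3> text of each article published in `year`."""
    titles = []
    for article in driver.find_elements(XPATH, article_xpath(year)):
        # find_elements returns an empty list instead of raising when
        # an article happens to have no <h3>.
        headings = article.find_elements(TAG_NAME, "h3")
        if headings:
            titles.append(headings[0].text)
    return titles
```

With the question's variable names, article_titles(browser, 2016) would return the list of 2016 titles, assuming the page structure matches the HTML excerpt above.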


By the way, calling find_element... directly on the page is not a good idea; try an explicit wait instead: https://selenium-python.readthedocs.io/waits.html

This helps you avoid crude time.sleep waits, improves your app's performance, and lets you handle errors cleanly.

Only use find_element... when you have already located an element and need to find a child inside it. For example, in this case I would locate the articles with an explicit wait and then, once they are located, use find_element... to find the child h3 .
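To illustrate what an explicit wait does, here is a hand-rolled polling loop. It is a stand-in written for this answer, not Selenium's implementation: in real code you would normally use WebDriverWait(browser, timeout).until(...) with a condition from expected_conditions, both of which the question already imports. The name wait_for_elements and its parameters are hypothetical.

```python
import time


def wait_for_elements(driver, by, value, timeout=10.0, poll=0.5):
    """Poll driver.find_elements until it returns something or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        found = driver.find_elements(by, value)
        if found:
            # Return as soon as the elements exist instead of sleeping a fixed time.
            return found
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no element matched {value!r} within {timeout}s")
        time.sleep(poll)


# Intended use, assuming `browser` is the WebDriver from the question:
# articles = wait_for_elements(
#     browser, "xpath", '//div[contains(@class, "postArticle--short")]'
# )
# titles = [a.find_element("tag name", "h3").text for a in articles]
```

The difference from time.sleep(5) is that the loop stops as soon as the articles appear and fails loudly if they never do, which is the behaviour the linked explicit-wait documentation describes.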
