
How to scrape elements in Selenium/Python by calling different css selectors at the same time?

I am trying to select the titles of posts loaded in a webpage by combining multiple CSS selectors. My process is below:

Load relevant libraries

import time
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

Then load the content I wish to analyse

options = Options()
options.set_preference("dom.push.enabled", False)
browser = webdriver.Firefox(options=options)

browser.get("https://medium.com/search")
browser.find_element_by_xpath("//input[@type='search']").send_keys("international development",Keys.ENTER)
time.sleep(5)

scrolls = 2
while True:
    scrolls -= 1
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(5)
    if scrolls < 0:
        break

Then, to get the content for each selector separately, I call find_elements_by_css_selector:

titles=browser.find_elements_by_css_selector("h3[class^='graf']")
TitlesList = []
for names in titles:
    TitlesList.append(names.text)

times=browser.find_elements_by_css_selector("time[datetime^='2016']")
Times = []
for names in times:
    Times.append(names.text)

It all works so far. Now I am trying to bring the two together, with the aim of identifying only the entries from 2016:

choices = browser.find_elements_by_css_selector("time[datetime^='2016'] and h3[class^='graf']")    
browser.quit()

On this last snippet, I always get an empty list.

So I wonder: 1) how can I select elements by using several CSS selectors as simultaneous conditions; 2) would the syntax for combining conditions be the same when mixing approaches such as css_selector and XPath; and 3) is there a way to get the text of elements identified through multiple CSS selectors, along the lines of the following:

[pair.text for pair in browser.find_elements_by_css_selector("h3[class^='graf']") if pair.text]

Thanks

First, I think you're trying to get every title whose post was published in 2016, right?

You're using the CSS selector "time[datetime^='2016'] and h3[class^='graf']" , but this will not work because the syntax is invalid (CSS has no and keyword). Besides, time and h3 are two different elements, and a selector like this cannot match both at once. To add a condition that depends on another element, go through something the two have in common, such as a parent element.

I've checked the site; here's the HTML you need to look at (assuming you want the titles of posts published in 2016). This is the minimal part of the HTML that identifies what you need.

<div class="postArticle postArticle--short js-postArticle js-trackPostPresentation" data-post-id="d17220aecaa8"
    data-source="search_post---------2">
    <div class="u-clearfix u-marginBottom15 u-paddingTop5">
        <div class="postMetaInline u-floatLeft u-sm-maxWidthFullWidth">
            <div class="u-flexCenter">
                <div class="postMetaInline postMetaInline-authorLockup ui-captionStrong u-flex1 u-noWrapWithEllipsis">
                    <div
                        class="ui-caption u-fontSize12 u-baseColor--textNormal u-textColorNormal js-postMetaInlineSupplemental">
                        <a class="link link--darken"
                            href="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
                            data-action="open-post"
                            data-action-value="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
                            data-action-source="preview-listing">
                            <time datetime="2016-09-05T13:55:05.811Z">Sep 5, 2016</time>
                        </a>
                    </div>
                </div>
            </div>
        </div>
    </div>
    <div class="postArticle-content">
        <a href="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
            data-action="open-post" data-action-source="search_post---------2"
            data-action-value="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
            data-action-index="2" data-post-id="d17220aecaa8">
            <section class="section section--body section--first section--last">
                <div class="section-divider">
                    <hr class="section-divider">
                </div>
                <div class="section-content">
                    <div class="section-inner sectionLayout--insetColumn">
                        <h3 name="5910" id="5910" class="graf graf--h3 graf--leading graf--title">Reimagining
                            International Development for the 21st&nbsp;Century.</h3>
                    </div>
                </div>
            </section>
        </a>
    </div>
</div>

Both time and h3 sit inside one big div with the class postArticle . That article div contains both the publication time and the title, so it makes sense to select the whole article div for posts published in 2016, right?
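
The same idea can also be written with CSS selectors only, by scoping both lookups to each article div. This is a minimal, untested-against-the-site sketch that assumes the markup above. The function name titles_for_year is hypothetical, and the locator strategy is passed as the plain string "css selector", which is the value of Selenium's By.CSS_SELECTOR constant:

```python
def titles_for_year(driver, year):
    """Return the titles of all loaded posts whose <time> starts with `year`.

    `driver` is a Selenium WebDriver (the question calls it `browser`).
    """
    css = "css selector"  # same value as selenium's By.CSS_SELECTOR
    titles = []
    # Each search result lives in its own div.postArticle container.
    for article in driver.find_elements(css, "div.postArticle"):
        # Skip the article unless it contains a <time> from the wanted year.
        if not article.find_elements(css, f"time[datetime^='{year}']"):
            continue
        # Collect the non-empty heading text inside the same container.
        for heading in article.find_elements(css, "h3[class^='graf']"):
            if heading.text:
                titles.append(heading.text)
    return titles
```

Calling titles_for_year(browser, 2016) after the scrolling loop would then return only the titles of 2016 posts. Each article costs two extra round trips to the browser, which is fine for a few dozen results.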

XPath is more powerful here and easier to write:

  • This gets every article div whose class attribute contains postArticle--short : article_xpath = '//div[contains(@class, "postArticle--short")]'
  • This gets every time tag whose datetime attribute contains 2016 : //time[contains(@datetime, "2016")]

Let's combine the two. I want every article div that contains a time tag whose datetime attribute contains 2016 :

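# By is already imported in the question: from selenium.webdriver.common.by import By
article_2016_xpath = '//div[contains(@class, "postArticle--short")][.//time[contains(@datetime, "2016")]]'
# "browser" is the WebDriver instance created in the question
article_element_list = browser.find_elements(By.XPATH, article_2016_xpath)

# now let's get the title of each 2016 article
titles_2016 = []
for article in article_element_list:
    title = article.find_element(By.TAG_NAME, "h3").text
    titles_2016.append(title)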

I haven't tested the code yet, only the xpath. You might need to adapt the code to work on your side.
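
To reuse the expression for other years, the XPath can be built from a parameter. This is only a sketch: the helper name title_xpath_for_year is hypothetical, it swaps contains for starts-with so that only the beginning of the datetime value (the year) is compared, and it appends //h3 so the title elements are selected directly.

```python
def title_xpath_for_year(year):
    """Build an XPath that selects the <h3> titles of posts dated in `year`."""
    year = int(year)  # raises ValueError for input that is not a plain number
    return (
        '//div[contains(@class, "postArticle--short")]'
        f'[.//time[starts-with(@datetime, "{year}-")]]'
        '//h3'
    )
```

Passing title_xpath_for_year(2016) to an XPath lookup on the browser should return the h3 elements themselves, so their text can be collected in a list comprehension like the one in the question. Like the code above, this has not been run against the live site.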


By the way, calling find_element... directly on a page that may still be loading is not a good idea; try using an explicit wait instead: https://selenium-python.readthedocs.io/waits.html

This lets you avoid fixed time.sleep pauses, which speeds up your script, and the timeout gives you a clear error to handle when an element never appears.

Only use find_element... directly when you have already located an element and need a child element inside it. For example, in this case I would locate the articles with an explicit wait and then, once they are present, use find_element... on each one to find its child h3 .
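
That pattern could look roughly like the sketch below. The function name titles_after_wait and the 10-second default timeout are assumptions for illustration; WebDriverWait, expected_conditions and By are the same Selenium helpers the question already imports.

```python
def titles_after_wait(driver, article_xpath, timeout=10):
    """Wait for the articles to appear, then read the <h3> title of each one."""
    # Imported inside the function so this sketch can be loaded on its own;
    # in a real script these would sit with the other imports at the top.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    # Polls until at least one element matches; raises TimeoutException
    # if nothing matches before `timeout` seconds have passed.
    articles = wait.until(
        EC.presence_of_all_elements_located((By.XPATH, article_xpath))
    )
    # The articles are located now, so a plain child lookup is safe here.
    return [article.find_element(By.TAG_NAME, "h3").text for article in articles]
```

Note that the wait returns as soon as the first matching articles exist, so it replaces the sleep after the search, not the scrolling loop that loads more results.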


 