How to scrape elements in Selenium/Python by calling different CSS selectors at the same time?
I am trying to select the titles of posts loaded in a webpage by combining multiple CSS selectors. See my process below:
Load the relevant libraries:
import time
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
Then load the content I wish to analyse:
options = Options()
options.set_preference("dom.push.enabled", False)
browser = webdriver.Firefox(options=options)
browser.get("https://medium.com/search")
browser.find_element_by_xpath("//input[@type='search']").send_keys("international development",Keys.ENTER)
time.sleep(5)
scrolls = 2
while True:
    scrolls -= 1
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(5)
    if scrolls < 0:
        break
Then, to get the content for each selector separately, call find_elements_by_css_selector:
titles=browser.find_elements_by_css_selector("h3[class^='graf']")
TitlesList = []
for names in titles:
    TitlesList.append(names.text)
times=browser.find_elements_by_css_selector("time[datetime^='2016']")
Times = []
for names in times:
    Times.append(names.text)
It all works so far. Now I am trying to bring them together, with the aim of selecting only the entries from 2016:
choices = browser.find_elements_by_css_selector("time[datetime^='2016'] and h3[class^='graf']")
browser.quit()
With this last snippet, I always get an empty list.
So I wonder: 1) how can I select multiple elements by using different CSS selectors as simultaneous conditions; 2) would the syntax for combining multiple conditions be the same when mixing approaches such as CSS selectors and XPath; and 3) is there a way to get the text of elements identified through multiple CSS selectors, along the lines of the following:
[pair.text for pair in browser.find_elements_by_css_selector("h3[class^='graf']") if pair.text]
Thanks
Firstly, I think what you're trying to do is get the title of every post that was published in 2016, right?
You're using the CSS selector "time[datetime^='2016'] and h3[class^='graf']", but this will not work because its syntax is not valid ("and" is not part of CSS selector syntax). Plus, these are 2 different elements, and a CSS selector matches one element at a time, so it cannot require that a time and an h3 both match. In your case, to add a condition from another element, go through a common element, such as a shared parent.
I've checked the site; here's the HTML you need to look at (if you're trying to get the titles published in 2016). This is the minimal HTML fragment that helps you identify what you need to get:
<div class="postArticle postArticle--short js-postArticle js-trackPostPresentation" data-post-id="d17220aecaa8"
data-source="search_post---------2">
<div class="u-clearfix u-marginBottom15 u-paddingTop5">
<div class="postMetaInline u-floatLeft u-sm-maxWidthFullWidth">
<div class="u-flexCenter">
<div class="postMetaInline postMetaInline-authorLockup ui-captionStrong u-flex1 u-noWrapWithEllipsis">
<div
class="ui-caption u-fontSize12 u-baseColor--textNormal u-textColorNormal js-postMetaInlineSupplemental">
<a class="link link--darken"
href="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
data-action="open-post"
data-action-value="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
data-action-source="preview-listing">
<time datetime="2016-09-05T13:55:05.811Z">Sep 5, 2016</time>
</a>
</div>
</div>
</div>
</div>
</div>
<div class="postArticle-content">
<a href="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
data-action="open-post" data-action-source="search_post---------2"
data-action-value="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
data-action-index="2" data-post-id="d17220aecaa8">
<section class="section section--body section--first section--last">
<div class="section-divider">
<hr class="section-divider">
</div>
<div class="section-content">
<div class="section-inner sectionLayout--insetColumn">
<h3 name="5910" id="5910" class="graf graf--h3 graf--leading graf--title">Reimagining
International Development for the 21st Century.</h3>
</div>
</div>
</section>
</a>
</div>
</div>
Both time and h3 are inside a big div with the class postArticle. The article contains the publication time and the title, so it makes sense to get the whole article div for each post published in 2016, right?
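To make the "common parent" idea concrete without a browser, here is a minimal sketch that uses only the standard library. It assumes a trimmed, well-formed stand-in for the page (the SAMPLE string, the second article, and the helper name titles_published_in are all made up for illustration) and pairs each title with its date by walking through the shared article div:

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical stand-in for the search results page: two articles,
# only one of which was published in 2016. The second article is invented.
SAMPLE = """
<root>
  <div class="postArticle postArticle--short">
    <time datetime="2016-09-05T13:55:05.811Z">Sep 5, 2016</time>
    <h3 class="graf graf--h3 graf--title">Reimagining International Development for the 21st Century.</h3>
  </div>
  <div class="postArticle postArticle--short">
    <time datetime="2018-01-10T08:00:00.000Z">Jan 10, 2018</time>
    <h3 class="graf graf--h3 graf--title">A later post</h3>
  </div>
</root>
"""


def titles_published_in(xml_text, year):
    """Return h3 titles of article divs whose <time> datetime starts with `year`."""
    root = ET.fromstring(xml_text.strip())
    titles = []
    for div in root.iter("div"):
        if "postArticle" not in div.get("class", "").split():
            continue  # not an article container
        time_el = div.find(".//time")
        title_el = div.find(".//h3")
        if time_el is None or title_el is None:
            continue  # article without a date or a title
        if time_el.get("datetime", "").startswith(year):
            titles.append("".join(title_el.itertext()).strip())
    return titles


print(titles_published_in(SAMPLE, "2016"))
# ['Reimagining International Development for the 21st Century.']
```

The real page is HTML rather than strict XML, so this only illustrates the logic: the date condition and the title lookup both start from the same article element, which is what keeps them paired.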
Using XPath is much more powerful and easier to write:
A div whose class contains postArticle--short: article_xpath = '//div[contains(@class, "postArticle--short")]'
A time tag whose datetime attribute contains 2016: //time[contains(@datetime, "2016")]
Let's combine both of them. I want to get each article div that contains a time tag whose datetime attribute contains 2016:
article_2016_xpath = '//div[contains(@class, "postArticle--short")][.//time[contains(@datetime, "2016")]]'
# "browser" is the webdriver instance created in the question
article_element_list = browser.find_elements_by_xpath(article_2016_xpath)
# now let's get the title of each matching article
for article in article_element_list:
    title = article.find_element_by_tag_name("h3").text
    print(title)
I haven't tested the code yet, only the XPath. You might need to adapt the code to work on your side.
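If you are on a recent Selenium 4 release, the find_element_by_... helpers are no longer available, and the generic find_elements(By.XPATH, ...) form is the one to use (it also works on Selenium 3). Below is a rough sketch of the same idea in that style. The function names build_article_xpath and titles_for_year are made up for illustration, and I swapped contains for starts-with so that it mirrors the ^= in your CSS selector; like the code above, it has not been run against the live site.

```python
def build_article_xpath(year):
    """XPath for article divs that contain a <time> whose datetime starts with `year`."""
    return (
        '//div[contains(@class, "postArticle--short")]'
        f'[.//time[starts-with(@datetime, "{year}")]]'
    )


def titles_for_year(browser, year):
    """Collect non-empty h3 titles from the articles matched by the XPath above.

    `browser` is assumed to be an already opened Selenium webdriver instance.
    """
    # Imported here so the XPath helper can be used without Selenium installed.
    from selenium.webdriver.common.by import By

    titles = []
    for article in browser.find_elements(By.XPATH, build_article_xpath(year)):
        # Searching from the article element keeps title and date paired.
        for h3 in article.find_elements(By.TAG_NAME, "h3"):
            if h3.text:
                titles.append(h3.text)
    return titles


print(build_article_xpath("2016"))
# //div[contains(@class, "postArticle--short")][.//time[starts-with(@datetime, "2016")]]
```

Calling titles_for_year(browser, "2016") after the scrolling loop should then give a plain list of strings, similar to the list comprehension in the question.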
By the way, calling find_element... directly right after loading the page is not a good idea; try using explicit waits instead: https://selenium-python.readthedocs.io/waits.html
This will help you avoid fixed time.sleep waits, improve your app's performance, and handle errors much better.
Only use find_element... when you have already located an element and need to find a child element inside it.
For example, in this case, if I want to find the articles, I will locate them with an explicit wait; then, once an article element is located, I will use find_element... to find its child element h3.
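As a rough sketch of that pattern (untested against the live site, and with hypothetical names ARTICLE_XPATH, wait_for_articles and article_titles), the explicit wait locates the article divs and a plain find_elements call then looks up the h3 inside each one:

```python
ARTICLE_XPATH = '//div[contains(@class, "postArticle--short")]'


def wait_for_articles(browser, timeout=10):
    """Block until at least one article div is present, then return all of them."""
    # Imported lazily so this module can be loaded without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Raises selenium.common.exceptions.TimeoutException if nothing appears in time.
    return WebDriverWait(browser, timeout).until(
        EC.presence_of_all_elements_located((By.XPATH, ARTICLE_XPATH))
    )


def article_titles(browser, timeout=10):
    """Explicit wait for the articles, then a plain lookup of each child h3."""
    from selenium.webdriver.common.by import By

    titles = []
    for article in wait_for_articles(browser, timeout):
        for h3 in article.find_elements(By.TAG_NAME, "h3"):
            if h3.text:
                titles.append(h3.text)
    return titles
```

The 10 second timeout is an arbitrary assumption. Unlike a fixed time.sleep(5), the wait returns as soon as the elements show up and only takes the full timeout when they never appear.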