
Scraping Blog Post Titles with Selenium - Python

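I am trying to scrape the blog post titles from the following URL using Selenium with Python: https://blog.coinbase.com/tagged/coinbase-pro . When I get the page source with Selenium, it does not contain the blog post titles, but the source Chrome shows when I right-click and select "View page source" does. I am using the following code: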

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)
driver.get("https://blog.coinbase.com/tagged/coinbase-pro")
pageSource = driver.page_source
print(pageSource)

Any help would be appreciated. Thanks.

wait = WebDriverWait(driver, 30)
driver.get("https://blog.coinbase.com/tagged/coinbase-pro")
elements = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".graf.graf--h3.graf-after--figure.graf--trailing.graf--title")))
for elem in elements:
    print(elem.text)

If you want those 8 titles, you can grab them by their CSS selector using an explicit wait.

Imports:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC

Output:

Inverse Finance (INV), Liquity (LQTY), Polyswarm (NCT) and Propy (PRO) are launching on Coinbase Pro
Goldfinch Protocol (GFI) is launching on Coinbase Pro
Decentralized Social (DESO) is launching on Coinbase Pro
API3 (API3), Bluezelle (BLZ), Gods Unchained (GODS), Immutable X (IMX), Measurable Data Token (MDT) and Ribbon…
Circuits of Value (COVAL), IDEX (IDEX), Moss Carbon Credit (MCO2), Polkastarter (POLS), ShapeShift FOX Token (FOX)…
Voyager Token (VGX) is launching on Coinbase Pro
Alchemix (ALCX), Ethereum Name Service (ENS), Gala (GALA), mStable USD (MUSD) and Power Ledger (POWR) are launching…
Crypto.com Protocol (CRO) is launching on Coinbase Pro
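
Assembled into one piece, the snippet and imports from this answer might look like the sketch below. The function name `fetch_titles` and its parameters are illustrative and not part of the original answer. It passes `--headless` as a Chrome argument instead of the question's `options.headless = True`, on the assumption that a recent Selenium 4 release is installed (as far as I know, newer releases no longer accept that attribute). Selenium is imported inside the function only so that the file can be loaded on a machine without it.

```python
def fetch_titles(url, css_selector, timeout=30):
    """Open url in headless Chrome and return the text of every visible match."""
    # Imported lazily so that merely loading this file does not require Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")  # assumption: replaces options.headless = True
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the matching elements are present and visible, or time out.
        elements = WebDriverWait(driver, timeout).until(
            EC.visibility_of_all_elements_located((By.CSS_SELECTOR, css_selector))
        )
        return [element.text for element in elements]
    finally:
        driver.quit()  # always release the browser, even if the wait times out


# Usage (needs Chrome and a matching driver, so it is left commented out):
# selector = ".graf.graf--h3.graf-after--figure.graf--trailing.graf--title"
# for title in fetch_titles("https://blog.coinbase.com/tagged/coinbase-pro", selector):
#     print(title)
```

Returning a list rather than printing inside the function keeps the browser lifetime short and lets the caller decide what to do with the titles.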

There are several ways to get all the titles from that webpage. The fastest and most efficient is to use requests.

This is how you can get the titles using requests:

import re
import json
import time
import requests

link = 'https://medium.com/the-coinbase-blog/load-more'
params = {
    'sortBy': 'tagged',
    'tagSlug': 'coinbase-pro',
    'limit': 25,
    'to': int(time.time() * 1000),
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    s.headers['accept'] = 'application/json'
    s.headers['referer'] = 'https://blog.coinbase.com/tagged/coinbase-pro'
    
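    while True:
        res = s.get(link, params=params)
        # The body has a non-JSON prefix; skip everything before the first '{' and parse the rest.
        container = json.loads(re.findall("[^{]+(.*)", res.text)[0])
        for k, v in container['payload']['references']['Post'].items():
            title = v['title']
            print(title)

        # Follow the pagination cursor until the response no longer provides one.
        try:
            next_page = container['payload']['paging']['next']['to']
        except KeyError:
            break

        params['to'] = next_page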
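
To make the parsing and paging steps in the snippet above easier to follow, here is a small offline sketch of the same logic. The helper names (`parse_prefixed_json`, `extract_titles`, `next_cursor`) and the sample body are made up for illustration; the sample only mirrors the keys the snippet reads, and its prefix is an assumption about what the endpoint prepends. Unlike the original regex, this version also accepts a body that starts directly with `{` or spans several lines.

```python
import json
import re


def parse_prefixed_json(text):
    """Drop everything before the first '{' and parse the rest as JSON."""
    match = re.search(r"\{.*", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in response body")
    return json.loads(match.group(0))


def extract_titles(container):
    """Collect the title of every post referenced in the payload."""
    posts = container["payload"]["references"]["Post"]
    return [post["title"] for post in posts.values()]


def next_cursor(container):
    """Return the next 'to' value, or None when a key on the path is missing."""
    try:
        return container["payload"]["paging"]["next"]["to"]
    except KeyError:
        return None


# Hypothetical sample body: an assumed non-JSON prefix followed by the JSON payload.
sample = (
    '])}while(1);</x>'
    '{"payload": {"references": {"Post": {'
    '"a1": {"title": "First post"}, "b2": {"title": "Second post"}}}, '
    '"paging": {"next": {"to": 1640995200000}}}}'
)

container = parse_prefixed_json(sample)
print(extract_titles(container))  # ['First post', 'Second post']
print(next_cursor(container))     # 1640995200000
```

In the real loop, the value returned by `next_cursor` would be written back into `params['to']`, and a `None` would end the loop.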

However, if you want to stick with Selenium, try the following:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC

def scroll_down_to_the_bottom():
    # Uses the module-level `driver` created in the `with` block below.
    check_height = driver.execute_script("return document.body.scrollHeight;")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        try:
            # Wait up to 10 seconds for the page to grow after scrolling.
            WebDriverWait(driver, 10).until(lambda driver: driver.execute_script("return document.body.scrollHeight;") > check_height)
            check_height = driver.execute_script("return document.body.scrollHeight;")
        except TimeoutException:
            # No growth within the timeout: assume everything has loaded.
            break

with webdriver.Chrome() as driver:
    driver.get("https://blog.coinbase.com/tagged/coinbase-pro")
    scroll_down_to_the_bottom()
    for item in WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".section-content h3.graf--title"))):
        print(item.text)
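
The stopping rule of `scroll_down_to_the_bottom` can be tried without a browser. The sketch below is only an illustration: `FakePage` is a hypothetical stand-in for a lazily loading page, and `wait_until` is a simplified imitation of `WebDriverWait(...).until(...)` that raises Python's built-in `TimeoutError` rather than Selenium's `TimeoutException`.

```python
import time


def wait_until(condition, timeout, poll=0.05):
    """Poll condition() until it is truthy or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met in time")
        time.sleep(poll)


class FakePage:
    """Hypothetical page that grows by one batch per scroll until it runs out."""

    def __init__(self, batches):
        self.height = 1000
        self.remaining = batches

    def scroll_to_bottom(self):
        if self.remaining > 0:
            self.remaining -= 1
            self.height += 1000


def scroll_to_end(page, timeout=0.2):
    """Scroll until the height stops growing; return the number of scrolls made."""
    height = page.height
    scrolls = 0
    while True:
        page.scroll_to_bottom()
        scrolls += 1
        try:
            wait_until(lambda: page.height > height, timeout)
            height = page.height  # the page grew, so remember the new height
        except TimeoutError:
            break  # no growth within the timeout: treat this as the bottom
    return scrolls


page = FakePage(batches=3)
print(scroll_to_end(page), page.height)  # 4 4000
```

The last scroll is always the one that times out, so the approach costs one full timeout at the end; with the 10 second wait used in the answer, that is the price of knowing the page has stopped loading.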
