Web scraping with requests not working properly

Question

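I am trying to get the HTML from CNN for a personal project. I am using the requests library and am new to it. I followed a basic tutorial for fetching HTML from CNN with requests, but I keep getting a response that differs from the HTML I see when I inspect the page in my browser. Here is my code: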

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.cnn.com/'
r = requests.get(base_url)
soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())

I am trying to get the article headlines from CNN, but this is the first problem I ran into. Thanks for your help!

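Update: It seems I know even less than I first assumed. My real question is: how do I extract the headlines from the CNN homepage? I have tried both answers, but the HTML returned by requests does not contain the headline information. How can I get the headline information shown in this image (a screenshot from my browser)?

[Image: screenshot of CNN article headlines side by side with the page HTML]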

Answer 1

You can use Selenium with ChromeDriver to scrape https://cnn.com.

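import bs4 as bs
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
# Selenium 4 takes the driver path through a Service object;
# Selenium 3 accepted it as the first positional argument.
driver = webdriver.Chrome(service=Service("---CHROMEDRIVER-PATH---"), options=chrome_options)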

driver.get('https://cnn.com/')
soup = bs.BeautifulSoup(driver.page_source, 'lxml')

# Get Titles from HTML.
titles = soup.find_all('span', {'class': 'cd__headline-text'})
print(titles)

# Close ChromeDriver.
driver.close()
driver.quit()

Output:

[<span class="cd__headline-text"><strong>The West turned Aung San Suu Kyi into a saint. She was always going to disappoint </strong></span>, <span class="cd__headline-text"><strong>In Hindu-nationalist India, Muslims risk being branded infiltrators</strong></span>, <span class="cd__headline-text">Johnson may have stormed to victory, but he's got a problem</span>, <span class="cd__headline-text">Impeachment heads to full House after historic vote</span>, <span class="cd__headline-text">Supreme Court to decide on Trump's financial records</span>, <span class="cd__headline-text">Michelle Obama's message for Thunberg after Trump mocks her</span>, <span class="cd__headline-text">Actor Danny Aiello dies at 86</span>, <span class="cd__headline-text">The biggest risk at the North Pole isn't what you think</span>, <span class="cd__headline-text">US city declares state of emergency after cyberattack </span>, <span class="cd__headline-text">Reality TV show host arrested</span>, <span class="cd__headline-text">Big names in 2019 you may have mispronounced</span>, <span class="cd__headline-text"><strong>Morocco has Africa's 'first fully solar village'</strong></span>]

You can download ChromeDriver here.
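The output above is a list of whole span tags rather than plain headline strings. A minimal sketch of pulling just the text out, assuming the headlines still sit in span elements with the class cd__headline-text (the class used in this answer; CNN may have changed its markup since). The function name extract_headlines and the sample HTML are illustrative, and the sample stands in for driver.page_source so the sketch runs without a browser:

```python
from typing import List

from bs4 import BeautifulSoup


def extract_headlines(html: str) -> List[str]:
    """Return the text of every span whose class is cd__headline-text."""
    soup = BeautifulSoup(html, "html.parser")
    spans = soup.find_all("span", class_="cd__headline-text")
    # get_text(strip=True) drops nested tags such as <strong> and trims whitespace.
    return [span.get_text(strip=True) for span in spans]


# Hypothetical stand-in for driver.page_source, built from two headlines in the output above.
sample_html = (
    '<span class="cd__headline-text"><strong>Reality TV show host arrested </strong></span>'
    '<span class="cd__headline-text">Actor Danny Aiello dies at 86</span>'
    '<span class="other">Not a headline</span>'
)

headlines = extract_headlines(sample_html)
print(headlines)
```

With a live browser session, passing driver.page_source to extract_headlines in place of sample_html should give one string per headline.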

Answer 2

I tried the following code and it worked for me.

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.cnn.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36'
}
r = requests.get(base_url, headers=headers)
soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())

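Note that I passed a headers argument to requests.get(). All it does is try to mimic a real browser so that the site's anti-scraping checks do not detect the request.
Hope this helps; if not, feel free to ask me in the comments. Cheers :)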
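To see why the header can matter, it may help to look at what requests sends when no User-Agent is given. The sketch below builds the request without sending it, so nothing touches the network. The helper name outgoing_user_agent is made up for illustration, and whether CNN actually filters on this header is an assumption, not something confirmed here:

```python
import requests

BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36"
)


def outgoing_user_agent(headers=None):
    """Return the User-Agent header requests would send, without sending anything."""
    session = requests.Session()
    request = requests.Request("GET", "https://www.cnn.com/", headers=headers)
    # prepare_request merges the session's default headers with the ones passed in.
    prepared = session.prepare_request(request)
    return prepared.headers["User-Agent"]


default_ua = outgoing_user_agent()
custom_ua = outgoing_user_agent({"User-Agent": BROWSER_UA})

print(default_ua)  # identifies the client as python-requests/<version>
print(custom_ua)   # the browser-like string from the answer above
```

The default value names the library outright, which makes a script easy to tell apart from a browser. A custom header only changes that one signal; it does not make requests execute the page's JavaScript.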

Answer 3

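I just checked. CNN seems to recognize that you are trying to scrape the site programmatically and serves a 404/missing page (with no content on it) instead of the homepage.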
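One way to confirm this before parsing is to check the status code and look for a marker that only the real page should contain. This is a rough sketch: the function name looks_blocked is hypothetical, and the default marker cd__headline-text is borrowed from Answer 1 on the assumption that it appears in the fully rendered homepage.

```python
def looks_blocked(status_code: int, html: str, marker: str = "cd__headline-text") -> bool:
    """Return True when a response is probably not the real, fully rendered page."""
    if status_code != 200:
        # A 404 or any other error status means the homepage was not served.
        return True
    # A 200 response can still lack the content if it is filled in by JavaScript.
    return marker not in html


# Intended use with requests (not executed here because it needs network access):
#   r = requests.get("https://www.cnn.com/")
#   if looks_blocked(r.status_code, r.text):
#       print("Headlines are missing; a real browser may be needed.")

print(looks_blocked(404, ""))
print(looks_blocked(200, "<html><body></body></html>"))
print(looks_blocked(200, '<span class="cd__headline-text">Example</span>'))
```

The second case matches the situation in the question's update: the request succeeds, but the headline markup is not in the returned HTML.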

Try driving a real browser (which can also run headless) with a tool like Selenium, for example:

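from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://cnn.com')
html = driver.page_source  # HTML after the browser has run the page's JavaScript
driver.quit()  # close the browser once the HTML is captured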


3 answers

Answer 1: score 2, accepted, 2019-12-14 09:07:48
Answer 2: score 1, 2019-12-12 21:30:13
Answer 3: score 0, 2019-12-12 21:15:10
