Webscrape CNN, injection, Beautiful Soup, Python, requests, HTML

Question

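OK, I thought I was going crazy because I have failed at this so many times, but I figured maybe something is going on in the HTML that I don't understand.

I have been trying to scrape the "articles" from cnn.com.

But no matter what I tried, whether soup.find_all('articles') or soup.find('body').div('div')... with class tags, ids, and so on, it failed.

I found this reference: Webscraping from React web application after componentDidMount.

I suspect that injection in the HTML is the reason I'm having trouble.

I know nothing about injection apart from "HTML injection attacks" from my cybersecurity reading.

I want these articles, but I assume I need to use a strategy similar to the other Stack Overflow question linked above. I don't know how. Links to help docs, or to CNN scraping specifically, would be greatly appreciated.

Or, if anyone knows how I can get the "full data" of the HTML body element, then I can rearrange this definition in my earlier code and reassign the body.

"Or just tell me I'm an idiot and on the wrong path."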

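import requests
from bs4 import BeautifulSoup


def build_art_d(site):
    # Download the raw HTML; the page's JavaScript is not executed here
    html = requests.get(site).text
    soup = BeautifulSoup(html, 'lxml')

    print(soup.prettify())

    art_dict = {}

    # Walk down to the container that should hold the articles.
    # Each find() returns None if the element is missing from the downloaded HTML.
    body = soup.find('body')
    print(body.prettify())
    div1 = body.find('div', {'class': 'pg-no-rail pg-wrapper'})
    section = div1.find('section', {'id': 'homepage1-zone-1'})
    div2 = section.find('div', {'class': 'l-container'})
    div3 = div2.find('div', {'class': 'zn__containers'})
    articles = div3.find_all('article')

    for art in articles:
        # art.href would look for a child <href> tag, so read the first <a> instead
        link = art.find('a', href=True)
        art_dict[art.get_text(strip=True)] = link['href'] if link else None

    # test print
    for article in art_dict:
        print('Article :: {}'.format(article), 'Link :: {}'.format(art_dict[article]))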

Answer 1

You can use Selenium to load the data that the site's JavaScript fills in, then use your existing bs4 code to scrape the articles.

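from bs4 import BeautifulSoup
from selenium import webdriver

# Open the page in a real browser so its JavaScript runs
driver = webdriver.Chrome()
driver.get('https://www.cnn.com/')

# Parse the rendered page source, then close the browser
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()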
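
To go one step further, here is a rough sketch of how the rendered page could be turned into a title-to-link mapping like the one the question is after. It is not tested against the live site: the helper names `extract_article_links` and `fetch_rendered_html` are made up for illustration, and it assumes the rendered page still wraps each story in an `<article>` tag containing an `<a href=...>`, as the question's code suggests. The explicit wait is there because the page source may be read before the JavaScript has finished filling in the articles.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_article_links(html, base_url="https://www.cnn.com/"):
    """Map each <article>'s text to the absolute URL of its first link."""
    soup = BeautifulSoup(html, "html.parser")
    links = {}
    for art in soup.find_all("article"):
        anchor = art.find("a", href=True)
        if anchor is None:
            continue  # skip articles that carry no link
        title = art.get_text(" ", strip=True)
        # Relative hrefs are resolved against base_url
        links[title] = urljoin(base_url, anchor["href"])
    return links


def fetch_rendered_html(url, timeout=10):
    """Load url in Chrome and return the page source after JavaScript has run."""
    # Imported here so the parsing helper above works without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Assumption: the rendered page contains at least one <article> tag;
        # until() raises TimeoutException if none appears within `timeout` seconds
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "article"))
        )
        return driver.page_source
    finally:
        driver.quit()
```

Something like `extract_article_links(fetch_rendered_html('https://www.cnn.com/'))` should then give a dict you can loop over the same way as `art_dict` in the question. Keeping the parsing separate from the browser step also means you can try the parser on a saved HTML file without launching Chrome.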

0 Accepted 2021-01-11 02:00:02
