Webscrape CNN, injection, Beautiful Soup, Python, requests, HTML

Question

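Alright, I thought I was going crazy because I keep failing at this, but then I figured that maybe something is going on with the HTML that I don't understand.

I've been trying to scrape the "articles" from cnn.com.

But no matter what I tried, whether soup.find_all('articles') or soup.find('body').div('div')... and so on, with class tags, ids, etc., it failed.

I found this reference: Webscraping from React web application after componentDidMount.

I suspect that injection within the HTML is the reason I'm having problems.

I know nothing about injection beyond the "HTML injection attacks" I've read about in cybersecurity material.

I want these articles, but I assume I'll need a strategy similar to the one in the other Stack Overflow question linked above. I don't know how to do that. Links to help documentation, or to anything specifically about scraping CNN, would be greatly appreciated.

Or, if anyone knows how I can get the "full data" of the HTML body element, I could rearrange the definition in my code below and then reassign body.

"Or just tell me I'm an idiot and on the wrong path."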

import requests
from bs4 import BeautifulSoup

def build_art_d(site):
    # requests only downloads the raw HTML; no JavaScript is executed
    html = requests.get(site).text
    soup = BeautifulSoup(html, 'lxml')

    print(soup.prettify())

    art_dict = {}

    body = soup.find('body')
    print(body.prettify())
    # Each find() returns None when the element is missing from the raw HTML
    div1 = body.find('div', {'class': 'pg-no-rail pg-wrapper'})
    section = div1.find('section', {'id': 'homepage1-zone-1'})
    div2 = section.find('div', {'class': 'l-container'})
    div3 = div2.find('div', {'class': 'zn__containers'})
    articles = div3.find_all('article')

    for art in articles:
        # A Tag has no .href attribute; read it from the nested <a> element
        link = art.find('a', href=True)
        art_dict[art.get_text(strip=True)] = link['href'] if link else None

    # test print
    for title, href in art_dict.items():
        print('Article :: {}'.format(title), 'Link :: {}'.format(href))

Answer 1

You can use Selenium to load the page so that the data filled in by the site's JavaScript is present, then use your existing bs4 code to scrape the articles.

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.cnn.com/')

soup = BeautifulSoup(driver.page_source, 'html.parser')
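
A slightly fuller sketch of the same idea, in case it helps: wait until at least one article element has been rendered before reading the page source, then build the text-to-link dictionary from the question. The helper names fetch_rendered_html and extract_articles are made up for illustration, and the assumption that the rendered CNN page still uses article tags comes from the question's code, not from anything verified here.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def fetch_rendered_html(url, timeout=10):
    """Load url in Chrome and return the HTML once an <article> exists."""
    # Imported lazily so extract_articles() works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Assumption: the site's JavaScript renders at least one <article>.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "article"))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()


def extract_articles(html, base_url):
    """Map each article's text to the absolute URL of its first link."""
    soup = BeautifulSoup(html, "html.parser")
    articles = {}
    for art in soup.find_all("article"):
        link = art.find("a", href=True)
        if link is None:
            continue  # skip articles that contain no link
        title = art.get_text(" ", strip=True)
        if title:
            # Resolve relative hrefs such as "/2021/..." against the site URL.
            articles[title] = urljoin(base_url, link["href"])
    return articles
```

Usage would look like extract_articles(fetch_rendered_html('https://www.cnn.com/'), 'https://www.cnn.com/'). Keeping the parsing separate from the browser step means the parsing can be tried on a saved HTML file without launching Chrome.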

Solution 1
0 Accepted 2021-01-11 02:00:02
