
Webscrape CNN, injection, beautiful soup, python, requests, HTML

Okay, I thought I was crazy because I repeatedly failed at this, but I thought, maybe something is happening with the html that I don't understand.

I have been trying to scrape the 'articles' from cnn.com.

But no matter which way I tried it, soup.find_all('articles'), or soup.find('body').div('div')... etc., with class tags, ids, and so on, it fails.

I found this reference: Webscraping from React web application after componentDidMount.

I suspect injection in html is why I am having issues.

I know nothing about injection other than 'HTML injection attacks' from cybersecurity reading.
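A quick way to check whether the content really is injected client-side is to look at what requests gets back before any JavaScript runs: if the raw response contains no <article> tags, the markup is being added in the browser. A minimal diagnostic sketch (the URL and the lxml parser are just assumptions):

import requests
from bs4 import BeautifulSoup

# Does the server-rendered HTML contain any <article> tags or the container
# class at all? If not, the content is injected by JavaScript in the browser,
# and requests alone will never see it.
raw_html = requests.get('https://www.cnn.com/').text
soup = BeautifulSoup(raw_html, 'lxml')

print('article tags in raw response:', len(soup.find_all('article')))
print("'zn__containers' in raw response:", 'zn__containers' in raw_html)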

I want the articles, but I am assuming I will need to use a tactic similar to the Stack Overflow question linked above. I do not know how. Links to help documents, or anything specifically about scraping CNN, would be appreciated.

Or, if someone knows how I could get the 'full data' of the html body element, I could do some rearranging early in this function and then just reassign body.

'Or just tell me I'm an idiot and on the wrong track'

import requests
from bs4 import BeautifulSoup


def build_art_d(site):
    url = site

    # Fetch the page and parse it
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'lxml')

    print(soup.prettify())

    art_dict = {}

    # Walk down to the container that should hold the <article> tags
    body = soup.find('body')
    print(body.prettify())
    div1 = body.find('div', {'class': 'pg-no-rail pg-wrapper'})
    section = div1.find('section', {'id': 'homepage1-zone-1'})
    div2 = section.find('div', {'class': 'l-container'})
    div3 = div2.find('div', {'class': 'zn__containers'})
    articles = div3.find_all('article')

    # Map article text to its link (a bs4 Tag has no .href attribute,
    # so use .get('href') instead)
    for art in articles:
        art_dict[art.text] = art.get('href')

    # test print
    for article in art_dict:
        print('Article :: {}'.format(article), 'Link :: {}'.format(art_dict[article]))

You can use Selenium to let the site's JavaScript fill in the data, then use your existing bs4 code to scrape the articles.

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.cnn.com/')

soup = BeautifulSoup(driver.page_source, 'html.parser')
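If the page has not finished rendering when page_source is read, the soup can still come back empty, so it can help to wait for the article elements explicitly before parsing. A minimal sketch building on the answer above (the 10-second timeout and waiting on the <article> tag are assumptions, not anything CNN documents):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.cnn.com/')

# Wait until at least one <article> element exists in the rendered DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'article'))
)

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

# Reuse the bs4 logic from the question on the rendered HTML
for art in soup.find_all('article'):
    link = art.find('a')
    print(art.get_text(strip=True), link.get('href') if link else None)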
