
How to scrape headline news, link and image?

I'd like to scrape each news headline, the link to the article, and the picture that goes with it.

[Image: enter image description here]

I tried the web scraping code below, but it only handles the headlines, and it does not work.

import requests
import pandas as pd
from bs4 import BeautifulSoup

nbc_business = "https://news.mongabay.com/list/environment"
res = requests.get(nbc_business, verify=False)
soup = BeautifulSoup(res.content, 'html.parser')

headlines = soup.find_all('h2',{'class':'post-title-news'})
len(headlines)
for i in range(len(headlines)):
    print(headlines[i].text)

Any suggestions would be appreciated.

This happens because the site blocks bots. If you print res.content, you will see a 403 response.

Add headers={'User-Agent':'Mozilla/5.0'} to the request.

Try the code below:

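import requests
from bs4 import BeautifulSoup

nbc_business = "https://news.mongabay.com/list/environment"
# Send a browser-like User-Agent so the site does not answer with 403
res = requests.get(nbc_business, verify=False, headers={'User-Agent': 'Mozilla/5.0'})

soup = BeautifulSoup(res.content, 'html.parser')

# Each headline is an <h2 class="post-title-news"> element
headlines = soup.find_all('h2', class_='post-title-news')
print(len(headlines))
for headline in headlines:
    print(headline.text)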
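
To confirm the 403 diagnosis instead of reading it off the response body, the status code can be checked directly. This is a minimal sketch: fetch_html is a hypothetical helper name, and the timeout value is an assumption that does not come from the answer.

```python
import requests

# Browser-like header, as suggested in the answer above
HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_html(url, timeout=10):
    """Hypothetical helper: return the page HTML, or raise if the server refuses."""
    res = requests.get(url, headers=HEADERS, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses such as 403
    res.raise_for_status()
    return res.text
```

Calling fetch_html("https://news.mongabay.com/list/environment") should then either return the HTML or fail loudly with an HTTPError, which is easier to debug than an empty list of headlines.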

First things first: never post code as an image.

The <h2> in your HTML has no text of its own. What it does have is an <a> element, so:

for hl in headlines:
    link = hl.find('a')  # the <a> element inside the <h2>
    text = link.text     # headline text
    url = link['href']   # article URL
    print(text, url)
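
The loop above covers the headline text and the link, but the question also asks for the picture. The thread does not show where the image sits in the page, so the following is only a sketch against made-up markup: SAMPLE_HTML, the <div class="article"> wrapper, the example.com URLs, and the extract_articles helper are all assumptions for illustration. On the real page, the lookup for the <img> would need adjusting to the actual layout.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real page layout is not shown in the thread,
# so the <div class="article"> wrapper and the <img> position are assumptions.
SAMPLE_HTML = """
<div class="article">
  <img src="https://example.com/forest.jpg" alt="forest">
  <h2 class="post-title-news"><a href="https://example.com/story-1">Forest story</a></h2>
</div>
<div class="article">
  <h2 class="post-title-news"><a href="https://example.com/story-2">River story</a></h2>
</div>
"""


def extract_articles(html):
    """Return one dict (title, url, image) per headline found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for h2 in soup.find_all("h2", class_="post-title-news"):
        link = h2.find("a")
        if link is None:
            continue  # headline without a link: skip it
        # Look for an image inside the element that wraps the headline (assumed layout)
        img = h2.parent.find("img") if h2.parent else None
        rows.append({
            "title": link.get_text(strip=True),
            "url": link.get("href"),
            "image": img.get("src") if img else None,
        })
    return rows


for row in extract_articles(SAMPLE_HTML):
    print(row)
```

Since the question already imports pandas, the returned list of dicts could be passed to pd.DataFrame(rows) to get a table of headlines, links and image URLs.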
