Scraping web page data with Python

Question

I have just started learning web scraping with Python. My aim is to scrape the real-time news for Bajaj Auto Ltd. from http://money.rediff.com/companies/Bajaj-Auto-Ltd/10540026.

The problem: I am unable to extract the contents (i.e. the news).

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://money.rediff.com/companies/Bajaj-Auto-Ltd/10540026'
data = urlopen(url)
soup = BeautifulSoup(data)

te=soup.find('a',attrs={'target':'_jbpinter'})
lis=te.find_all_next('a',attrs={'target':'_jbpinter'})
#print(lis)

for li in lis:
    print(li.find('a').contents[0])

I am getting the error "AttributeError: 'NoneType' object has no attribute 'contents'" and I do not get the desired result.

Any input will be appreciated.

Answer 1

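You are trying to get the a tag twice. Each item in lis is already an a tag, so li.find('a') searches inside it for a nested a tag, finds none, and returns None.

Replace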
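To see where the None comes from, here is a minimal sketch that reproduces the situation offline. The HTML string is a hypothetical stand-in for the page's news links (the real markup may differ), and the parser name 'html.parser' is my choice, not something from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the news links on the page.
html = '''
<div>
  <a target="_jbpinter" href="/news/1">First headline</a>
  <a target="_jbpinter" href="/news/2">Second headline</a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a', attrs={'target': '_jbpinter'})

# The link has no <a> nested inside it, so find() returns None.
nested = link.find('a')
print(nested)            # None

# Reading the text from the link itself works.
print(link.get_text())   # First headline
```

Calling .contents on nested would raise the same AttributeError as in the question, because nested is None.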

for li in lis:
    print(li.find('a').contents[0])

with

for li in lis:
    print(li.get_text())

and you get this output:

Need Different Rates For Different Products: Rahul Bajaj on GST
Reforms irrespective of Bihar results: Bajaj
Auto shares in focus; Tata Motors up over 5%
We believe new Avenger will stimulate the market: Bajaj Auto's Eric Vas
BHP Billiton pins future of Indonesian coal mine on new...
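Putting the fix together, one possible sketch of the whole script follows. The function names extract_headlines and fetch_headlines are hypothetical helpers introduced here, and the sample HTML is made up for illustration. It also uses find_all instead of find followed by find_all_next, since find_all_next skips the element it starts from; whether that matters depends on the page's actual markup.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup


def extract_headlines(html):
    """Return the text of every <a target="_jbpinter"> link in the HTML."""
    # Naming the parser explicitly avoids bs4's "no parser specified" warning.
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.find_all('a', attrs={'target': '_jbpinter'})
    # Each item is already an <a> tag, so read its text directly.
    return [link.get_text(strip=True) for link in links]


def fetch_headlines(url):
    """Download the page at url and return its headline texts."""
    with urlopen(url) as response:
        return extract_headlines(response.read())


# Offline check on hypothetical markup; no network access needed.
sample = (
    '<a target="_jbpinter">First headline</a>'
    '<a target="_blank">Unrelated link</a>'
    '<a target="_jbpinter"> Second headline </a>'
)
for headline in extract_headlines(sample):
    print(headline)
```

For the live page, fetch_headlines('http://money.rediff.com/companies/Bajaj-Auto-Ltd/10540026') would be the call to try, assuming the page still marks its news links with target="_jbpinter".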

Answer 1: accepted, 2015-11-04 16:52:11
