Web Scraping in Python with BeautifulSoup
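I'm new to scraping, and I've been trying to scrape a web page that contains some quotes I want to extract.
Could you also check the code that copies the scraped data to a CSV file?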
import requests
from bs4 import BeautifulSoup
import csv

URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
quotes = []  # a list to store quotes
table = soup.find('div', attrs = {'id':'container'})
for row in table.findAll('div', attrs = {'class':'quote'}):
    quote = {}
    quote['theme'] = row.h5.text
    quote['url'] = row.a['href']
    quote['img'] = row.img['src']
    quote['lines'] = row.h6.text
    quote['author'] = row.p.text
    quotes.append(quote)

filename = 'inspirational_quotes.csv'
with open(filename, 'wb') as f:
    w = csv.DictWriter(f,['theme','url','img','lines','author'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)
I'm getting an error at the "findAll" call:
    for row in table.findAll('div', attrs = {'class':'quote'}):
AttributeError: 'NoneType' object has no attribute 'findAll'
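The site's HTML is different from what you've defined in your script. I've corrected the first three fields; I think you can do the rest. The following should work for you.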
import csv
import requests
from bs4 import BeautifulSoup

URL = "https://www.passiton.com/inspirational-quotes?page={}"
quotes = []
page = 1

while True:
    r = requests.get(URL.format(page))
    print(r.url)
    soup = BeautifulSoup(r.content, 'html5lib')
    if not soup.select_one("#all_quotes .text-center > a"):
        break
    for row in soup.select("#all_quotes .text-center"):
        quote = {}
        try:
            quote['quote'] = row.select_one('a img.shadow').get("alt")
        except AttributeError:
            quote['quote'] = ""
        try:
            quote['url'] = row.select_one('a').get('href')
        except AttributeError:
            quote['url'] = ""
        try:
            quote['img'] = row.select_one('a img.shadow').get('src')
        except AttributeError:
            quote['img'] = ""
        quotes.append(quote)
    page += 1

with open('inspirational_quotes.csv', 'w', newline="", encoding="utf-8") as f:
    w = csv.DictWriter(f, ['quote','url','img'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)
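For anyone wondering why the original script failed: find() returns None when no element matches, so calling findAll on the result raises the AttributeError from the question. Below is a minimal offline sketch of checking for None before chaining calls, then writing the rows to CSV in text mode. The SAMPLE_HTML markup and the helper names extract_quotes and quotes_to_csv are made up for illustration, and it uses the built-in html.parser rather than html5lib so it runs without extra installs; the real page's structure may differ.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page's HTML may differ.
SAMPLE_HTML = """
<div id="all_quotes">
  <div class="text-center">
    <a href="/inspirational-quotes/1"><img class="shadow" alt="Be kind." src="/q1.jpg"></a>
  </div>
  <div class="text-center">
    <p>No link or image in this one</p>
  </div>
</div>
"""


def extract_quotes(html):
    """Return one dict per quote block, using "" for any missing field."""
    # html.parser ships with Python; the answer above uses html5lib instead.
    soup = BeautifulSoup(html, "html.parser")

    # find() returns None when nothing matches, so check before chaining calls.
    container = soup.find("div", attrs={"id": "all_quotes"})
    if container is None:
        return []

    quotes = []
    for row in container.select(".text-center"):
        link = row.select_one("a")
        img = row.select_one("a img.shadow")
        quotes.append({
            "quote": img.get("alt", "") if img is not None else "",
            "url": link.get("href", "") if link is not None else "",
            "img": img.get("src", "") if img is not None else "",
        })
    return quotes


def quotes_to_csv(quotes, fileobj):
    """Write the quote dicts to an already-open text-mode file object."""
    writer = csv.DictWriter(fileobj, fieldnames=["quote", "url", "img"])
    writer.writeheader()
    writer.writerows(quotes)


quotes = extract_quotes(SAMPLE_HTML)
buffer = io.StringIO()  # stands in for open(..., 'w', newline="", encoding="utf-8")
quotes_to_csv(quotes, buffer)
print(buffer.getvalue())
```

Checking for None explicitly does the same job as the try/except blocks above, and it makes it obvious which lookup came back empty. On Python 3 the csv module expects a text-mode file, which is why the answer opens the file with 'w' and newline="" instead of 'wb'.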
The site doesn't have any div tag with the attribute id: container. You can use a quotes API instead:
from requests import get
url='https://quote-garden.herokuapp.com/quotes/random'
res=get(url)
res=res.json()
quote=res["quoteText"]
quoteauthor=res["quoteAuthor"]
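If you go the API route, it may be worth adding a timeout and a status check so a failed request doesn't surface later as a confusing KeyError. A rough sketch, assuming the response is a JSON object with the quoteText and quoteAuthor keys used above; the helper names extract_quote and fetch_random_quote and the sample payload are hypothetical, and I haven't verified that the service is still reachable.

```python
import requests

# URL taken from the snippet above; its availability is not verified here.
API_URL = "https://quote-garden.herokuapp.com/quotes/random"


def extract_quote(payload):
    """Pull (text, author) out of a decoded JSON object, defaulting to ""."""
    return payload.get("quoteText", ""), payload.get("quoteAuthor", "")


def fetch_random_quote(url=API_URL, timeout=10):
    """Fetch one quote; network and HTTP errors raise requests exceptions."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return extract_quote(response.json())


# Offline usage with a made-up payload shaped like the keys used above.
sample = {"quoteText": "Stay curious.", "quoteAuthor": "Unknown"}
text, author = extract_quote(sample)
print(f"{text} ({author})")
```

Keeping the parsing in its own function means you can test it without hitting the network.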