[英]Python - scraping a paginated site and writing the results to a file
I'm a complete programming beginner, so please forgive me if I can't express my problem well. I'm trying to write a script that goes through a series of news pages and records the article titles and their links. I've managed to get it working for the first page; the problem is getting the content of the subsequent pages. Searching Stack Overflow, I think I found a solution that lets the script visit multiple URLs, but it seems to overwrite the content extracted from each page it visits, so I always end up with the same number of recorded articles in the file. Something that may help: I know the URLs follow this pattern: "/ultimas/?page=1", "/ultimas/?page=2", etc., and the site appears to use AJAX to request new articles.
Here's my code:
import csv
import requests
from bs4 import BeautifulSoup as Soup
import urllib

r = base_url = "http://agenciabrasil.ebc.com.br/"
program_url = base_url + "/ultimas/?page="
for page in range(1, 4):
    url = "%s%d" % (program_url, page)
    soup = Soup(urllib.urlopen(url))
letters = soup.find_all("div", class_="titulo-noticia")
letters[0]
lobbying = {}
for element in letters:
    lobbying[element.a.get_text()] = {}
letters[0].a["href"]
prefix = "http://agenciabrasil.ebc.com.br"
for element in letters:
    lobbying[element.a.get_text()]["link"] = prefix + element.a["href"]
for item in lobbying.keys():
    print item + ": " + "\n\t" + "link: " + lobbying[item]["link"] + "\n\t"

import os, csv
os.chdir("...")
with open("lobbying.csv", "w") as toWrite:
    writer = csv.writer(toWrite, delimiter=",")
    writer.writerow(["name", "link",])
    for a in lobbying.keys():
        writer.writerow([a.encode("utf-8"), lobbying[a]["link"]])

import json
with open("lobbying.json", "w") as writeJSON:
    json.dump(lobbying, writeJSON)
print "Fim"
Any help on how to add each page's content to the final file would be much appreciated. Thank you!
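The symptom described above (the file always ends up with the same number of articles) usually means the results container, or the output file, is re-created on every pass of the page loop, so each page replaces the previous one. A minimal sketch of the correct shape, using made-up page data (`pages`, the titles, and the links are invented for illustration; the real script fills them from BeautifulSoup):

```python
import csv
import io

# stand-in for the (title, href) pairs scraped from each page
pages = [
    [("Title A", "/a"), ("Title B", "/b")],  # pretend page 1
    [("Title C", "/c")],                     # pretend page 2
]

buf = io.StringIO()
writer = csv.writer(buf)          # create the writer ONCE, before the loop
writer.writerow(["name", "link"])
for page in pages:                # every page appends to the same writer
    for title, href in page:
        writer.writerow([title, "http://agenciabrasil.ebc.com.br" + href])

rows = buf.getvalue().splitlines()
print(len(rows))  # header + 3 articles
```

The same idea applies to a real file: open it once before the loop (or collect everything in one dict and write once after the loop), instead of opening it in "w" mode per page.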
How about this, if it serves the same purpose:
import csv, requests
from lxml import html

base_url = "http://agenciabrasil.ebc.com.br"
program_url = base_url + "/ultimas/?page={0}"

outfile = open('scraped_data.csv', 'w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Caption", "Link"])

for url in [program_url.format(page) for page in range(1, 4)]:
    response = requests.get(url)
    tree = html.fromstring(response.text)
    for title in tree.xpath("//div[@class='noticia']"):
        caption = title.xpath('.//span[@class="field-content"]/a/text()')[0]
        policy = title.xpath('.//span[@class="field-content"]/a/@href')[0]
        writer.writerow([caption, base_url + policy])

outfile.close()
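One detail worth noting in this answer: the file is opened with newline=''. That is what the csv module's documentation recommends, because csv.writer emits its own "\r\n" row terminators; without newline='' those endings would be translated again on Windows, producing blank lines. A small self-contained check (using io.StringIO in place of a real file, which is my substitution):

```python
import csv
import io

# csv.writer terminates each row with "\r\n" itself; newline='' (as in the
# answer's open() call) keeps Python from translating those endings again
buf = io.StringIO(newline='')
writer = csv.writer(buf)
writer.writerow(["Caption", "Link"])
writer.writerow(["Example title", "http://agenciabrasil.ebc.com.br/x"])
print(buf.getvalue().count("\r\n"))  # prints 2: one terminator per row
```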
It looks like the code inside your for loop (for page in range(1, 4):) is never being reached, because your file isn't indented properly. If you tidy up your code, it works:
import csv, requests, os, json, urllib
from bs4 import BeautifulSoup as Soup

r = base_url = "http://agenciabrasil.ebc.com.br/"
program_url = base_url + "/ultimas/?page="
prefix = "http://agenciabrasil.ebc.com.br"

lobbying = {}  # create the dict once, so results from every page accumulate
for page in range(1, 4):
    url = "%s%d" % (program_url, page)
    soup = Soup(urllib.urlopen(url))
    letters = soup.find_all("div", class_="titulo-noticia")
    for element in letters:
        lobbying[element.a.get_text()] = {}
    for element in letters:
        lobbying[element.a.get_text()]["link"] = prefix + element.a["href"]

for item in lobbying.keys():
    print item + ": " + "\n\t" + "link: " + lobbying[item]["link"] + "\n\t"

#os.chdir("...")
with open("lobbying.csv", "w") as toWrite:
    writer = csv.writer(toWrite, delimiter=",")
    writer.writerow(["name", "link",])
    for a in lobbying.keys():
        writer.writerow([a.encode("utf-8"), lobbying[a]["link"]])

with open("lobbying.json", "w") as writeJSON:
    json.dump(lobbying, writeJSON)

print "Fim"
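The key to the fix above is that lobbying is created once, before the page loop, so entries from every page accumulate instead of being thrown away on each iteration. The same pattern in isolation, with invented stand-in data (fake_pages and the titles are not from the real site):

```python
import json

lobbying = {}   # created once, OUTSIDE the loop over pages
fake_pages = [  # stand-in for parsed pages; real data comes from BeautifulSoup
    {"Title A": "/a", "Title B": "/b"},
    {"Title C": "/c"},
]
for page in fake_pages:
    for title, href in page.items():
        lobbying[title] = {"link": "http://agenciabrasil.ebc.com.br" + href}

# all three articles survive, not just the ones from the last page
print(len(lobbying))                # prints 3
print(json.dumps(lobbying, sort_keys=True))
```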