BeautifulSoup - Scraping data through paginated table using Python
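I was able to write code that finds the first and last page numbers, but only the data from page 1 ends up in the CSV. I need to extract the data from all 10 pages into the CSV. Where am I going wrong in my code?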
# Import the installed modules
import requests
from bs4 import BeautifulSoup
import csv
# To get the data from the web page, we will use the requests get() method
url = "https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore"
page = requests.get(url)
# Check the HTTP response status code
print(page.status_code)
# Now that I have collected the data from the web page, let's see what we got
print(page.text)
# The data above can be viewed in a readable format with BeautifulSoup's prettify() method.
# For this, we create a bs4 object and call prettify on it
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())
outfile = open('gymlookup.csv','w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Name", "Address", "Phone"])
# Find all the DIVs that contain company information
product_name_list = soup.findAll("div",{"class":"CompanyInfo"})
# Extract the first and last page numbers
paging = soup.find("div",{"class":"pg-full-width me-pagination"}).find("ul",{"class":"pagination"}).find_all("a")
start_page = paging[1].text
last_page = paging[len(paging)-2].text
# Now loop through those elements
for element in product_name_list:
    # Take one "div", {"class": "CompanyInfo"} block and find/store the name, address and phone
    name = element.find('h2').text
    address = element.find('address').text.strip()
    phone = element.find("ul",{"class":"submenu"}).text.strip()
    # Write the name, address and phone to the CSV
    writer.writerow([name, address, phone])
    # Now move on to the next "div", {"class": "CompanyInfo"} tag and repeat
outfile.close()
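The script above reads the last page number by position (`paging[len(paging)-2]`), which only works if the second-to-last pagination link is always the highest page. A minimal sketch of a less position-dependent approach follows, assuming the link texts are plain page numbers mixed with non-numeric previous/next labels. The helper name `last_page_number` and the sample labels are hypothetical and not taken from the site.

```python
def last_page_number(labels):
    """Return the largest numeric label, or 1 if no label is numeric."""
    stripped = (label.strip() for label in labels)
    numbers = [int(text) for text in stripped if text.isdecimal()]
    return max(numbers, default=1)


# Hypothetical link texts; with BeautifulSoup they could be collected as
# [a.get_text(strip=True) for a in paging].
sample_labels = ["«", "1", "2", "3", "10", "»"]
print(last_page_number(sample_labels))
```

Falling back to 1 when no numeric label is found keeps a single-page result from raising an error.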
You need one more loop. The script only ever requests the first URL, so you now have to loop over the URL of each page; see below.
import requests
from bs4 import BeautifulSoup
import csv
root_url = "https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore"
html = requests.get(root_url)
soup = BeautifulSoup(html.text, 'html.parser')
# Read the first and last page numbers from the pagination links
paging = soup.find("div",{"class":"pg-full-width me-pagination"}).find("ul",{"class":"pagination"}).find_all("a")
start_page = paging[1].text
last_page = paging[len(paging)-2].text
outfile = open('gymlookup.csv','w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Name", "Address", "Phone"])
pages = list(range(1,int(last_page)+1))
# Outer loop: request and parse every page in turn
for page in pages:
    url = 'https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore&page=%s' %(page)
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')
    #print(soup.prettify())
    print ('Processing page: %s' %(page))
    product_name_list = soup.find_all("div",{"class":"CompanyInfo"})
    # Inner loop: write one CSV row per company block on this page
    for element in product_name_list:
        name = element.find('h2').text
        address = element.find('address').text.strip()
        phone = element.find("ul",{"class":"submenu"}).text.strip()
        writer.writerow([name, address, phone])
outfile.close()
print ('Done')
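The two-level structure of that answer (an outer loop over pages, an inner loop over the rows of each page) can be sketched without any network access. In this rough illustration `scrape_pages` and `fake_fetch_rows` are hypothetical names; `fake_fetch_rows` stands in for "request the page and parse its CompanyInfo divs" and returns made-up rows.

```python
import csv
import io


def scrape_pages(fetch_rows, last_page, out):
    """Write the rows of pages 1..last_page to `out` as CSV; return the row count."""
    writer = csv.writer(out)
    writer.writerow(["Name", "Address", "Phone"])
    count = 0
    for page in range(1, last_page + 1):   # outer loop: one iteration per page
        for row in fetch_rows(page):       # inner loop: one row per company
            writer.writerow(row)
            count += 1
    return count


def fake_fetch_rows(page):
    # Hypothetical stand-in for requests.get(...) plus BeautifulSoup parsing.
    return [("Gym %d-%d" % (page, i), "Lahore", "000") for i in (1, 2)]


buffer = io.StringIO()
total = scrape_pages(fake_fetch_rows, 3, buffer)
print(total)
print(buffer.getvalue())
```

Passing the fetch step in as a function keeps the looping and CSV logic checkable on its own; a real run would pass a function that downloads and parses each page, and an open file instead of the `StringIO` buffer.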
You should also use the page parameter in the URL, e.g. https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore&page=2
Sample code for 10 pages:
import requests
url = "https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore&page={}"
# range() excludes its end value, so 11 is needed to reach page 10
for page_num in range(1, 11):
    page = requests.get(url.format(page_num))
    #further processing
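Because `range` excludes its end value, covering pages 1 through 10 needs `range(1, 11)`. As a small sketch of building the ten page URLs, the query string can also be assembled with the standard library's `urlencode` instead of string formatting. The parameter names come from the URL above; `BASE_URL` and `page_url` are illustrative names.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.lookup.pk/dynamic/search.aspx"


def page_url(page_num):
    """Build the search URL for one results page."""
    params = {"searchtype": "kl", "k": "gym", "l": "lahore", "page": page_num}
    return BASE_URL + "?" + urlencode(params)


# Pages 1 through 10 inclusive.
urls = [page_url(n) for n in range(1, 11)]
for u in urls:
    print(u)
```

`urlencode` also escapes search terms that contain spaces or other special characters, which plain string formatting would not do.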