Web Scraping & BeautifulSoup - Next Page parsing
I am just learning web scraping and would like to export the results from this site to a csv file: https://www.avbuyer.com/aircraft/private-jets
However, I am struggling to parse the next page. Here is my code (written with Amen Aziz's help), which only gives me the first page.
I am using Chrome, so I am not sure if that makes any difference. I am running Python 3.8.12.
Thank you in advance.
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}

response = requests.get('https://www.avbuyer.com/aircraft/private-jets')
soup = BeautifulSoup(response.content, 'html.parser')
postings = soup.find_all('div', class_ = 'listing-item premium')

temp = []
for post in postings:
    link = post.find('a', class_ = 'more-info').get('href')
    link_full = 'https://www.avbuyer.com' + link
    plane = post.find('h2', class_ = 'item-title').text
    price = post.find('div', class_ = 'price').text
    location = post.find('div', class_ = 'list-item-location').text
    desc = post.find('div', class_ = 'list-item-para').text
    try:
        tag = post.find('div', class_ = 'list-viewing-date').text
    except:
        tag = 'N/A'
    updated = post.find('div', class_ = 'list-update').text
    t = post.find_all('div', class_ = 'list-other-dtl')
    for i in t:
        data = [tup.text for tup in i.find_all('li')]
        years = data[0]
        s = data[1]
        total_time = data[2]
        temp.append([plane, price, location, years, s, total_time, desc, tag, updated, link_full])

df = pd.DataFrame(temp, columns = ["plane", "price", "location", "Year", "S/N", "Totaltime", "Description", "Tag", "Last Updated", "link"])

next_page = soup.find('a', {'rel': 'next'}).get('href')
next_page_full = 'https://www.avbuyer.com' + next_page
next_page_full

url = next_page_full
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')

df.to_csv('/Users/xxx/avbuyer.csv')
Try this. If you want a csv file, then after print(df) use df.to_csv("prod.csv") - I have included it in the code below so you can get the csv file.
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}

temp = []
for page in range(1, 5):
    response = requests.get("https://www.avbuyer.com/aircraft/private-jets/page-{page}".format(page=page), headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    postings = soup.find_all('div', class_='listing-item premium')
    for post in postings:
        link = post.find('a', class_='more-info').get('href')
        link_full = 'https://www.avbuyer.com' + link
        plane = post.find('h2', class_='item-title').text
        price = post.find('div', class_='price').text
        location = post.find('div', class_='list-item-location').text
        t = post.find_all('div', class_='list-other-dtl')
        for i in t:
            data = [tup.text for tup in i.find_all('li')]
            years = data[0]
            s = data[1]
            total_time = data[2]
            temp.append([plane, price, location, link_full, years, s, total_time])

df = pd.DataFrame(temp, columns=["plane", "price", "location", "link", "Years", "S/N", "Totaltime"])
print(df)
#df.to_csv("prod.csv")
Output:
plane ... Totaltime
0 Gulfstream G280 ... Total Time 2528
1 Dassault Falcon 2000LXS ... Total Time 33
2 Cirrus Vision SF50 G1 ... Total Time 615
3 Gulfstream IV ... Total Time 6425
4 Gulfstream G280 ... Total Time 1918
.. ... ... ...
75 Boeing 737 35B ... Total Time 38605.7
76 Bombardier Global 6000 ... Total Time 5129
77 Dassault Falcon 2000 ... Total Time 11731
78 Bombardier Learjet 45 ... Total Time 8420
79 Dassault Falcon 2000 ... Total Time 6760
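If you would rather follow the site's own rel="next" link (which is what the original attempt was going for) instead of hard-coding a page range, the loop below sketches that pattern. To keep the example self-contained, the pages dict is a hypothetical stand-in for the HTML bodies you would get from requests.get; in real use you would fetch each url with requests and parse response.content instead:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature pages standing in for real avbuyer.com responses;
# each page links to the next via <a rel="next">, and the last page has none.
pages = {
    '/aircraft/private-jets': '<a rel="next" href="/aircraft/private-jets/page-2">Next</a>',
    '/aircraft/private-jets/page-2': '<a rel="next" href="/aircraft/private-jets/page-3">Next</a>',
    '/aircraft/private-jets/page-3': '<p>No more results</p>',
}

def crawl(start):
    """Follow rel="next" links until none is found; return the paths visited."""
    visited, url = [], start
    while url is not None:
        visited.append(url)
        soup = BeautifulSoup(pages[url], 'html.parser')  # real code: requests.get(...)
        next_link = soup.find('a', {'rel': 'next'})      # None on the last page
        url = next_link.get('href') if next_link else None
    return visited

print(crawl('/aircraft/private-jets'))
# ['/aircraft/private-jets', '/aircraft/private-jets/page-2', '/aircraft/private-jets/page-3']
```

The key difference from the question's code is that the next-page lookup happens inside the loop and the loop ends when find returns None, so you do not need to know the page count in advance.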