
Web Scraping & BeautifulSoup - Next Page Parsing

I'm just learning web scraping and would like to output the results from this site to a csv file: https://www.avbuyer.com/aircraft/private-jets

But I'm struggling to parse the next page. Here is my code (written with help from Amen Aziz), and it only gives me the first page.
I'm using Chrome, so I'm not sure if that makes any difference. I'm running Python 3.8.12.
Thanks in advance.

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://www.avbuyer.com/aircraft/private-jets', headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
postings = soup.find_all('div', class_='listing-item premium')

temp = []
for post in postings:
    link = post.find('a', class_='more-info').get('href')
    link_full = 'https://www.avbuyer.com' + link
    plane = post.find('h2', class_='item-title').text
    price = post.find('div', class_='price').text
    location = post.find('div', class_='list-item-location').text
    desc = post.find('div', class_='list-item-para').text
    try:
        tag = post.find('div', class_='list-viewing-date').text
    except AttributeError:  # element is missing on some listings
        tag = 'N/A'
    updated = post.find('div', class_='list-update').text
    for i in post.find_all('div', class_='list-other-dtl'):
        data = [tup.text for tup in i.find_all('li')]
        years = data[0]
        s = data[1]
        total_time = data[2]
        temp.append([plane, price, location, years, s, total_time, desc, tag, updated, link_full])

df = pd.DataFrame(temp, columns=["plane", "price", "location", "Year", "S/N", "Totaltime",
                                 "Description", "Tag", "Last Updated", "link"])

# This finds and fetches the next page, but the parsing loop above never
# runs again, so only the first page ends up in df.
next_page = soup.find('a', {'rel': 'next'}).get('href')
next_page_full = 'https://www.avbuyer.com' + next_page

page = requests.get(next_page_full, headers=headers)
soup = BeautifulSoup(page.text, 'lxml')

df.to_csv('/Users/xxx/avbuyer.csv')
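What the trailing lines are missing is a loop: the next-page soup is built but never parsed. A minimal sketch of one way to close that gap, assuming the rel="next" anchor is absent on the last page (an assumption worth checking against the live site):

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.avbuyer.com/aircraft/private-jets'
temp = []

while url:
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    for post in soup.find_all('div', class_='listing-item premium'):
        # same per-listing parsing as above; shortened here to the title only
        temp.append(post.find('h2', class_='item-title').text)
    next_link = soup.find('a', {'rel': 'next'})
    # assumption: no rel="next" anchor on the final page, so the loop stops
    url = 'https://www.avbuyer.com' + next_link.get('href') if next_link else None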

Try this. If you want a csv file, then after print(df) use df.to_csv("prod.csv") - I have written it into the code (commented out) so you can get the csv file.

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}
temp = []
for page in range(1, 5):  # pages 1-4; widen the range to scrape more pages
    response = requests.get(
        "https://www.avbuyer.com/aircraft/private-jets/page-{page}".format(page=page),
        headers=headers,
    )
    soup = BeautifulSoup(response.content, 'html.parser')
    postings = soup.find_all('div', class_='listing-item premium')
    for post in postings:
        link = post.find('a', class_='more-info').get('href')
        link_full = 'https://www.avbuyer.com' + link
        plane = post.find('h2', class_='item-title').text
        price = post.find('div', class_='price').text
        location = post.find('div', class_='list-item-location').text
        for i in post.find_all('div', class_='list-other-dtl'):
            data = [tup.text for tup in i.find_all('li')]
            years = data[0]
            s = data[1]
            total_time = data[2]
            temp.append([plane, price, location, link_full, years, s, total_time])

df = pd.DataFrame(temp, columns=["plane", "price", "location", "link", "Years", "S/N", "Totaltime"])
print(df)
#df.to_csv("prod.csv")
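To actually write the frame to disk, uncomment the last line; passing index=False drops the pandas row index from the file, which is usually what you want for an export like this:

df.to_csv("prod.csv", index=False)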

Output

                      plane  ...           Totaltime
0           Gulfstream G280  ...     Total Time 2528
1   Dassault Falcon 2000LXS  ...       Total Time 33
2     Cirrus Vision SF50 G1  ...      Total Time 615
3             Gulfstream IV  ...     Total Time 6425
4           Gulfstream G280  ...     Total Time 1918
..                      ...  ...                 ...
75           Boeing 737 35B  ...  Total Time 38605.7
76   Bombardier Global 6000  ...     Total Time 5129
77     Dassault Falcon 2000  ...    Total Time 11731
78    Bombardier Learjet 45  ...     Total Time 8420
79     Dassault Falcon 2000  ...     Total Time 6760

