繁体   English   中英

网络抓取时如何单步浏览网页

[英]How to Step Through Pages When Web-Scraping

我已经写了下面的代码来刮擦cargurus网站。 搜索显示每页15个条目。

我想迭代地从页面1移至n并刮擦每一页。 下面的代码应该这样做,但是在脚本结尾,我有一个数据帧df,它复制了第一页numPages次。

我以为代码没有给计算机足够的时间来接收请求,所以我添加了time.sleep(1)行,但这似乎不起作用。

我究竟做错了什么?

# Import Modules
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
import seaborn as sns
import time

#Utility Functions
def to_number(s):
    #Convert to  Number
    numval = int(s.replace(',',''))
    return numval

def get_location(s):
    #Convert to  City, State (SS), and zip (string)
    s = s.replace(',','')
    sList = s.split()
    n = len(sList)-1
    City = ''
    for word in sList[0:n-1]:
        City += word  + ' '
    City = City[:-1]
    State = sList[n-1]
    Zip = sList[n]
    return City, State, Zip

def get_YearMakeModelTrim(s):
    #Convert to  Year, Make, Model, Trim
    sList = s.split()
    n = len(sList)-1
    Year = sList[0]
    Make = sList[1]
    Model = sList[2]
    if n == 3:
        Trim = sList[3]
    else:
        Trim = "None"
    return Year, Make, Model, Trim

numPages = 10

baseURL = 'https://www.cargurus.com/Cars/inventorylisting/viewDetailsFilterViewInventoryListing.action?sourceContext=forSaleTab_false_0&newSearchFromOverviewPage=true&inventorySearchWidgetType=AUTO&entitySelectingHelper.selectedEntity=c24578&entitySelectingHelper.selectedEntity2=c25202&zip=03062&distance=50000&searchChanged=true&modelChanged=false&filtersModified=true#resultsPage={}'


data = []
for ii in range(numPages):
    URL = baseURL.format(ii+1)
    print(URL)

    r  = requests.get(URL).text
    time.sleep(1)
    soup = bs(r,'html.parser')

    stats = soup.find_all("div", attrs = {"class": "cg-dealFinder-result-stats"})
    deals = soup.find_all("div", attrs = {"class": "cg-dealFinder-result-deal"})
    titles = soup.find_all("h4", {"class":"cg-dealFinder-result-model"})

    for title, deal, stat in zip(titles,deals,stats):
        row = {}
        row["Price"] = to_number(stat.find('span').get_text()[1:])
        row["Mileage"] = to_number(stat.find_all("p")[1].text[9:])
        row["City"],  row["State"], row["Zip"] = get_location(stat.find_all("p")[2].text[10:])
        row["natAvgPrice"] = to_number(deal.find('span', attrs = {'class': 'nationalAvg'}).get_text()[17:])
        row["Year"], row["Make"],  row["Model"], row["Trim"] = get_YearMakeModelTrim(title.find('span', attrs = {'itemprop': 'name'}).get_text())
        row["NewUsed"] = title.find('span', attrs = {'class': 'invisibleLayer'}).get_text()[:-5]
        data.append(row)

df = pd.DataFrame(data)
#df = df.drop_duplicates()

sns.pairplot(x_vars=["Mileage"], y_vars=["Price"], data=df, hue="Trim", size=5)

此页面使用JavaScript / AJAX从url读取数据

https://www.cargurus.com/zh_CN/Cars/inventorylisting/ajaxFetchSubs etetInventoryListing.a‌ction?sourceContext = ‌forSaleTab_false_0

它使用带有参数的POST请求,并且有参数page

from bs4 import BeautifulSoup
import requests

params = {
    'zip': '03062',
    'address': 'Nashua,+NH',
    'latitude': "42.73040008544922",
    'longitude': '-71.49479675292969',
    'distance': 50000,
    'selectedEntity': 'c24578',
    'entitySelectingHelper.selectedEntity2': 'c25202',
    'minPrice': '',
    'maxPrice': '', 
    'minMileage': '',   
    'maxMileage': '',   
    'transmission': 'ANY',
    'bodyTypeGroup': '',    
    'serviceProvider': '',  
    'page': 1,
    'filterBySourcesString': '',
    'filterFeaturedBySourcesString': '',
    'displayFeaturedListings': True,
    'searchSeoPageType': '',    
    'inventorySearchWidgetType': 'AUTO',
    'allYearsForTrimName': False,
    'daysOnMarketMin': '',  
    'daysOnMarketMax': '',
    'vehicleDamageCategoriesRaw': '',
    'minCo2Emission': '',
    'maxCo2Emission': '',
    'vatOnly': False,
    'minEngineDisplacement': '',
    'maxEngineDisplacement': '',
    'minMpg': '',
    'maxMpg': '',   
    'startYear': 2015,
    'endYear': 2016,
    'isRecentSearchView': False,
}

url = 'https://www.cargurus.com/Cars/inventorylisting/ajaxFetchSubsetInventoryListing.action?sourceContext=forSaleTab_false_0'

display_keys = True

for x in range(1, 4):

    params['page'] = x

    response = requests.post(url, data=params)

    data = response.json()

    if display_keys:
        display_keys = False
        for key in data.keys():
            print('key:', key)
        for key in data['listings'][0].keys():
            print("data['listings'] key:", key)
        print('-----')

    print('--- offers number:', len( data['listings']), '---')
    for item in data['listings'][:10]:
        print(item['id'], data['modelName'], item['modelName'], item['trimName'])

结果-键

key: listings
key: modelName
key: styleSet
key: modelId
key: serviceProviders
key: page
key: sellers
key: remainingResults
data['listings'] key: bodyType
data['listings'] key: fleet
data['listings'] key: serviceProviderId
data['listings'] key: saved
data['listings'] key: highwayFuelEconomy
data['listings'] key: modelId
data['listings'] key: nonwholesaleSellerId
data['listings'] key: isFranchiseDealer
data['listings'] key: regressionPrice
data['listings'] key: rating
data['listings'] key: listedDate
data['listings'] key: dealerRatingPriceAdjustment
data['listings'] key: isOEMCPO
data['listings'] key: sellerId
data['listings'] key: transmission
data['listings'] key: mainPictureUrl
data['listings'] key: monthlyPayment
data['listings'] key: price
data['listings'] key: exteriorColorName
data['listings'] key: id
data['listings'] key: isFeatured
data['listings'] key: mileage
data['listings'] key: makeId
data['listings'] key: zip
data['listings'] key: noPhotos
data['listings'] key: isCertified
data['listings'] key: msrpString
data['listings'] key: engineCylinders
data['listings'] key: expectedPriceString
data['listings'] key: trimName
data['listings'] key: daysOnMarket
data['listings'] key: scaleMainPictureOnLoad
data['listings'] key: vehicleDamageCategory
data['listings'] key: monthlyPaymentString
data['listings'] key: isOutlier
data['listings'] key: cityFuelEconomy
data['listings'] key: savingsAmount
data['listings'] key: ownerCount
data['listings'] key: absoluteRating
data['listings'] key: salvage
data['listings'] key: contacted
data['listings'] key: priceString
data['listings'] key: distance
data['listings'] key: originalPrice
data['listings'] key: sellerRating
data['listings'] key: mileageString
data['listings'] key: engineType
data['listings'] key: wheelSystemDisplay
data['listings'] key: isDisplayConquestSection
data['listings'] key: serviceProviderName
data['listings'] key: carYear
data['listings'] key: savingsRecommendation
data['listings'] key: specificOptionIds
data['listings'] key: lemon
data['listings'] key: vehicleIdentifier
data['listings'] key: bodyTypeGroupId
data['listings'] key: useAnonymousContactEmail
data['listings'] key: msrp
data['listings'] key: sellerCity
data['listings'] key: bodyTypeGroupName
data['listings'] key: savingsArrowImage
data['listings'] key: dealScore
data['listings'] key: frameDamaged
data['listings'] key: hasAccidents
data['listings'] key: isCPO
data['listings'] key: expectedPrice
data['listings'] key: engineDisplacement
data['listings'] key: priceDifferentialString
data['listings'] key: trimLevelName
data['listings'] key: isNew
data['listings'] key: modelName
data['listings'] key: bodyTypeId
data['listings'] key: theftTitle
data['listings'] key: fuelType
data['listings'] key: maxSeating
data['listings'] key: wheelSystem
data['listings'] key: isConquestEnabled
data['listings'] key: autoEntityId
data['listings'] key: franchiseMake
data['listings'] key: optionIds
data['listings'] key: makeName
-----

结果-每个请求我只显示前10个项目(使用不同的page

--- offers number: 2000 ---
190057566 Honda Odyssey Odyssey Touring Elite
194518873 Honda Odyssey Odyssey 
184211547 Honda Odyssey Odyssey Touring Elite
185999601 Honda Odyssey Odyssey EX-L
191225205 Honda Odyssey Odyssey EX-L
192457272 Honda Odyssey Odyssey EX-L
190727203 Honda Odyssey Odyssey EX-L
189805101 Honda Odyssey Odyssey EX-L
190017310 Honda Odyssey Odyssey EX-L
185841600 Honda Odyssey Odyssey SE
--- offers number: 1985 ---
189574780 Honda Odyssey Odyssey EX-L
185923444 Honda Odyssey Odyssey EX-L
193088921 Honda Odyssey Odyssey Touring Elite
191861106 Honda Odyssey Odyssey EX-L
188361750 Honda Odyssey Odyssey Touring
185077447 Honda Odyssey Odyssey EX-L
182773821 Honda Odyssey Odyssey SE
189573553 Honda Odyssey Odyssey EX
191224649 Honda Odyssey Odyssey EX-L
179786502 Honda Odyssey Odyssey EX
--- offers number: 1970 ---
192649298 Honda Odyssey Odyssey Touring Elite
188612484 Honda Odyssey Odyssey EX-L
182338399 Honda Odyssey Odyssey EX
193159667 Honda Odyssey Odyssey EX-L
188979870 Honda Odyssey Odyssey EX-L
194311827 Honda Odyssey Odyssey EX
181047736 Honda Odyssey Odyssey EX-L
189115988 Honda Odyssey Odyssey EX-L
183408178 Honda Odyssey Odyssey EX-L
188950701 Honda Odyssey Odyssey EX-L

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM