
Missing data when scraping website using loop

Beginner coder here trying to do a simple website scrape. I want to pull item attributes from multiple pages of a search result. I can do that, but my issue is that items towards the end of each page seem to be missing.

Is it a simple error with my loop/counters? The example code below should just print x for the xth search result.

import requests
from bs4 import BeautifulSoup
import xlwt

headers = {'user-agent': 'Mozilla/5.0'}
pagelimit = 60  # number of results per page
startoffset = 0  # starting offset (number of items to skip)


def extract(soup,count):
    x = count
    for div in soup.findAll("div", "result-item standard"):
        print(x)
        x = x+1

offset = startoffset
count = 1
for i in range(0,10):
    url = "http://www.carsales.com.au/cars/results?offset=" + \
    str(offset) + \
    "&q=%28Service%3D%5BCarsales%5D%26%28%28SiloType%3D%5BDealer%20" + \
    "used%20cars%5D%7CSiloType%3D%5BDemo%20and%20near%20new%" + \
    "20cars%5D%29%7CSiloType%3D%5BPrivate%20seller%20cars%5D%29%29" + \
    "&sortby=~Price&limit=" + \
    str(pagelimit) + "&cpw=1"

    r = requests.get(url, headers)
    soup = BeautifulSoup(r.text, "html.parser")
    extract(soup,count)

    offset = str(i*pagelimit+int(startoffset))
    count = count + pagelimit

Your code makes two assumptions which might lead to missing results.

The first assumption is that every page returns the maximum number of results (pagelimit), which the final page is unlikely to do. You should have the extract method return the final value of x:

def extract(soup,count):
    x = count
    for div in soup.findAll("div", "result-item standard"):
        print(x)
        x = x+1
    return x

Then you should replace count = count + pagelimit with something like count = extract(soup,count).

You can also then use this number to set the offset.
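
Putting both suggestions together, one possible shape for the revised loop is sketched below. This is only a sketch, not a drop-in fix: build_url is a hypothetical helper that assembles the same search URL as in the question, so only the offset and limit change between pages.

import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0'}
pagelimit = 60

def build_url(offset, limit):
    # Hypothetical helper wrapping the long search URL from the question.
    return ("http://www.carsales.com.au/cars/results?offset=" + str(offset) +
            "&q=%28Service%3D%5BCarsales%5D%26%28%28SiloType%3D%5BDealer%20"
            "used%20cars%5D%7CSiloType%3D%5BDemo%20and%20near%20new%"
            "20cars%5D%29%7CSiloType%3D%5BPrivate%20seller%20cars%5D%29%29"
            "&sortby=~Price&limit=" + str(limit) + "&cpw=1")

def extract(soup, count):
    x = count
    for div in soup.findAll("div", "result-item standard"):
        print(x)
        x = x + 1
    return x  # final counter value: how far through the results we are

count = 1
offset = 0
for i in range(0, 10):
    r = requests.get(build_url(offset, pagelimit), headers=headers)
    soup = BeautifulSoup(r.text, "html.parser")
    count = extract(soup, count)  # count now reflects results actually seen
    offset = count - 1            # next page starts after the results printed so far

This way the counter and the offset are driven by the number of results each page actually contained, rather than by pagelimit.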

The second assumption is that there will always be at least 10 pages of cars. If there are fewer than 10 pages, your code may behave strangely when you loop beyond the end of the results list.
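
One way to guard against that (again a sketch, not part of the original answer) is to drop the fixed range(0, 10) and keep requesting pages until a page comes back with fewer results than pagelimit. This reuses the hypothetical build_url helper, headers and pagelimit from the sketch above:

offset = 0
count = 1
while True:
    r = requests.get(build_url(offset, pagelimit), headers=headers)
    soup = BeautifulSoup(r.text, "html.parser")
    results = soup.findAll("div", "result-item standard")
    for div in results:
        print(count)
        count += 1
    if len(results) < pagelimit:  # a short or empty page means we reached the last page
        break
    offset = count - 1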
