
Fetching large paginated data from a REST API with Python

I'm pulling data from a REST API. The problem is that the data set is huge, so the response is paginated. I've worked around that by first reading how many pages of data there are and then issuing a request for each page. The only problem is that the total number of pages is around 1.5K, which takes a huge amount of time to fetch and append to a CSV. Is there a faster workaround for this?

This is the endpoint I'm targeting: https://developer.keeptruckin.com/reference#get-logs

import csv
import requests

url = 'https://api.keeptruckin.com/v1/logs?start_date=2019-03-09'
header = {'x-api-key': 'API KEY HERE'}

r = requests.get(url, headers=header)
result = r.json()  # r.json() already parses the body; json.loads(r.text) was redundant
num_pages = result['pagination']['total']
print(num_pages)

# Drivers whose events should be written out
usernames = {'barmx1045', 'aposx001', 'mcqkl002', 'coudx014', 'ruscx013',
             'loumx001', 'robkr002', 'masgx009', 'coxed001', 'mcamx009',
             'linmx024', 'woldj002', 'fosbl004'}

csvheader = ['First Name', 'Last Name', 'Date', 'Time', 'Type', 'Location']
# Open the file once instead of reopening it on every page
with open('myfile.csv', 'a+', newline='') as csvfile:
    # QUOTE_ALL must be passed as the quoting= keyword, not as the dialect
    writer = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
    ##writer.writerow(csvheader)
    for page in range(2, num_pages + 1):
        r = requests.get(url, headers=header, params={'page_no': page})
        result = r.json()
        for log in result['logs']:
            driver = log['log']['driver']
            if driver['username'] not in usernames:
                continue
            first_name = driver['first_name']
            last_name = driver['last_name']
            for event in log['log']['events']:
                date, time = event['event']['start_time'].split('T')
                event_type = event['event']['type']
                location = event['event']['location'] or 'N/A'
                writer.writerow((first_name, last_name, date, time, event_type, location))

A first option: most paginated APIs let you change the page size. See https://developer.keeptruckin.com/reference#pagination — try raising the per_page parameter to 100 instead of the default 25, so each request returns four times as much data.
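To see why this helps, the request count is just the total record count divided by the page size, rounded up. A minimal sketch (the 37,500-log figure is an assumption derived from "1.5K pages at 25 per page" in the question):

```python
import math

def num_requests(total_logs: int, per_page: int) -> int:
    """Number of paginated requests needed to fetch total_logs records."""
    return math.ceil(total_logs / per_page)

# ~1.5K pages at the default page size of 25 implies ~37,500 logs.
print(num_requests(37500, 25))   # 1500 requests at per_page=25
print(num_requests(37500, 100))  # 375 requests at per_page=100
```

In the question's code this just means adding `'per_page': 100` to the `params` dict alongside `'page_no'` (and recomputing `num_pages` from the first response, since the page count shrinks accordingly).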

A second option: you can pull more than one page at a time by using multiple threads or processes, splitting the page range so each worker is responsible for its own portion.
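A minimal sketch of the threaded approach with the standard library's ThreadPoolExecutor. The `fetch_page` callable is a hypothetical hook — in practice it would wrap the `requests.get(..., params={'page_no': page})` call from the question and return the parsed JSON for that page:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all_pages(fetch_page, num_pages, max_workers=8):
    """Fetch pages 1..num_pages concurrently, preserving page order.

    fetch_page: callable taking a page number and returning that page's
    parsed payload (hypothetical; adapt to the question's requests.get call).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in submission order, so page N's payload
        # lands at index N-1 even though the requests run in parallel.
        return list(pool.map(fetch_page, range(1, num_pages + 1)))
```

Collect all pages first and write the CSV once at the end: the `csv` module is not thread-safe, so keeping file writes out of the worker threads avoids interleaved rows. Since the workers are I/O-bound (waiting on HTTP responses), threads are enough here and the GIL is not a bottleneck.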
