
Python - how to extract data by looping through paginated API (Harvest)


First of all, I have only been working with Python for a couple of days, so I don't necessarily know the best practices or all the terminology ... yet. I learn best by reverse engineering, and my code below is based on the official documentation from Harvest and other bits I've found with google-fu.

My goal is to download all the time entry records from Harvest and save them as a JSON file (or ideally a CSV file).

Official Python example from the Harvest GitHub

This is my adapted code (including all outputs, which won't be necessary in the final code but are handy for my learning):

import json
import urllib.request

#Set variables for authorisation
AUTH = "REDACTED"
ACCOUNT = "REDACTED"

URL = "https://api.harvestapp.com/v2/time_entries"
HEADERS = { "Authorization": AUTH,
            "Harvest-Account-ID": ACCOUNT}
PAGENO = str("5")

request = urllib.request.Request(url=URL+"?page="+PAGENO, headers=HEADERS)
response = urllib.request.urlopen(request, timeout=5)
responseBody = response.read().decode("utf-8")
jsonResponse = json.loads(responseBody)

# Find the values for pagination
parsed = json.loads(responseBody)
links_first = parsed["links"]["first"]
links_last = parsed["links"]["last"]
links_next = parsed["links"]["next"]
links_previous = parsed["links"]["previous"]
nextpage = parsed["next_page"]
page = parsed["page"]
perpage = parsed["per_page"]
prevpage = parsed["previous_page"]
totalentries = parsed["total_entries"]
totalpages = parsed["total_pages"]

#Print the output
print(json.dumps(jsonResponse, sort_keys=True, indent=4))
print("first link : " + links_first)
print("last link : " + links_last)
print("next page : " + str(nextpage))
print("page : " + str(page))
print("per page : " + str(perpage))
print("total records : " + str(totalentries))
print("total pages : " + str(totalpages))

The output response is:
"Squeezed text (5816 lines)"
first link : https://api.harvestapp.com/v2/time_entries?page=1&per_page=100
last link : https://api.harvestapp.com/v2/time_entries?page=379&per_page=100
next page : 6
page : 5
per page : 100
total records : 37874
total pages : 379

Can someone please advise on the best way to loop through the pages to form one JSON file? If you can also advise on the best way to then output that JSON file, I would be very grateful.

I have been using the following code to retrieve all time entries. It could perhaps be a bit more efficient, but it works. The function get_all_time_entries loops through all the pages, appends each parsed JSON response to the all_time_entries list, and finally returns that list serialized as a formatted JSON string.

import requests
import json

def get_all_time_entries():

    url_address = "https://api.harvestapp.com/v2/time_entries"  
    headers = {
        "Authorization": "Bearer " + "xxxxxxxxxx",
        "Harvest-Account-ID": "xxxxxx"
    }

    # find out total number of pages
    r = requests.get(url=url_address, headers=headers).json()
    total_pages = int(r['total_pages'])

    # results will be appended to this list
    all_time_entries = []

    # loop through all pages (range() excludes its upper bound, hence the + 1)
    for page in range(1, total_pages + 1):

        url = "https://api.harvestapp.com/v2/time_entries?page="+str(page)
        response = requests.get(url=url, headers=headers).json()
        all_time_entries.append(response)

    # prettify JSON
    data = json.dumps(all_time_entries, sort_keys=True, indent=4)

    return data

print(get_all_time_entries())
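
A variant of the loop above, as a minimal sketch only: instead of computing the page range up front, it follows the next_page field that the question's output shows in every response, and it collects just the records of each page rather than the whole response. It assumes each page lists its records under a "time_entries" key and that next_page is null on the last page. The names fetch_page and collect_time_entries are hypothetical, and the token and account ID are placeholders. The fetch function is passed in as a parameter so the paging logic can be tried without touching the network.

```python
import json
import urllib.request

URL = "https://api.harvestapp.com/v2/time_entries"
HEADERS = {
    "Authorization": "Bearer " + "xxxxxxxxxx",  # placeholder token
    "Harvest-Account-ID": "xxxxxx",             # placeholder account ID
}

def fetch_page(page):
    """Fetch one page of results and return the decoded JSON body."""
    request = urllib.request.Request(url=f"{URL}?page={page}", headers=HEADERS)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))

def collect_time_entries(fetch=fetch_page):
    """Follow next_page until it is None, gathering the records of every page."""
    entries = []
    page = 1
    while page is not None:
        body = fetch(page)
        # Assumption: each page lists its records under "time_entries".
        entries.extend(body["time_entries"])
        # Assumption: next_page is None (JSON null) on the last page.
        page = body["next_page"]
    return entries
```

Calling collect_time_entries() with no arguments would use the real fetch_page; the result is one flat list of entries rather than a list of page responses.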

You can easily redirect the output of the script to a local file with ">" when running it in PowerShell, etc.

For example:

Python.exe example.py > C:\temp\all_time_entries.json
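
As an alternative to shell redirection, the file can also be written from Python itself, which covers the CSV case from the question too. This is a rough sketch, assuming the entries are available as a flat list of dicts; real Harvest entries contain nested objects, which would need flattening before a CSV export is useful. The helper names save_json and save_csv are hypothetical.

```python
import csv
import json

def save_json(entries, path):
    """Write the entries to a formatted JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(entries, f, sort_keys=True, indent=4)

def save_csv(entries, path):
    """Write the entries to a CSV file, one row per entry."""
    # Use the union of keys across all entries as the header row,
    # so entries that lack a key get an empty cell instead of failing.
    fieldnames = sorted({key for entry in entries for key in entry})
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(entries)
```

For example, save_json(entries, "all_time_entries.json") and save_csv(entries, "all_time_entries.csv") would write both formats next to the script.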

Hope this helps!

There's a Python library that supports Harvest API v2.

The library supports all of the authentication methods, request rate limiting, and response codes, and has dataclasses for each of the response objects.

The library is very well tested, so you will find a usage example for each endpoint in the tests. The tests use the official Harvest examples.

Additionally, there is an example detailed time report which inherits from the Harvest object. The tests for the detailed time report show how to use it.

The library is referenced from the Harvest software directory: https://www.getharvest.com/integrations/python-library

Project URL: https://github.com/bradbase/python-harvest_apiv2

I own the project.


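from harvest import Harvest
# absolute import, since this script runs outside the harvest package
from harvest.harvestdataclasses import *

class MyTimeEntries(Harvest):

    def __init__(self, uri, auth):
        super().__init__(uri, auth)


    def time_entries(self):
        """Return the time entries from every page as one list."""
        time_entry_results = []

        # Call the parent method explicitly; self.time_entries() would
        # re-enter this override and recurse forever.
        time_entries = super().time_entries()
        time_entry_results.extend(time_entries.time_entries)
        if time_entries.total_pages > 1:
            for page in range(2, time_entries.total_pages + 1):
                time_entries = super().time_entries(page=page)
                time_entry_results.extend(time_entries.time_entries)

        return time_entry_results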

personal_access_token = PersonalAccessToken('ACCOUNT_NUMBER', 'PERSONAL_ACCESS_TOKEN')
my_report = MyTimeEntries('https://api.harvestapp.com/api/v2', personal_access_token)
time_entries = my_report.time_entries()
