How can I get all 1000 results from the GitHub Search API?

Question

I understand that the GitHub Search API is limited to 1000 results in total and 100 results per page. I therefore wrote the following to view all 1000 results of a code search for the string torch:

import requests
for i in range(1,11):
    url = "https://api.github.com/search/code?q=torch +in:file + language:python&per_page=100&page="+str(i)

    headers = {
    'Authorization': 'xxxxxxxx'
    }

    response = requests.request("GET", url, headers=headers).json()
    try:
        print(len(response['items']))
    except:
        print("response = ", response)

Here is the output:

15
62
response =  {'documentation_url': 'https://docs.github.com/en/free-pro-team@latest/rest/overview/resources-in-the-rest-api#secondary-rate-limits', 'message': 'You have exceeded a secondary rate limit. Please wait a few minutes before you try again.'}

It seems to hit the secondary rate limit just after the second iteration.
The values in the pages aren't consistent. For instance, page 1 showed 15 results on this run, but if I run it again it shows a different number. I believe there should be 100 results per page.

Is there an efficient way to get all 1000 results from the Search API?

Answer 1

There are two things happening here:

You are receiving incomplete results because the query is timing out.
You are being rate limited.
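
The first point can be checked directly: the Search API response body carries an `incomplete_results` flag alongside `total_count` and `items`. Below is a minimal sketch of reading it. The helper name `summarize_page` is hypothetical, and the sample dictionary is a made-up stand-in for `response.json()`.

```python
def summarize_page(data):
    """Return (item_count, is_complete) for one decoded Search API page."""
    items = data.get("items", [])
    # incomplete_results is True when the search timed out before
    # every match could be collected.
    is_complete = not data.get("incomplete_results", False)
    return len(items), is_complete


# Stand-in for response.json(); the values are invented for illustration.
sample = {"total_count": 1000, "incomplete_results": True, "items": [{}] * 15}
count, complete = summarize_page(sample)
print(f"{count} items, complete={complete}")
```

A page that reports fewer items than `per_page` together with `incomplete_results` set to true is a timed-out query rather than a paging error, so retrying the same page may return a different count.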

The Search API has its own rate limits. See the GitHub documentation:

The REST API for searching items has a custom rate limit that is separate from the rate limit governing the other REST API endpoints.
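
One way to respect those limits is to read the rate-limit headers GitHub returns with each response. The sketch below is not part of the original answer: `seconds_to_wait` is a hypothetical helper, and it assumes `retry-after` is given in whole seconds and `x-ratelimit-reset` as a UTC epoch timestamp.

```python
import time


def seconds_to_wait(headers, now=None):
    """Suggest how long to sleep (in seconds) before the next request."""
    now = time.time() if now is None else now
    # Normalize header names so a plain dict works as well as response.headers.
    h = {k.lower(): v for k, v in headers.items()}
    if "retry-after" in h:
        # Assumption: the value is a number of seconds, not an HTTP date.
        return float(h["retry-after"])
    if h.get("x-ratelimit-remaining") == "0" and "x-ratelimit-reset" in h:
        # Wait until the current rate-limit window resets.
        return max(0.0, float(h["x-ratelimit-reset"]) - now)
    return 0.0
```

It could be used as `time.sleep(seconds_to_wait(response.headers))` after each request, in place of or in addition to a fixed backoff.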

I would recommend requesting fewer results per page to address the incomplete results.
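
The page size also determines how many of the 1000 results can be reached. The arithmetic below is a sketch under one assumption: a page that would extend past the 1000th result is rejected rather than truncated. `reachable_results` is a hypothetical helper name.

```python
MAX_RESULTS = 1000  # total cap stated in the question


def reachable_results(per_page):
    """Return (pages, results) using only pages that end at or before the cap."""
    full_pages = MAX_RESULTS // per_page
    return full_pages, full_pages * per_page


for per_page in (100, 50, 33, 25):
    pages, total = reachable_results(per_page)
    print(f"per_page={per_page}: {pages} pages, {total} results")
```

Under that assumption, a page size that divides 1000 evenly (50 or 25, for example) reaches every result, while 33 stops at 990 after 30 pages. Smaller pages mean more requests, so this trades fewer timeouts for more pressure on the rate limit.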

You will also need to be very deliberate about the requests you're making, because the limits are low. Getting the full 1000 may be impossible without requesting a rate increase or implementing a very long backoff.

I modified your code to add a primitive exponential backoff, but this still doesn't produce the full 1000 results and takes a while:

import requests
import time

headers = {
    'Authorization': 'token <TOKEN>'
}

results = []
for i in range(1, 31):
    # 30 pages of 33 results each stays under the 1000-result cap
    url = "https://api.github.com/search/code?q=torch +in:file + language:python&per_page=33&page="+str(i)
    backoff = 2 # backoff in seconds
    while backoff < 1024:
        time.sleep(backoff)
        try:
            response = requests.request("GET", url, headers=headers)
            response.raise_for_status() # throw an exception for HTTP 400 and 500s
            data = response.json()
            results.extend(data['items']) # collect items into one flat list
            print(f'Got {len(data["items"])} results for page {i}.')
            break
        except requests.exceptions.RequestException as e:
            print('ERROR: Failed to make request: ', e)
            backoff *= 2 # double the wait after each failure
    if backoff >= 1024:
        print('ERROR: Backoff limit reached.')
        break
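
To see how long a doubling backoff can take, the retry schedule can be looked at in isolation. This is a small illustrative sketch: `backoff_delays` is a hypothetical helper, with the 2-second start and 1024-second limit taken from the code above.

```python
def backoff_delays(start=2, limit=1024):
    """Yield sleep durations in seconds, doubling until the limit is reached."""
    delay = start
    while delay < limit:
        yield delay
        delay *= 2


delays = list(backoff_delays())
print(delays)
print(f"worst case per page: {sum(delays)} seconds")
```

With doubling, a single page can be retried nine times and sleep for roughly 17 minutes in total before giving up, which is why a full run can take a long while.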

Answer 1
1 2022-12-22 02:21:33
