
Pagination in Python for API from Solr

I am using Python to extract data from a Solr API, like so:

import requests

user = 'my_username'
password = 'my password'
url = 'my_url'

print("Accessing API..")
req = requests.get(url=url, auth=(user, password))
print("Accessed!")
out = req.json()
#print(out)

However, for some of the API URLs the output is fairly "large" (many of the columns are lists of dictionaries), and so the request doesn't return all of the rows, which I need.

From looking around, it looks like I should use pagination to bring in results in specified increments. Something like this:

url = 'url?start=0&rows=1000'

Then,

url = 'url?start=1000&rows=1000'

and so on, until no results are returned.

The way I am thinking about it is to write a loop and append the result to the output on every iteration. However, I am not sure how to do that.

Would someone be able to help please?

Thank you in advance!

Did you look at the output? In my experience, the Solr response usually includes a 'numFound' in its result. On an (old) Solr instance I have locally, doing a random query, I get this result:

{
  "responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": {
      "q": "*:*",
      "indent": "true",
      "start": "0",
      "rows": "10",
      "wt": "json",
      "_": "1509460751164"
    }
  },
  "response": {
    "numFound": 7023,
    "start": 0,
    "docs": [.. 10 docs]
  }
}
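
As a quick sanity check, here is a minimal sketch (reusing the placeholder credentials and URL from the question) that asks Solr for zero rows and reads numFound to get the total hit count before paging:

import requests

user = 'my_username'
password = 'my password'

# rows=0 returns only the response header and the count, no documents
req = requests.get('my_url?rows=0&start=0', auth=(user, password))
total_found = req.json().get('response', {}).get('numFound', 0)
print('Total documents matching the query:', total_found)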

While working out this code example, I realized you don't really need numFound. Solr will just return an empty list for docs if there are no further results, which makes the loop easier to write.

import requests

user = 'my_username'
password = 'my password'

# Starting values
start = 0
rows = 1000  # static, but easier to manipulate if it's a variable
base_url = 'my_url?rows={0}&start={1}'

url = base_url.format(rows, start)
req = requests.get(url=url, auth=(user, password))
out = req.json()

total_found = out.get('response', {}).get('numFound', 0)  # total hit count; not strictly needed for the loop below

# Advance start by rows, so the next request fetches the following batch
start += rows


results = out.get('response', {}).get('docs', [])
all_results = results

# Results will be an empty list if no more results are found
while results:
    # Rebuild the url based on the current start.
    url = base_url.format(rows, start)
    req = requests.get(url=url, auth=(user, password))
    out = req.json()
    results = out.get('response', {}).get('docs', [])
    all_results += results
    start += rows

# all_results now contains the 'docs' from every request.
print(all_results)

Mind you, those docs will be dict-like, so more parsing will be needed.
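
For example, a minimal sketch of that parsing; the 'id' and 'title' field names here are hypothetical and depend on your Solr schema:

for doc in all_results:
    # Hypothetical field names; substitute whatever fields your schema actually returns
    doc_id = doc.get('id')
    title = doc.get('title', '')
    print(doc_id, title)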
