python: getting npm package data from a couchdb endpoint

I want to fetch the npm package metadata. I found this endpoint, which gives me all the metadata I need.

My plan is to select some specific keys and add that data to a database (I could also store it in a JSON file, but the data is huge). I wrote the following script to fetch the data:

import requests
import json
import sys

db = 'https://replicate.npmjs.com';

r = requests.get('https://replicate.npmjs.com/_all_docs', headers={"include_docs" : "true"})

for line in r.iter_lines():
    # filter out keep-alive new lines
    if line:
        print(line)
        decoded_line = line.decode('utf-8')
        print(json.loads(decoded_line))

Notice that I don't even include all docs, but the script gets stuck in what looks like an infinite loop. I think this is because the data is huge.

A look at the head of the output from https://replicate.npmjs.com/_all_docs gives me the following output:

{"total_rows":1017703,"offset":0,"rows":[
{"id":"0","key":"0","value":{"rev":"1-5fbff37e48e1dd03ce6e7ffd17b98998"}},
{"id":"0-","key":"0-","value":{"rev":"1-420c8f16ec6584c7387b19ef401765a4"}},
{"id":"0----","key":"0----","value":{"rev":"1-55f4221814913f0e8f861b1aa42b02e4"}},
{"id":"0-1-project","key":"0-1-project","value":{"rev":"1-3cc19950252463c69a5e717d9f8f0f39"}},
{"id":"0-100","key":"0-100","value":{"rev":"1-c4f41a37883e1289f469d5de2a7b505a"}},
{"id":"0-24","key":"0-24","value":{"rev":"1-e595ec3444bc1039f10c062dd86912a2"}},
{"id":"0-60","key":"0-60","value":{"rev":"2-32c17752acfe363fa1be7dbd38212b0a"}},
{"id":"0-9","key":"0-9","value":{"rev":"1-898c1d89f7064e58f052ff492e94c753"}},
{"id":"0-_-0","key":"0-_-0","value":{"rev":"1-d47c142e9460c815c19c4ed3355d648d"}},
{"id":"0.","key":"0.","value":{"rev":"1-11c33605f2e3fd88b5416106fcdbb435"}},
{"id":"0.0","key":"0.0","value":{"rev":"1-5e541d4358c255cbcdba501f45a66e82"}},
{"id":"0.0.1","key":"0.0.1","value":{"rev":"1-ce856c27d0e16438a5849a97f8e9671d"}},
{"id":"0.0.168","key":"0.0.168","value":{"rev":"1-96ab3047e57ca1573405d0c89dd7f3f2"}},
{"id":"0.0.250","key":"0.0.250","value":{"rev":"1-c07ad0ffb7e2dc51bfeae2838b8d8bd6"}}, 

Notice that all the documents start from the second line (i.e. all the documents are part of the "rows" key's values). My question is how to get only the values of the "rows" key (i.e. all the documents). I found this repository for a similar purpose, but I can't use or convert it as I am a total beginner in JavaScript.

If there is no stream=True among the arguments of get(), then the whole response will be downloaded into memory before the loop over the lines even starts.
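A minimal sketch of the streaming variant, assuming the one-row-per-line layout shown in the question holds for the whole response. Each row line only becomes valid JSON once its trailing comma is removed, so this is fragile and depends on how the server happens to format its output. The function names and the timeout value are made up for illustration; include_docs is passed as a query parameter here, since sending it as an HTTP header has no effect.

```python
import json


def iter_rows_from_lines(lines):
    """Yield one dict per row line, skipping blank, header and footer lines."""
    for raw in lines:
        if not raw:
            continue  # keep-alive blank line
        text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        text = text.strip().rstrip(",")  # a row line ends with a comma
        if not text.startswith('{"id"'):
            continue  # header {"total_rows":...,"rows":[ or footer ]}
        yield json.loads(text)


def stream_all_docs(url="https://replicate.npmjs.com/_all_docs", include_docs=False):
    """Stream the endpoint line by line instead of loading it all at once."""
    import requests  # third-party; only needed for the network call

    params = {"include_docs": "true"} if include_docs else {}
    with requests.get(url, params=params, stream=True, timeout=60) as response:
        response.raise_for_status()
        yield from iter_rows_from_lines(response.iter_lines())
```

Iterating with `for row in stream_all_docs(): print(row["id"])` would then print ids as they arrive rather than after the full download.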

Then there is the problem that the lines themselves are not valid JSON. You'll need an incremental JSON parser like ijson for this. ijson in turn wants a file-like object, which isn't easily obtained from the requests.Response, so I will use urllib from the Python standard library here:

#!/usr/bin/env python3
from urllib.request import urlopen

import ijson


def main():
    with urlopen('https://replicate.npmjs.com/_all_docs') as json_file:
        for row in ijson.items(json_file, 'rows.item'):
            print(row)


if __name__ == '__main__':
    main()
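If installing ijson is not an option, the same idea can be approximated with the standard library alone. This is not what ijson does internally; it is a rough sketch that swaps in json.JSONDecoder.raw_decode over a growing text buffer. It assumes the response contains the literal text "rows":[ with no whitespace inside it (as in the output shown in the question) and that every element of the array is a JSON object. The name iter_rows and its parameters are hypothetical.

```python
import codecs
import json


def iter_rows(fileobj, chunk_size=65536, marker='"rows":['):
    """Yield each element of the "rows" array from a binary file-like object."""
    to_text = codecs.getincrementaldecoder("utf-8")()  # safe across chunk borders
    parser = json.JSONDecoder()
    buffer = ""
    in_rows = False
    while True:
        if not in_rows:
            start = buffer.find(marker)
            if start != -1:
                buffer = buffer[start + len(marker):]
                in_rows = True
        if in_rows:
            while True:
                buffer = buffer.lstrip(", \r\n\t")  # separators between rows
                if buffer.startswith("]"):
                    return  # end of the rows array
                try:
                    row, end = parser.raw_decode(buffer)
                except json.JSONDecodeError:
                    break  # the row is still incomplete; read more bytes
                yield row
                buffer = buffer[end:]
        chunk = fileobj.read(chunk_size)
        if not chunk:
            return  # stream ended
        buffer += to_text.decode(chunk)
```

It could be called as `for row in iter_rows(urlopen('https://replicate.npmjs.com/_all_docs')): ...`. An incomplete row is re-parsed every time a new chunk arrives, so very large documents are handled less efficiently than with a real incremental parser.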

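Is there a reason why you aren't decoding the JSON before iterating over the lines?

Can you try this:

import requests

# include_docs is a CouchDB query parameter (it would go in params=, not
# headers=); it is left out here because the full documents would make
# the response far larger.
r = requests.get('https://replicate.npmjs.com/_all_docs')
r.raise_for_status()

# r.json() decodes and parses the body. Note that this loads the whole
# response into memory at once.
data = r.json()

# The parsed result is a dict, so use subscripting, not attribute access.
for row in data['rows']:
    print(row['key'])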
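However the rows are obtained, the plan from the question (keep a few keys and put them in a database) could look roughly like this. SQLite, the table name packages, and the choice of id and rev as the keys to keep are all assumptions made for illustration; the question does not name a database or the keys of interest.

```python
import sqlite3


def store_rows(connection, rows, batch_size=1000):
    """Insert (id, rev) pairs from an iterable of row dicts; return the count."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS packages (id TEXT PRIMARY KEY, rev TEXT)"
    )
    insert = "INSERT OR REPLACE INTO packages VALUES (?, ?)"
    batch, total = [], 0
    for row in rows:
        batch.append((row["id"], row["value"]["rev"]))
        if len(batch) >= batch_size:
            # Write in batches so memory use stays bounded for a huge stream
            connection.executemany(insert, batch)
            connection.commit()
            total += len(batch)
            batch = []
    if batch:
        connection.executemany(insert, batch)
        connection.commit()
        total += len(batch)
    return total
```

Because rows can be any iterable, a generator that streams from the endpoint can be passed in directly, so the full list never has to exist in memory.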
