
Python: getting npm package data from a CouchDB endpoint

I want to fetch npm package metadata. I found this endpoint, which gives me all the metadata I need.

My plan is to select some specific keys and add that data to a database (I could also store it in a JSON file, but the data is huge). I made the following script to fetch the data:

import requests
import json
import sys

db = 'https://replicate.npmjs.com';

r = requests.get('https://replicate.npmjs.com/_all_docs', headers={"include_docs" : "true"})

for line in r.iter_lines():
    # filter out keep-alive new lines
    if line:
        print(line)
        decoded_line = line.decode('utf-8')
        print(json.loads(decoded_line))

Notice that I don't even include the docs themselves, but it still gets stuck in an infinite loop. I think this is because the data is huge.

A look at the head of the output from https://replicate.npmjs.com/_all_docs gives me the following:

{"total_rows":1017703,"offset":0,"rows":[
{"id":"0","key":"0","value":{"rev":"1-5fbff37e48e1dd03ce6e7ffd17b98998"}},
{"id":"0-","key":"0-","value":{"rev":"1-420c8f16ec6584c7387b19ef401765a4"}},
{"id":"0----","key":"0----","value":{"rev":"1-55f4221814913f0e8f861b1aa42b02e4"}},
{"id":"0-1-project","key":"0-1-project","value":{"rev":"1-3cc19950252463c69a5e717d9f8f0f39"}},
{"id":"0-100","key":"0-100","value":{"rev":"1-c4f41a37883e1289f469d5de2a7b505a"}},
{"id":"0-24","key":"0-24","value":{"rev":"1-e595ec3444bc1039f10c062dd86912a2"}},
{"id":"0-60","key":"0-60","value":{"rev":"2-32c17752acfe363fa1be7dbd38212b0a"}},
{"id":"0-9","key":"0-9","value":{"rev":"1-898c1d89f7064e58f052ff492e94c753"}},
{"id":"0-_-0","key":"0-_-0","value":{"rev":"1-d47c142e9460c815c19c4ed3355d648d"}},
{"id":"0.","key":"0.","value":{"rev":"1-11c33605f2e3fd88b5416106fcdbb435"}},
{"id":"0.0","key":"0.0","value":{"rev":"1-5e541d4358c255cbcdba501f45a66e82"}},
{"id":"0.0.1","key":"0.0.1","value":{"rev":"1-ce856c27d0e16438a5849a97f8e9671d"}},
{"id":"0.0.168","key":"0.0.168","value":{"rev":"1-96ab3047e57ca1573405d0c89dd7f3f2"}},
{"id":"0.0.250","key":"0.0.250","value":{"rev":"1-c07ad0ffb7e2dc51bfeae2838b8d8bd6"}}, 

Notice that all the documents start from the second line (i.e. all the documents are values under the "rows" key). Now, my question is how to get only the values of the "rows" key (i.e. all the documents). I found this repository for a similar purpose, but I can't use or convert it, as I am a total beginner in JavaScript.

If there is no stream=True among the arguments of get(), then the whole response will be downloaded into memory before the loop over the lines even starts.
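
For illustration, here is a minimal sketch of the streaming variant of the original request (this alone does not fix the parsing problem described next):

import requests

# stream=True makes requests fetch the body lazily as we iterate,
# instead of loading all of it into memory up front.
r = requests.get('https://replicate.npmjs.com/_all_docs', stream=True)
for line in r.iter_lines():
    if line:  # filter out keep-alive new lines
        print(line.decode('utf-8'))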

Then there is the problem that the individual lines are not valid JSON on their own. You'll need an incremental JSON parser like ijson for this. ijson in turn wants a file-like object, which isn't easily obtained from a requests.Response, so I will use urllib from the Python standard library here:

#!/usr/bin/env python3
from urllib.request import urlopen

import ijson


def main():
    with urlopen('https://replicate.npmjs.com/_all_docs') as json_file:
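        # 'rows.item' matches each element of the top-level "rows" array,
        # so documents are yielded one at a time without parsing the whole body.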
        for row in ijson.items(json_file, 'rows.item'):
            print(row)


if __name__ == '__main__':
    main()
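
A file-like object can in fact be coaxed out of requests as well, via the raw attribute of a streamed response — a sketch under that assumption, in case you would rather not switch to urllib:

#!/usr/bin/env python3
import ijson
import requests


def main():
    # stream=True keeps the body out of memory; r.raw is a file-like
    # urllib3 response, but it must be told to undo any gzip encoding.
    with requests.get('https://replicate.npmjs.com/_all_docs', stream=True) as r:
        r.raw.decode_content = True
        for row in ijson.items(r.raw, 'rows.item'):
            print(row)


if __name__ == '__main__':
    main()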

Is there a reason why you aren't decoding the JSON before iterating over the lines?

Can you try this:

import requests

db = 'https://replicate.npmjs.com'

# include_docs is a query parameter, not a header
r = requests.get(db + '/_all_docs', params={"include_docs": "true"})

# a Response has no decode(); r.json() parses the body directly
data = r.json()

for row in data['rows']:  # the parsed document is a dict, not an object
    print(row['key'])
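
Note that this parses the whole response in memory at once, so for the full _all_docs listing the streaming approach above will scale much better.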
