How to append a Dask bag to another Dask bag?

I am using Pandas to fetch around 2 million records from an API that returns JSON objects. The API has a limit of returning only 5000 JSON objects at a time, so I iterate over the API calls to fetch the JSONs. These are the steps that I follow:

1. Get all the record IDs in a list.
2. Create API calls (URLs) by breaking the record IDs into chunks of 5000 each.
3. Iterate over the created URLs to fetch the JSONs.
4. Create a list of the JSONs that were fetched above.
5. Use pd.io.json.json_normalize to create the dataframe.

The problem is that I am running out of memory if I exceed a certain limit of records to be fetched. I am trying to use Dask to help with the memory issue. However, I am unable to figure out how to use Dask bags to perform a similar function to lists (e.g. append). Or, how do I add more JSONs returned by the iterative API calls onto the same Dask bag?

This is the code that I am using, and it works fine for smaller datasets:

import pandas as pd
import json
import requests
import getpass

# Prompt for the credentials used in the API calls below
username = input('Username: ')
password = getpass.getpass()

# Specify the date range and system for which the recordIDs need to be fetched
recordIDsURL = 'http://example.com:8071/records/getIds?system=ABC&daterange=2019-01-15,2019-10-15'

# Specify the record service API which returns the record info for provided record ids
recordServiceURL = 'http://example:8071/records/'

# Get the recordIds for the provided date range and system
request = requests.get(recordIDsURL, auth = requests.auth.HTTPBasicAuth(username, password))

# Put the recordIds into a list
listid = request.json()

# Divide the recordIDs into smaller lists containing 5000 recordIDs 
listChunks = [listid[x:x+5000] for x in range(0, len(listid), 5000)]

# Make a list for the distinct URLs for calling the API
url = [0 for i in range(len(listChunks))]

# Make a list for storing the result of the URL calls
recordRequest = [0 for i in range(len(listChunks))]

# Make a list for converting the result of the URL calls into a list of JSONs
jsonList = [0 for i in range(len(listChunks))]

# Iterate over the URL calls 
for i in range(len(listChunks)):
    url[i] = recordServiceURL + (','.join(listChunks[i]))
    recordRequest[i] = requests.get(url[i], auth = requests.auth.HTTPBasicAuth(username, password))
    jsonList[i] = recordRequest[i].json()

# Merge the JSON lists into a single list of records to load into the DF
mergeJson = []
for i in jsonList:
    mergeJson += i

df = pd.io.json.json_normalize(mergeJson)

In a nutshell, I am hoping to use Dask bags and a Dask dataframe in place of the Python list and Pandas dataframe in the above code.

You can concatenate many Dask Bags with the dask.bag.concat function.
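
A minimal sketch of that approach, reusing recordServiceURL, listChunks, username, and password from the question above; the fetch_chunk helper is hypothetical, and to_dataframe assumes each record is a flat dict (nested JSON would still need flattening first):

import dask
import dask.bag as db
import requests

@dask.delayed
def fetch_chunk(url):
    # Hypothetical helper: lazily fetch one chunk of up to 5000 records
    response = requests.get(url, auth = requests.auth.HTTPBasicAuth(username, password))
    return response.json()

# One small bag per API call; nothing is downloaded at this point
urls = [recordServiceURL + (','.join(chunk)) for chunk in listChunks]
bags = [db.from_delayed([fetch_chunk(u)]) for u in urls]

# dask.bag.concat merges the partial bags into a single bag of JSON records
records = db.concat(bags)

# Convert the bag of record dicts into a Dask dataframe
ddf = records.to_dataframe()

Because the bag is lazy, the API calls only run when the result is actually computed, so the full set of 2 million JSON objects never has to sit in one in-memory Python list at once.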
