How to append a Dask bag to another Dask bag?

I am using Pandas to fetch around 2 million records from an API that returns JSON objects. The API has a limit of returning only 5000 JSON objects at a time, so I iterate over the API calls to fetch the JSONs. These are the steps that I follow:

1. Get all the record IDs in a list.
2. Create API calls (URLs) by breaking the record IDs into chunks of 5000 each.
3. Iterate over the created URLs to fetch the JSONs.
4. Create a list of the JSONs that were fetched above.
5. Use pd.io.json.json_normalize to create the dataframe.

The problem is that I am running out of memory if I exceed a certain limit of records to be fetched. I am trying to use Dask to help with the memory issue. However, I am unable to figure out how to use Dask bags to perform a similar function to lists (e.g. append). Or, how do I add more JSONs returned by the iterative API calls onto the same Dask bag?

This is the code that I am using, and it works fine for smaller datasets:

import pandas as pd
import json
import requests
import getpass

# Prompt for the credentials used in the API calls below
username = input('Username: ')
password = getpass.getpass()

# Specify the date range and system for which the recordIDs need to be fetched
recordIDsURL = 'http://example.com:8071/records/getIds?system=ABC&daterange=2019-01-15,2019-10-15'

# Specify the record service API which returns the record info for provided record ids
recordServiceURL = 'http://example:8071/records/'

# Get the recordIds for the provided date range and system
request = requests.get(recordIDsURL, auth = requests.auth.HTTPBasicAuth(username, password))

# Put the recordIds into a list
listid = request.json()

# Divide the recordIDs into smaller lists containing 5000 recordIDs 
listChunks = [listid[x:x+5000] for x in range(0, len(listid), 5000)]

# Make a list for the distinct URLs for calling the API
url = [0 for i in range(len(listChunks))]

# Make a list for storing the result of the URL calls
recordRequest = [0 for i in range(len(listChunks))]

# Make a list for converting the result of the URL calls into a list of JSONs
jsonList = [0 for i in range(len(listChunks))]

# Iterate over the URL calls 
for i in range(len(listChunks)):
    url[i] = recordServiceURL + (','.join(listChunks[i]))
    recordRequest[i] = requests.get(url[i], auth = requests.auth.HTTPBasicAuth(username, password))
    jsonList[i] = recordRequest[i].json()

# Merge the JSON lists into a single list of records to load into the DF
mergeJson = []
for i in jsonList:
    mergeJson += i

df = pd.io.json.json_normalize(mergeJson)

In a nutshell, I am hoping to use Dask bags and a Dask dataframe in place of the Python list and Pandas dataframe in the above code.

You can concatenate many Dask Bags with the dask.bag.concat function.
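
A minimal sketch of that approach, reusing recordServiceURL, listChunks, username, and password from the question above; the fetch_chunk helper is hypothetical, and to_dataframe assumes each record is a flat dict (nested JSON would still need flattening first):

import dask
import dask.bag as db
import requests

@dask.delayed
def fetch_chunk(url):
    # Hypothetical helper: lazily fetch one chunk of up to 5000 records
    response = requests.get(url, auth = requests.auth.HTTPBasicAuth(username, password))
    return response.json()

# One small bag per API call; nothing is downloaded at this point
urls = [recordServiceURL + (','.join(chunk)) for chunk in listChunks]
bags = [db.from_delayed([fetch_chunk(u)]) for u in urls]

# dask.bag.concat merges the partial bags into a single bag of JSON records
records = db.concat(bags)

# Convert the bag of record dicts into a Dask dataframe
ddf = records.to_dataframe()

Because the bag is lazy, the API calls only run when the result is actually computed, so the full set of 2 million JSON objects never has to sit in one in-memory Python list at once.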
