
Handling large JSON data in Python

My JSON file (~500 MB) contains multiple JSON objects, and I actually only need the "customer_id" column. When I execute the code below, it gives a memory error.

import json
import pandas as pd

with open('online_pageviews.json') as f:
    online_pageviews = pd.DataFrame(json.loads(line) for line in f)

Here is an example of a JSON object in "online_pageviews.json":

{
"date": "2018-08-01",
"visitor_id": "3832636531373538373137373",
"deviceType": "mobile",
"pageType": "product",
"category_id": "6365313034",
"on_product_id": "323239323839626",
"customer_id": "33343163316564313264"
}

Is there a way to only use the "customer_id" column? What can I do to load this file?

You should be able to do this if you manage the amount of data you actually have floating around. Since you only need the customer ID, don't bother loading any of the other data into your dataframe.

import json
import pandas as pd

customer_id_array = []
with open('online_pageviews.json') as f:
    for line in f:
        # Parse each line as a JSON object and keep only the customer_id value
        customer_id_array.append(json.loads(line)['customer_id'])
online_pageviews = pd.DataFrame(customer_id_array, columns=['customer_id'])

This way you can significantly cut down on how much extra memory you were previously using.

(I'm not sure if your system will be able to handle this, as customer_id_array can still get pretty big, but it should be much better than before. If it cannot, you may need to look for some online options for renting systems with more memory.)
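If building the list in a plain loop is still too heavy, another option is to let pandas read the newline-delimited file in chunks. This is only a sketch, assuming a reasonably recent pandas version; the chunk size of 100,000 lines is an arbitrary placeholder you would tune to your machine.

import pandas as pd

# Read the newline-delimited JSON file in chunks so the whole file is
# never parsed into memory at once (100,000 lines per chunk is arbitrary)
chunks = pd.read_json('online_pageviews.json', lines=True, chunksize=100_000)

# Keep only the customer_id column from each chunk, then concatenate
online_pageviews = pd.concat(chunk[['customer_id']] for chunk in chunks)

Peak memory then stays roughly around one parsed chunk plus the accumulated customer_id values.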
