
Efficient way of extracting data from a JSON file and transforming it into a DataFrame

Imagine I have a list containing multiple dictionaries.

Sample dict

{'city': None, 'bot-origin': None, 'campaign-source': 'attendance bot', 'lastState': 'productAvailabilityCpfValidationTrue', 'main-installation-date': None, 'userid': '00377a70-fc79-424e-80c3-1f0324094378@tunnel.msging.net', 'full-name': None, 'alternative-installation-date': None, 'chosen-product': 'Internet', 'bank': None, 'postalcode': '82100690', 'due-date': None, 'cpf': '07670115971', 'origin-link': '', 'payment': None, 'state': None, 'api-orders-hash-id': None, 'email': None, 'api-orders-error': None, 'plan-name': None, 'userphone': '41 9893-6613', 'plan-offer': None, 'completed-address': None, 'type-of-person': 'CPF', 'onboarding-simplified': None, 'type-of-product': 'Residencial', 'main-installation-period-day': None, 'plan-value': None, 'alternative-installation-period-day': None, 'data-change': 'false'}

The list contains around 9,000,000 events such as the one displayed.

What I want to do is basically break them apart into some kind of dataframe format such as pd.DataFrame() (I don't insist on it). Unfortunately, commands such as pd.json_normalize(), read_json, from_records and so on seem to consume all my memory. My approach is to use some sort of chunksize: split the list/series into chunks, load them into variables, put them into DataFrame format, save them, clear the memory, and after that concatenate everything, so my PC doesn't crash while trying to load everything at once.
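For reference, if the events live on disk as one JSON object per line (JSON Lines), pandas can already iterate the file in chunks via read_json; this is only a sketch assuming a hypothetical extras.jsonl file, which is not exactly my setup, but it illustrates the chunked-reading idea:

import pandas as pd

# Assumed file name and format: one JSON object per line (JSON Lines)
reader = pd.read_json("extras.jsonl", lines=True, chunksize=100000)
for chunk_df in reader:
    # process or save each 100k-row DataFrame, then let it go out of scope
    print(chunk_df.shape)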

Here is my attempt

def forma_extras(extras):
    # extras = serialized JSON records, as a pandas Series
    for i in range(0, len(extras), 100):
        # Having a little trouble here
        pass

My solution was something like the following. At least my computer doesn't crash when I run it. Whether this is the most efficient approach I am not sure; maybe saving each chunk and just taking it out of memory would be better. I will be doing the following steps afterwards.

import pandas as pd

def forma_extras(extras):
    chunk = 100000
    l_extra = []
    for i in range(0, len(extras), chunk):
        end = i + chunk
        # Turn one slice of records at a time into its own DataFrame
        l_extra.append(pd.DataFrame.from_records(extras[i:end]))
    return l_extra
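Since saving each chunk and dropping it from memory might be better, as mentioned above, here is a minimal sketch of that follow-up step. The function name, the file naming scheme, and the parquet format are my own placeholders (parquet needs pyarrow or fastparquet installed); any on-disk format would do.

import gc
import pandas as pd

def forma_extras_to_disk(extras, chunk=100000, out_prefix="extras_chunk"):
    # Hypothetical helper: write each chunk to disk instead of keeping it in RAM
    paths = []
    for i in range(0, len(extras), chunk):
        df = pd.DataFrame.from_records(extras[i:i + chunk])
        path = f"{out_prefix}_{i // chunk}.parquet"
        df.to_parquet(path)  # assumes pyarrow or fastparquet is available
        paths.append(path)
        del df               # free the chunk before building the next one
        gc.collect()
    return paths

# Later, only when the full table is really needed:
# full = pd.concat((pd.read_parquet(p) for p in paths), ignore_index=True)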
