Convert list of list of dicts to dataframe

I have large JSON data that is read into a pandas dataframe, which creates a list of dicts per row. I need to convert it into a different format.

The data format is the following:

{
    "data": [{
            "item": [{
                    "value": 0,
                    "type": "a"
                },
                {
                    "value": 0,
                    "type": "b"
                },
                {
                    "value": 70,
                    "type": "c"
                }
            ],
            "timestamp": "2019-01-12T04:52:06.669Z"
        },
        {
            "item": [{
                    "value": 30,
                    "type": "a"
                },
                {
                    "value": 0,
                    "type": "b"
                }
            ],
            "timestamp": "2019-01-12T04:53:06.669z"
        }
    ]
}

What would be the most efficient way of converting the data to a dataframe of the form:

timestamp                   a     b     c
2019-01-12T04:52:06.669Z    0     0     70
2019-01-12T04:53:06.669Z    30    0     0

So far I have managed to do it using for loops, but it is very inefficient and slow. What I have so far is this:

import json
import pandas

with open('try.json') as f:
    data = json.load(f)

df_data = pandas.DataFrame(data['data'])
df_formatted = pandas.DataFrame(columns=['a','b','c'])

for d, timestamp in zip(df_data['item'], df_data['timestamp']):
    row = dict()
    for entry in d:
        category = entry['type']
        value = entry['value']
        row[category] = value
    row['timestamp'] = timestamp
    # Note: DataFrame.append was removed in pandas 2.0, so this line only runs on older versions.
    df_formatted = df_formatted.append(row, ignore_index=True)
df = df_formatted.fillna(0)

The number of items in the list is often in the several thousands. Any pointers or examples on how to do this efficiently?

You can unpack the nested JSON object by iterating over the objects. Try:

import pandas as pd
a=[
      {
       "item": [
          {
            "value": 0,
            "type": "a"
          },
          {
            "value": 0,
            "type": "b"
          },
          {
            "value": 70,
            "type": "c"
          },
        ],
        "timestamp": "2019-01-12T04:52:06.669Z"
     },
     {
        "item": [
          {
            "value": 30,
            "type": "a"
          },
          {
            "value": 0,
            "type": "b"
          }
        ],
        "timestamp": "2019-01-12T04:53:06.669z"
      }
]


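rows = []
for data in a:
    timestamp = data['timestamp']
    for row in data['item']:
        # Attach the parent timestamp to each item (this adds the key to the dicts in `a`).
        row['timestamp'] = timestamp
        rows.append(row)

# Long format: one row per item, with columns value, type, timestamp.
df = pd.DataFrame(rows)
# Wide format: one column per type; a type missing for a timestamp becomes NaN.
df = df.pivot_table(index='timestamp', columns=['type'], values=['value']).reset_index()
df.columns = ['timestamp', 'a', 'b', 'c']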

If you are looking for a compact solution, use json_normalize:

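# pd.json_normalize is the top-level name since pandas 1.0
# (on older versions: from pandas.io.json import json_normalize).
# Normalize each item list and attach its parent timestamp explicitly.
frames = [pd.json_normalize(d['item']).assign(timestamp=d['timestamp']) for d in a]
df = pd.concat(frames, ignore_index=True)
df = df.pivot_table(index='timestamp', columns=['type'], values=['value']).reset_index()
df.columns = ['timestamp', 'a', 'b', 'c']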

Final output:

timestamp                   a       b       c
2019-01-12T04:52:06.669Z    0.0     0.0     70.0
2019-01-12T04:53:06.669z    30.0    0.0     NaN
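
The output above leaves NaN where a type is missing, while the question asks for 0. A minimal sketch of one way to get that, assuming pandas 1.0 or later: let json_normalize attach the timestamp through its record_path and meta parameters, and pass fill_value=0 to pivot_table. The names long_df and wide_df are only illustrative, and aggfunc="first" assumes each type appears at most once per timestamp.

```python
import pandas as pd

data = {
    "data": [
        {"item": [{"value": 0, "type": "a"},
                  {"value": 0, "type": "b"},
                  {"value": 70, "type": "c"}],
         "timestamp": "2019-01-12T04:52:06.669Z"},
        {"item": [{"value": 30, "type": "a"},
                  {"value": 0, "type": "b"}],
         "timestamp": "2019-01-12T04:53:06.669z"},
    ]
}

# Long format: one row per item, with the parent timestamp copied onto each row.
long_df = pd.json_normalize(data["data"], record_path="item", meta=["timestamp"])

# Wide format: one column per type; a type missing for a timestamp becomes 0.
wide_df = long_df.pivot_table(
    index="timestamp", columns="type", values="value",
    aggfunc="first", fill_value=0,
).reset_index()
wide_df.columns.name = None  # drop the leftover "type" label on the column axis

print(wide_df)
```

Whether the value columns come out as integers or floats may depend on the pandas version, so cast with astype(int) if the dtype matters.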

You can extract a list of dictionaries from the JSON and feed it into a dataframe. Code could be:

df = pd.DataFrame([dict([('timestamp', d['timestamp']), ('a', 0),
                         ('b', 0), ('c', 0)]
                        + [(item['type'], item['value'])
                           for item in d['item']])for d in data['data']],
                  columns=['timestamp', 'a', 'b', 'c'])

print(df)

This outputs, as expected:

                  timestamp   a  b   c
0  2019-01-12T04:52:06.669Z   0  0  70
1  2019-01-12T04:53:06.669z  30  0   0

The trick here is to first build a list of pairs with default values and then extend it with the actual values before building a dict from it. Because dict keeps the last value seen for each key, any actual value overrides its default, and you end up with a dictionary containing all relevant values.

The columns parameter is only there to ensure the expected column order.
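
To see the default-then-override trick in isolation, here is a small sketch for a single record (the second one from the question, which has no "c" item). The variable names defaults, items, pairs and row are only for illustration.

```python
# Default pairs come first, so every expected key is present.
defaults = [("a", 0), ("b", 0), ("c", 0)]

# Items of the second record: there is no entry for type "c".
items = [{"value": 30, "type": "a"}, {"value": 0, "type": "b"}]

# Actual pairs are appended after the defaults.
pairs = defaults + [(item["type"], item["value"]) for item in items]

# dict() keeps the last value seen for each key, so actual values win.
row = dict(pairs)
print(row)  # {'a': 30, 'b': 0, 'c': 0}
```

The same effect can be had by merging two dicts, with the defaults first and the actual values second.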
