Convert a list of dicts to a DataFrame

Question

I have large JSON data that is read into a pandas DataFrame, which leaves a list of dicts in each row. I need to convert it into a different format.

The data format is the following:

{
    "data": [{
            "item": [{
                    "value": 0,
                    "type": "a"
                },
                {
                    "value": 0,
                    "type": "b"
                },
                {
                    "value": 70,
                    "type": "c"
                }
            ],
            "timestamp": "2019-01-12T04:52:06.669Z"
        },
        {
            "item": [{
                    "value": 30,
                    "type": "a"
                },
                {
                    "value": 0,
                    "type": "b"
                }
            ],
            "timestamp": "2019-01-12T04:53:06.669z"
        }
    ]
}

What would be the most efficient way of converting the data to a dataframe of the form:

timestamp                   a       b       c
2019-01-12T04:52:06.669Z    0       0       70
2019-01-12T04:53:06.669Z    30      0       0

So far I have managed to do it using for loops, but it's very inefficient and slow. This is what I have so far:

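import json
import pandas

with open('try.json') as f:
    data = json.load(f)

df_data = pandas.DataFrame(data['data'])
df_formatted = pandas.DataFrame(columns=['a','b','c'])

for d, timestamp in zip(df_data['item'], df_data['timestamp']):
    # Collect one output row: {type: value, ..., 'timestamp': ...}
    row = dict()
    for entry in d:
        category = entry['type']
        value = entry['value']
        row[category] = value
    row['timestamp'] = timestamp
    # DataFrame.append was removed in pandas 2.0; pandas.concat is the equivalent
    # (and equally slow) way to grow the frame one row at a time.
    df_formatted = pandas.concat([df_formatted, pandas.DataFrame([row])], ignore_index=True)
df = df_formatted.fillna(0)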

The number of items in the list is often in the several thousands. Any pointers or examples on how to do this efficiently?

Answer 1

You can unpack the nested JSON object by iterating over the objects. Try:

import pandas as pd
a=[
      {
       "item": [
          {
            "value": 0,
            "type": "a"
          },
          {
            "value": 0,
            "type": "b"
          },
          {
            "value": 70,
            "type": "c"
          },
        ],
        "timestamp": "2019-01-12T04:52:06.669Z"
     },
     {
        "item": [
          {
            "value": 30,
            "type": "a"
          },
          {
            "value": 0,
            "type": "b"
          }
        ],
        "timestamp": "2019-01-12T04:53:06.669z"
      }
]


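# Flatten: one row per item, each carrying its parent's timestamp.
rows = []
for data in a:
    data_row = data['item']
    timestamp = data['timestamp']
    for row in data_row:
        # Note: this adds a 'timestamp' key to the dicts inside `a` in place.
        row['timestamp'] = timestamp
        rows.append(row)

df = pd.DataFrame(rows)
# Pivot to one row per timestamp and one column per type.
df = df.pivot_table(index='timestamp', columns=['type'], values=['value']).reset_index()
# Flatten the resulting MultiIndex columns.
df.columns = ['timestamp', 'a', 'b', 'c']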

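If you are looking for a compact solution, use json_normalize with record_path and meta, so that each item keeps its parent's timestamp:

import pandas as pd

# pandas.io.json.json_normalize is deprecated since pandas 1.0; use pd.json_normalize.
# record_path expands each 'item' list into rows; meta copies 'timestamp' onto every row.
df = pd.json_normalize(a, record_path='item', meta=['timestamp'])
df = df.pivot_table(index='timestamp', columns=['type'], values=['value']).reset_index()
df.columns = ['timestamp', 'a', 'b', 'c']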

Final output:

timestamp                   a       b       c
2019-01-12T04:52:06.669Z    0.0     0.0     70.0
2019-01-12T04:53:06.669z    30.0    0.0     NaN
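The output above leaves NaN where a type is missing for a timestamp, whereas the question asks for 0. A minimal sketch of one way to close that gap, assuming missing types should count as 0 and all values are integers: pass fill_value=0 to pivot_table and cast back to int. The names records, rows and wide are only illustrative, and the flattening step here builds new dicts instead of modifying the input.

```python
import pandas as pd

# Sample records in the same shape as the question's "data" list.
records = [
    {"item": [{"value": 0, "type": "a"},
              {"value": 0, "type": "b"},
              {"value": 70, "type": "c"}],
     "timestamp": "2019-01-12T04:52:06.669Z"},
    {"item": [{"value": 30, "type": "a"},
              {"value": 0, "type": "b"}],
     "timestamp": "2019-01-12T04:53:06.669z"},
]

# Flatten to one row per item without mutating the input dicts.
rows = [{"timestamp": rec["timestamp"], **item}
        for rec in records for item in rec["item"]]

wide = (pd.DataFrame(rows)
          .pivot_table(index="timestamp", columns="type",
                       values="value", fill_value=0)  # missing types become 0
          .astype(int)                                # assumption: values are integers
          .reset_index())
wide.columns.name = None  # drop the leftover "type" label on the column index
print(wide)
```

Note that pivot_table aggregates duplicates (mean by default), so if the same type can appear twice under one timestamp, an explicit aggfunc may be a better fit.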

Answer 2

You can extract a list of dictionaries from the JSON and feed it into a dataframe. The code could be:

df = pd.DataFrame([dict([('timestamp', d['timestamp']), ('a', 0),
                         ('b', 0), ('c', 0)]
                        + [(item['type'], item['value'])
                           for item in d['item']])for d in data['data']],
                  columns=['timestamp', 'a', 'b', 'c'])

print(df)

This outputs, as expected:

                  timestamp   a  b   c
0  2019-01-12T04:52:06.669Z   0  0  70
1  2019-01-12T04:53:06.669z  30  0   0

The trick here is to first build a list of pairs with default values and then extend it with the actual values before building a dict from it. Because the last value seen for each key is kept, you end up with a dictionary containing all the relevant values.

The columns parameter is only there to ensure the expected column order.
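To make the "last value wins" behaviour concrete, here is a small sketch for a single record. The names defaults, items, pairs and row are illustrative and not part of the answer's code.

```python
# Defaults for every expected column, then the values actually present.
defaults = {"a": 0, "b": 0, "c": 0}
items = [{"value": 30, "type": "a"}, {"value": 0, "type": "b"}]

# Default pairs come first, observed pairs last; dict() keeps the last value per key.
pairs = list(defaults.items()) + [(it["type"], it["value"]) for it in items]
row = dict(pairs)
print(row)  # {'a': 30, 'b': 0, 'c': 0}
```

Here "c" never appears in items, so its default of 0 survives, while the default for "a" is overwritten by 30.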

2 answers

Answer 1: score 2, accepted, 2019-02-14 15:19:17

Answer 2: score 0, 2019-02-14 16:00:53
