
How to quickly un-nest a Pandas dataframe

A JSON file I need to work with imports into a dataframe with lists nested inside; before converting to a dataframe it is a list of nested dicts. The file itself is nested.

Sample JSON:

{
  "State": [
    {
      "ts": "2018-04-11T21:37:05.401Z",
      "sensor": [
        "accBodyX_ftPerSec2"
      ],
      "value": null
    },
    {
      "ts": "2018-04-11T21:37:05.901Z",
      "sensor": [
        "accBodyX_ftPerSec2"
      ],
      "value": [
        -3.38919
      ]
    },
    {
      "ts": "2018-04-11T21:37:05.901Z",
      "sensor": [
        "accBodyY_ftPerSec2"
      ],
      "value": [
        -2.004781
      ]
    },
    {
      "ts": "2018-04-11T21:37:05.901Z",
      "sensor": [
        "accBodyZ_ftPerSec2"
      ],
      "value": [
        -34.77694
      ]
    }
  ]
}

The dataframe looks like:

    sensor                  ts                          value
0   [accBodyX_ftPerSec2]    2018-04-11T21:37:05.901Z    [-3.38919]
1   [accBodyY_ftPerSec2]    2018-04-11T21:37:05.901Z    [-2.004781]
2   [accBodyZ_ftPerSec2]    2018-04-11T21:37:05.901Z    [-34.77694]

Ultimately, I'd like to remove the nesting or find a way to work with it. The goal is to extract a list of values for a given sensor name, with accompanying timestamps, into another dataframe for processing/plotting, something like this:

    ts                         value
0   2018-04-11T21:37:05.901Z   -3.38919
1   2018-04-11T21:37:06.401Z   -3.00241
2   2018-04-11T21:37:06.901Z   -3.87694

To remove the nesting I've done this; it is slow on just 100,000 rows, but thankfully much faster than a for loop. (Made possible thanks to this post: python pandas operations on columns.)

def func(row):
    row.sensor = row.sensor[0]
    if type(row.value) is list:
        row.value = row.value[0]
    return row

df.apply(func, axis=1)

For working with the nesting I'm able to extract individual values. For example:

print( df.iloc[:,2].iloc[1][0] )
-2.004781

However, trying to return a list of values from index 0 of each list within each row results in returning just the first value:

print( df.iloc[:,2].iloc[:][0] )
-3.38919

Of course I could do this with a for loop, but I know there's a way to do it with Pandas functions that I'm not able to discover yet.

You may need to just do some manual cleaning-up before reading into a DataFrame:

>>> import json
>>> import pandas as pd


>>> def collapse_lists(data):
...     return [{k: v[0] if (isinstance(v, list) and len(v) == 1)
...             else v for k, v in d.items()} for d in data]


>>> with open('state.json') as f:
...     data = pd.DataFrame(collapse_lists(json.load(f)['State']))

>>> data
               sensor                        ts      value
0  accBodyX_ftPerSec2  2018-04-11T21:37:05.401Z        NaN
1  accBodyX_ftPerSec2  2018-04-11T21:37:05.901Z  -3.389190
2  accBodyY_ftPerSec2  2018-04-11T21:37:05.901Z  -2.004781
3  accBodyZ_ftPerSec2  2018-04-11T21:37:05.901Z -34.776940

This loads the JSON file into a Python list of dictionaries, converts any length-1 lists into scalar values, and then loads that result into a DataFrame. That is admittedly not the most efficient means, but your other option of parsing the JSON itself is probably overkill unless the file is massive.
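
As an aside, if the nested DataFrame from the question has already been built, the same flattening can be done column-wise with the Series.str element accessor instead of a row-wise apply. This is only a sketch; it assumes every non-null list has length 1 and that df is the nested frame from the question (behaviour of .str indexing on list columns may vary slightly by pandas version):

>>> df['sensor'] = df['sensor'].str[0]  # element 0 of each single-item list
>>> df['value'] = df['value'].str[0]    # null entries come through as NaN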

Finally, to convert to datetime:

>>> data['ts'] = pd.to_datetime(data['ts'])

>>> data.dtypes
sensor            object
ts        datetime64[ns]
value            float64
dtype: object

You may also want to consider converting sensor to a categorical data type to save a possibly significant amount of memory:

The memory usage of a Categorical is proportional to the number of categories plus the length of the data. In contrast, an object dtype is a constant times the length of the data. (source)
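
The conversion itself is a one-liner, shown here against the data frame built above:

>>> data['sensor'] = data['sensor'].astype('category')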


In explicit-loop form, collapse_lists would look like:

def collapse_lists(data):
    result = []
    for d in data:
        entry = {}
        for k, v in d.items():
            if isinstance(v, list) and len(v) == 1:
                entry.update({k: v[0]})
            else:
                entry.update({k: v})
        result.append(entry)
    return result
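
Either way, once the frame is flat, pulling one sensor's readings with their timestamps (the stated goal) is just a boolean mask. A minimal sketch, assuming the data frame built above:

>>> mask = data['sensor'] == 'accBodyX_ftPerSec2'
>>> acc_x = data.loc[mask, ['ts', 'value']].reset_index(drop=True)
>>> acc_x.plot(x='ts', y='value')  # plotting requires matplotlib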

In case you ever have multiple values/sensors in a single entry, here's some code that might help.

The test JSON (modified to have multiple values/sensors):

{
    "State": [
        {
            "ts": "2018-04-11T21:37:05.401Z",
            "sensor": [
                "accBodyX_ftPerSec2"
            ],
            "value": null
        },
        {
            "ts": "2018-04-11T21:37:05.100Z",
            "sensor": [
                "accBodyX_ftPerSec2",
                "accBodyY_ftPerSec2"
            ],
            "value": null
        },
        {
            "ts": "2018-04-11T21:37:05.901Z",
            "sensor": [
                "accBodyX_ftPerSec2"
            ],
            "value": [
                -3.38919
            ]
        },
        {
            "ts": "2018-04-11T21:37:05.901Z",
            "sensor": [
                "accBodyY_ftPerSec2"
            ],
            "value": [
                 -2.004781
            ]
        },
        {
            "ts": "2018-04-11T21:37:05.901Z",
            "sensor": [
                "accBodyX_ftPerSec2",
                "accBodyY_ftPerSec2",
                "accBodyZ_ftPerSec2"
            ],
            "value": [
                -1.234567,
                4.56789,
                -34.77694
            ]
        }
    ]
}

Some code to beat it into a df such that each timestamp/sensor combo is a new row:

import json
import pandas as pd

def grab_json(json_filename):
    with open(json_filename, 'r') as f:
        json_str = f.read()
    json_dict = json.loads(json_str)
    return json_dict

def create_row_per_timestamp_and_sensor(data):
    result = []
    for sub_dict in data:
        # Make sure we have an equal number of sensors/values
        values = [None]*len(sub_dict['sensor']) if sub_dict['value'] is None else sub_dict['value']

        # Zip and iterate over each sensor/value respectively
        for sensor, value in zip(sub_dict['sensor'], values):
            result.append({'ts': sub_dict['ts'],
                           'sensor': sensor,
                           'value': value})
    return result


json_dict = grab_json("df.json")  # replace "df.json" with your filename
df_list = create_row_per_timestamp_and_sensor(json_dict['State'])
new_df = pd.DataFrame(df_list)
print(new_df)

outputs:

               sensor                        ts      value
0  accBodyX_ftPerSec2  2018-04-11T21:37:05.401Z        NaN
1  accBodyX_ftPerSec2  2018-04-11T21:37:05.100Z        NaN
2  accBodyY_ftPerSec2  2018-04-11T21:37:05.100Z        NaN
3  accBodyX_ftPerSec2  2018-04-11T21:37:05.901Z  -3.389190
4  accBodyY_ftPerSec2  2018-04-11T21:37:05.901Z  -2.004781
5  accBodyX_ftPerSec2  2018-04-11T21:37:05.901Z  -1.234567
6  accBodyY_ftPerSec2  2018-04-11T21:37:05.901Z   4.567890
7  accBodyZ_ftPerSec2  2018-04-11T21:37:05.901Z -34.776940
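
From there, plotting each sensor as its own series is one possible next step. A rough sketch, assuming matplotlib is installed and new_df is the frame above:

import matplotlib.pyplot as plt

new_df['ts'] = pd.to_datetime(new_df['ts'])  # turn the ISO strings into real timestamps
for sensor, group in new_df.groupby('sensor'):
    plt.plot(group['ts'], group['value'], marker='o', label=sensor)
plt.legend()
plt.show()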
