How to quickly un-nest a Pandas dataframe
A JSON file I need to work with imports into a dataframe with lists nested inside. Before conversion to a dataframe it is a list of nested dicts, and the file itself is nested.
Sample JSON:
{
  "State": [
    {
      "ts": "2018-04-11T21:37:05.401Z",
      "sensor": [
        "accBodyX_ftPerSec2"
      ],
      "value": null
    },
    {
      "ts": "2018-04-11T21:37:05.901Z",
      "sensor": [
        "accBodyX_ftPerSec2"
      ],
      "value": [
        -3.38919
      ]
    },
    {
      "ts": "2018-04-11T21:37:05.901Z",
      "sensor": [
        "accBodyY_ftPerSec2"
      ],
      "value": [
        -2.004781
      ]
    },
    {
      "ts": "2018-04-11T21:37:05.901Z",
      "sensor": [
        "accBodyZ_ftPerSec2"
      ],
      "value": [
        -34.77694
      ]
    }
  ]
}
The dataframe looks like:
sensor ts value
0 [accBodyX_ftPerSec2] 2018-04-11T21:37:05.901Z [-3.38919]
1 [accBodyY_ftPerSec2] 2018-04-11T21:37:05.901Z [-2.004781]
2 [accBodyZ_ftPerSec2] 2018-04-11T21:37:05.901Z [-34.77694]
Ultimately, I'd like to remove the nesting or find a way to work with it. The goal is to extract the values for a given sensor name, with the accompanying timestamps, into another dataframe for processing/plotting, something like this:
ts value
0 2018-04-11T21:37:05.901Z -3.38919
1 2018-04-11T21:37:06.401Z -3.00241
2 2018-04-11T21:37:06.901Z -3.87694
To remove the nesting I've done the following. It is slow on just 100,000 rows, though thankfully much faster than a for loop (made possible thanks to the post "python pandas operations on columns"):
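def func(row):
    # Each sensor entry is a one-element list; keep only the element
    row.sensor = row.sensor[0]
    # value is either None or a one-element list
    if type(row.value) is list:
        row.value = row.value[0]
    return row

# apply returns a new dataframe, so assign the result back
df = df.apply(func, axis=1)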
For working with the nesting, I'm able to extract individual values. For example:
print( df.iloc[:,2].iloc[1][0] )
-2.004781
However, trying to return a list of values from index 0 of each list within each row returns just the first value:
print( df.iloc[:,2].iloc[:][0] )
-3.38919
Of course I could do this with a for loop, but I know there's a way to do it with Pandas functions that I haven't been able to discover yet.
You may just need to do some manual cleaning-up before reading into a DataFrame:
>>> import json
>>> import pandas as pd
>>> def collapse_lists(data):
...     return [{k: v[0] if (isinstance(v, list) and len(v) == 1)
...              else v for k, v in d.items()} for d in data]
...
>>> with open('state.json') as f:
...     data = pd.DataFrame(collapse_lists(json.load(f)['State']))
...
>>> data
sensor ts value
0 accBodyX_ftPerSec2 2018-04-11T21:37:05.401Z NaN
1 accBodyX_ftPerSec2 2018-04-11T21:37:05.901Z -3.389190
2 accBodyY_ftPerSec2 2018-04-11T21:37:05.901Z -2.004781
3 accBodyZ_ftPerSec2 2018-04-11T21:37:05.901Z -34.776940
This loads the JSON file into a Python list of dictionaries, converts any length-1 lists into scalar values, and then loads the result into a DataFrame. That is admittedly not the most efficient approach, but the alternative of parsing the JSON yourself is probably overkill unless the file is massive.
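If the DataFrame has already been built with the lists still in place, a vectorized alternative to the row-wise apply is possible. The following is a minimal sketch, assuming every list holds at most one element and using a small hand-built frame (the name flat is only illustrative). It relies on the .str accessor, which also indexes into list elements of an object column and yields NaN where the cell is None.

```python
import pandas as pd

# Small stand-in for the nested dataframe from the question
df = pd.DataFrame({
    "sensor": [["accBodyX_ftPerSec2"], ["accBodyX_ftPerSec2"], ["accBodyY_ftPerSec2"]],
    "ts": ["2018-04-11T21:37:05.401Z", "2018-04-11T21:37:05.901Z", "2018-04-11T21:37:05.901Z"],
    "value": [None, [-3.38919], [-2.004781]],
})

# .str[0] takes element 0 of each list; None cells become NaN
flat = df.assign(
    sensor=df["sensor"].str[0],
    value=df["value"].str[0].astype(float),
)
print(flat)
```

Unlike collapse_lists, this drops everything after the first element, so it is only appropriate when the lists are known to be length 1.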
Finally, to convert to datetime:
>>> data['ts'] = pd.to_datetime(data['ts'])
>>> data.dtypes
sensor object
ts datetime64[ns]
value float64
dtype: object
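With the frame flattened, pulling out the ts/value pairs for one sensor (the goal stated in the question) is a boolean filter. This is a sketch only: values_for_sensor is a hypothetical helper name, the frame is rebuilt by hand here so the snippet is self-contained, and dropping rows with a missing value is an assumption about what is wanted for plotting.

```python
import pandas as pd

# Hand-built copy of the flattened frame shown above
data = pd.DataFrame({
    "sensor": ["accBodyX_ftPerSec2", "accBodyX_ftPerSec2",
               "accBodyY_ftPerSec2", "accBodyZ_ftPerSec2"],
    "ts": ["2018-04-11T21:37:05.401Z", "2018-04-11T21:37:05.901Z",
           "2018-04-11T21:37:05.901Z", "2018-04-11T21:37:05.901Z"],
    "value": [None, -3.38919, -2.004781, -34.77694],
})
data["ts"] = pd.to_datetime(data["ts"])


def values_for_sensor(frame, name):
    """Return the ts/value columns for one sensor, without missing values."""
    subset = frame.loc[frame["sensor"] == name, ["ts", "value"]]
    return subset.dropna(subset=["value"]).reset_index(drop=True)


acc_x = values_for_sensor(data, "accBodyX_ftPerSec2")
print(acc_x)
```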
You may also want to consider converting sensor to a categorical data type to save a possibly significant amount of memory:
The memory usage of a Categorical is proportional to the number of categories plus the length of the data. In contrast, an object dtype is a constant times the length of the data. (source)
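A rough way to check the saving on your own data is to compare memory_usage before and after the conversion. The sketch below uses a synthetic column of 30,000 repeated sensor names as a stand-in; the actual numbers depend on the pandas version and the data, so none are quoted here.

```python
import pandas as pd

# Synthetic sensor column: three names repeated many times
names = ["accBodyX_ftPerSec2", "accBodyY_ftPerSec2", "accBodyZ_ftPerSec2"]
sensor = pd.Series(names * 10000, dtype=object)

as_category = sensor.astype("category")

# deep=True counts the Python string objects, not just the pointers
object_bytes = sensor.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)
```

On the real frame the conversion itself would be data['sensor'] = data['sensor'].astype('category').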
In explicit-loop form, this would look like:
def collapse_lists(data):
    result = []
    for d in data:
        entry = {}
        for k, v in d.items():
            # Unwrap the value (not the key) when it is a length-1 list
            if isinstance(v, list) and len(v) == 1:
                entry[k] = v[0]
            else:
                entry[k] = v
        result.append(entry)
    return result
In case you ever have entries with multiple values/sensors, here's some code that might help.
The test JSON (modified to have multiple values/sensors):
{
  "State": [
    {
      "ts": "2018-04-11T21:37:05.401Z",
      "sensor": [
        "accBodyX_ftPerSec2"
      ],
      "value": null
    },
    {
      "ts": "2018-04-11T21:37:05.100Z",
      "sensor": [
        "accBodyX_ftPerSec2",
        "accBodyY_ftPerSec2"
      ],
      "value": null
    },
    {
      "ts": "2018-04-11T21:37:05.901Z",
      "sensor": [
        "accBodyX_ftPerSec2"
      ],
      "value": [
        -3.38919
      ]
    },
    {
      "ts": "2018-04-11T21:37:05.901Z",
      "sensor": [
        "accBodyY_ftPerSec2"
      ],
      "value": [
        -2.004781
      ]
    },
    {
      "ts": "2018-04-11T21:37:05.901Z",
      "sensor": [
        "accBodyX_ftPerSec2",
        "accBodyY_ftPerSec2",
        "accBodyZ_ftPerSec2"
      ],
      "value": [
        -1.234567,
        4.56789,
        -34.77694
      ]
    }
  ]
}
Some code to beat it into a df such that each timestamp/sensor combo is a new row:
import json
import pandas as pd

def grab_json(json_filename):
    # Read the whole file and parse it into Python objects
    with open(json_filename, 'r') as f:
        json_str = f.read()
    json_dict = json.loads(json_str)
    return json_dict

def create_row_per_timestamp_and_sensor(data):
    result = []
    for sub_dict in data:
        # Make sure we have an equal number of sensors/values
        values = [None]*len(sub_dict['sensor']) if sub_dict['value'] is None else sub_dict['value']
        # Zip and iterate over each sensor/value respectively
        for sensor, value in zip(sub_dict['sensor'], values):
            result.append({'ts': sub_dict['ts'],
                           'sensor': sensor,
                           'value': value})
    return result

json_dict = grab_json("df.json")  # replace "df.json" with your filename
df_list = create_row_per_timestamp_and_sensor(json_dict['State'])
new_df = pd.DataFrame(df_list)
print(new_df)
Output:
sensor ts value
0 accBodyX_ftPerSec2 2018-04-11T21:37:05.401Z NaN
1 accBodyX_ftPerSec2 2018-04-11T21:37:05.100Z NaN
2 accBodyY_ftPerSec2 2018-04-11T21:37:05.100Z NaN
3 accBodyX_ftPerSec2 2018-04-11T21:37:05.901Z -3.389190
4 accBodyY_ftPerSec2 2018-04-11T21:37:05.901Z -2.004781
5 accBodyX_ftPerSec2 2018-04-11T21:37:05.901Z -1.234567
6 accBodyY_ftPerSec2 2018-04-11T21:37:05.901Z 4.567890
7 accBodyZ_ftPerSec2 2018-04-11T21:37:05.901Z -34.776940
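As an alternative to the explicit loop, the same one-row-per-sensor shape can probably be reached with DataFrame.explode. This is only a sketch under two assumptions: pandas 1.3 or newer (needed for exploding several columns at once), and sensor/value lists of equal length wherever value is not null. The records below are a shortened hand-typed subset of the test JSON.

```python
import pandas as pd

records = [
    {"ts": "2018-04-11T21:37:05.401Z",
     "sensor": ["accBodyX_ftPerSec2"],
     "value": None},
    {"ts": "2018-04-11T21:37:05.100Z",
     "sensor": ["accBodyX_ftPerSec2", "accBodyY_ftPerSec2"],
     "value": None},
    {"ts": "2018-04-11T21:37:05.901Z",
     "sensor": ["accBodyX_ftPerSec2", "accBodyY_ftPerSec2", "accBodyZ_ftPerSec2"],
     "value": [-1.234567, 4.56789, -34.77694]},
]
df = pd.DataFrame(records)

# explode needs matching lengths, so pad missing values with one None per sensor
df["value"] = [
    v if isinstance(v, list) else [None] * len(s)
    for s, v in zip(df["sensor"], df["value"])
]

# One row per (ts, sensor) pair; None becomes NaN after the float cast
long_df = df.explode(["sensor", "value"], ignore_index=True)
long_df["value"] = long_df["value"].astype(float)
print(long_df)
```

The padding step mirrors the [None]*len(...) line in create_row_per_timestamp_and_sensor; without it, explode raises an error on rows where the two columns have different element counts.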