Construct pandas DataFrame from items in nested dictionary

Suppose I have a nested dictionary 'user_dict' with structure:

  • Level 1: UserId (Long Integer)
  • Level 2: Category (String)
  • Level 3: Assorted Attributes (floats, ints, etc.)

For example, an entry of this dictionary would be:

user_dict[12] = {
    "Category 1": {"att_1": 1, 
                   "att_2": "whatever"},
    "Category 2": {"att_1": 23, 
                   "att_2": "another"}}

Each item in user_dict has the same structure, and user_dict contains a large number of items which I want to feed to a pandas DataFrame, constructing the series from the attributes. In this case a hierarchical index would be useful for the purpose.

Specifically, my question is whether there exists a way to help the DataFrame constructor understand that the series should be built from the values of the "level 3" in the dictionary?

If I try something like:

df = pandas.DataFrame(user_dict)

The items in "level 1" (the UserIds) are taken as columns, which is the opposite of what I want to achieve (have the UserIds as the index).

I know I could construct the series after iterating over the dictionary entries, but if there is a more direct way this would be very useful. A similar question would be asking whether it is possible to construct a pandas DataFrame from JSON objects listed in a file.

A pandas MultiIndex consists of a list of tuples. So the most natural approach would be to reshape your input dict so that its keys are tuples corresponding to the multi-index values you require. Then you can just construct your dataframe using pd.DataFrame.from_dict, using the option orient='index':

import pandas as pd

user_dict = {12: {'Category 1': {'att_1': 1, 'att_2': 'whatever'},
                  'Category 2': {'att_1': 23, 'att_2': 'another'}},
             15: {'Category 1': {'att_1': 10, 'att_2': 'foo'},
                  'Category 2': {'att_1': 30, 'att_2': 'bar'}}}

# Key each innermost dict by a (UserId, Category) tuple so that
# from_dict builds a two-level MultiIndex from the keys.
pd.DataFrame.from_dict({(i, j): user_dict[i][j]
                        for i in user_dict.keys()
                        for j in user_dict[i].keys()},
                       orient='index')


               att_1     att_2
12 Category 1      1  whatever
   Category 2     23   another
15 Category 1     10       foo
   Category 2     30       bar
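Once the frame has a two-level index, rows can be selected by user or by category. The following is a minimal sketch of that, assuming the same user_dict as above; the variable names one_user and one_category are only illustrative.

```python
import pandas as pd

user_dict = {12: {'Category 1': {'att_1': 1, 'att_2': 'whatever'},
                  'Category 2': {'att_1': 23, 'att_2': 'another'}},
             15: {'Category 1': {'att_1': 10, 'att_2': 'foo'},
                  'Category 2': {'att_1': 30, 'att_2': 'bar'}}}

# Same reshaping as above: (UserId, Category) tuples become the index.
df = pd.DataFrame.from_dict(
    {(user_id, category): attrs
     for user_id, categories in user_dict.items()
     for category, attrs in categories.items()},
    orient='index',
)

# All categories for a single user (selects on the outer index level).
one_user = df.loc[12]

# A single category across all users (selects on the inner index level).
one_category = df.xs('Category 1', level=1)

print(one_user)
print(one_category)
```

Here df.loc[12] drops the outer level and leaves the categories as the index, while xs(..., level=1) drops the inner level and leaves the user ids.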

An alternative approach would be to build your dataframe up by concatenating the component dataframes:

user_ids = []
frames = []

# dict.items() replaces the Python 2-only iteritems()
for user_id, d in user_dict.items():
    user_ids.append(user_id)
    frames.append(pd.DataFrame.from_dict(d, orient='index'))

pd.concat(frames, keys=user_ids)

               att_1     att_2
12 Category 1      1  whatever
   Category 2     23   another
15 Category 1     10       foo
   Category 2     30       bar
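If you also want the two index levels to carry names, pd.concat takes a names argument alongside keys. A small sketch, assuming the same user_dict; the level names 'UserId' and 'Category' are my own choice, taken from the question's description rather than from the answer's code.

```python
import pandas as pd

user_dict = {12: {'Category 1': {'att_1': 1, 'att_2': 'whatever'},
                  'Category 2': {'att_1': 23, 'att_2': 'another'}},
             15: {'Category 1': {'att_1': 10, 'att_2': 'foo'},
                  'Category 2': {'att_1': 30, 'att_2': 'bar'}}}

# One sub-frame per user, indexed by category.
frames = [pd.DataFrame.from_dict(categories, orient='index')
          for categories in user_dict.values()]

# keys supplies the outer level; names labels both levels of the result.
df = pd.concat(frames, keys=list(user_dict.keys()),
               names=['UserId', 'Category'])

print(df.index.names)
```

Named levels make later calls such as df.xs('Category 1', level='Category') easier to read than positional level numbers.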

pd.concat accepts a dictionary. With this in mind, it is possible to improve upon the currently accepted answer in terms of simplicity and performance by using a dictionary comprehension to build a dictionary mapping keys to sub-frames.

pd.concat({k: pd.DataFrame(v).T for k, v in user_dict.items()}, axis=0)

Or,

pd.concat({
        k: pd.DataFrame.from_dict(v, 'index') for k, v in user_dict.items()
    }, 
    axis=0)

              att_1     att_2
12 Category 1     1  whatever
   Category 2    23   another
15 Category 1    10       foo
   Category 2    30       bar
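The two variants print the same table, but they may not give the same column dtypes: transposing a frame whose columns mix ints and strings leaves everything as object, whereas from_dict with orient='index' builds each attribute column directly. The sketch below compares them under that assumption (the names via_transpose and via_from_dict are illustrative); it is worth checking df.dtypes on your own pandas version.

```python
import pandas as pd

user_dict = {12: {'Category 1': {'att_1': 1, 'att_2': 'whatever'},
                  'Category 2': {'att_1': 23, 'att_2': 'another'}},
             15: {'Category 1': {'att_1': 10, 'att_2': 'foo'},
                  'Category 2': {'att_1': 30, 'att_2': 'bar'}}}

# Variant 1: build each sub-frame column-wise, then transpose.
via_transpose = pd.concat(
    {k: pd.DataFrame(v).T for k, v in user_dict.items()}, axis=0)

# Variant 2: build each sub-frame row-wise with from_dict.
via_from_dict = pd.concat(
    {k: pd.DataFrame.from_dict(v, orient='index')
     for k, v in user_dict.items()}, axis=0)

print(via_transpose.dtypes)
print(via_from_dict.dtypes)

# infer_objects() asks pandas to re-infer dtypes for object columns,
# which should recover a numeric dtype for att_1 after the transpose.
print(via_transpose.infer_objects().dtypes)
```

If numeric columns matter downstream (aggregation, plotting), the from_dict variant or an explicit infer_objects() call avoids carrying object columns around.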

I used to use a for loop to iterate through the dictionary as well, but one thing I've found that works much faster is to convert to a panel and then to a dataframe. (Note that pd.Panel was deprecated in pandas 0.20 and removed in 0.25, so this only works on older versions.) Say you have a dictionary d:

import datetime
import pandas as pd
d
{'RAY Index': {datetime.date(2014, 11, 3): {'PX_LAST': 1199.46,
'PX_OPEN': 1200.14},
datetime.date(2014, 11, 4): {'PX_LAST': 1195.323, 'PX_OPEN': 1197.69},
datetime.date(2014, 11, 5): {'PX_LAST': 1200.936, 'PX_OPEN': 1195.32},
datetime.date(2014, 11, 6): {'PX_LAST': 1206.061, 'PX_OPEN': 1200.62}},
'SPX Index': {datetime.date(2014, 11, 3): {'PX_LAST': 2017.81,
'PX_OPEN': 2018.21},
datetime.date(2014, 11, 4): {'PX_LAST': 2012.1, 'PX_OPEN': 2015.81},
datetime.date(2014, 11, 5): {'PX_LAST': 2023.57, 'PX_OPEN': 2015.29},
datetime.date(2014, 11, 6): {'PX_LAST': 2031.21, 'PX_OPEN': 2023.33}}}

The command

pd.Panel(d)
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 2 (major_axis) x 4 (minor_axis)
Items axis: RAY Index to SPX Index
Major_axis axis: PX_LAST to PX_OPEN
Minor_axis axis: 2014-11-03 to 2014-11-06

where pd.Panel(d)[item] yields a dataframe

pd.Panel(d)['SPX Index']
         2014-11-03  2014-11-04  2014-11-05  2014-11-06
PX_LAST     2017.81     2012.10     2023.57     2031.21
PX_OPEN     2018.21     2015.81     2015.29     2023.33

You can then call to_frame() to turn it into a dataframe. I use reset_index as well to turn the major and minor axes into columns rather than have them as indices.

pd.Panel(d).to_frame().reset_index()
major   minor      RAY Index    SPX Index
PX_LAST 2014-11-03  1199.460    2017.81
PX_LAST 2014-11-04  1195.323    2012.10
PX_LAST 2014-11-05  1200.936    2023.57
PX_LAST 2014-11-06  1206.061    2031.21
PX_OPEN 2014-11-03  1200.140    2018.21
PX_OPEN 2014-11-04  1197.690    2015.81
PX_OPEN 2014-11-05  1195.320    2015.29
PX_OPEN 2014-11-06  1200.620    2023.33

Finally, if you don't like the way the frame looks, you can use the transpose function of Panel to change the appearance before calling to_frame(); see the documentation here: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.Panel.transpose.html

Just as an example:

pd.Panel(d).transpose(2,0,1).to_frame().reset_index()
major        minor  2014-11-03  2014-11-04  2014-11-05  2014-11-06
RAY Index   PX_LAST 1199.46    1195.323     1200.936    1206.061
RAY Index   PX_OPEN 1200.14    1197.690     1195.320    1200.620
SPX Index   PX_LAST 2017.81    2012.100     2023.570    2031.210
SPX Index   PX_OPEN 2018.21    2015.810     2015.290    2023.330

Hope this helps.
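Since pd.Panel is not available in current pandas releases, a similar frame can be produced without it. This is a sketch of one possible substitute, not the answer's method: it flattens the dict into a Series keyed by (field, date, item) tuples and unstacks the item level. The data is a two-date subset of d above, and the level names 'major', 'minor' and 'item' are chosen here only to mirror the Panel output.

```python
import datetime

import pandas as pd

d = {'RAY Index': {datetime.date(2014, 11, 3): {'PX_LAST': 1199.46,
                                                'PX_OPEN': 1200.14},
                   datetime.date(2014, 11, 4): {'PX_LAST': 1195.323,
                                                'PX_OPEN': 1197.69}},
     'SPX Index': {datetime.date(2014, 11, 3): {'PX_LAST': 2017.81,
                                                'PX_OPEN': 2018.21},
                   datetime.date(2014, 11, 4): {'PX_LAST': 2012.1,
                                                'PX_OPEN': 2015.81}}}

# Flatten to {(field, date, item): value}.
records = {(field, date, item): value
           for item, by_date in d.items()
           for date, fields in by_date.items()
           for field, value in fields.items()}

index = pd.MultiIndex.from_tuples(list(records),
                                  names=['major', 'minor', 'item'])
series = pd.Series(list(records.values()), index=index)

# Move the item level into the columns, then turn the remaining
# index levels into ordinary columns, as reset_index() did above.
frame = series.unstack('item').reset_index()
print(frame)
```

Reordering the tuple (for example (item, field, date)) and unstacking a different level gives the transposed layouts shown with Panel.transpose.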

In case someone wants to get the data frame in a "long format" (leaf values have the same type) without multiindex, you can do this:

pd.DataFrame.from_records(
    [
        (level1, level2, level3, leaf)
        for level1, level2_dict in user_dict.items()
        for level2, level3_dict in level2_dict.items()
        for level3, leaf in level3_dict.items()
    ],
    columns=['UserId', 'Category', 'Attribute', 'value']
)

    UserId    Category Attribute     value
0       12  Category 1     att_1         1
1       12  Category 1     att_2  whatever
2       12  Category 2     att_1        23
3       12  Category 2     att_2   another
4       15  Category 1     att_1        10
5       15  Category 1     att_2       foo
6       15  Category 2     att_1        30
7       15  Category 2     att_2       bar
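The long format can also be turned back into the multiindexed layout the question asks for. A minimal sketch, assuming the same user_dict and column names as above; long_df and wide_df are illustrative names.

```python
import pandas as pd

user_dict = {12: {'Category 1': {'att_1': 1, 'att_2': 'whatever'},
                  'Category 2': {'att_1': 23, 'att_2': 'another'}},
             15: {'Category 1': {'att_1': 10, 'att_2': 'foo'},
                  'Category 2': {'att_1': 30, 'att_2': 'bar'}}}

# Long format: one row per (UserId, Category, Attribute) leaf.
long_df = pd.DataFrame.from_records(
    [(level1, level2, level3, leaf)
     for level1, level2_dict in user_dict.items()
     for level2, level3_dict in level2_dict.items()
     for level3, leaf in level3_dict.items()],
    columns=['UserId', 'Category', 'Attribute', 'value'],
)

# Index by all three keys, then pivot the Attribute level into columns.
wide_df = (long_df
           .set_index(['UserId', 'Category', 'Attribute'])['value']
           .unstack('Attribute'))
print(wide_df)
```

Because the value column mixes ints and strings, the resulting columns are likely to be object dtype; infer_objects() can be applied afterwards if numeric dtypes are needed.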

(I know the original question probably (I) wants Levels 1 and 2 as a multiindex and Level 3 as columns, and (II) asks about ways other than iterating over the values in the dict. But I hope this answer is still relevant and useful: (I) to people like me who have tried to find a way to get the nested dict into this shape, when Google only returns this question; and (II) because other answers involve some iteration as well, and I find this approach flexible and easy to read. Not sure about performance, though.)

For other ways to represent the data, you don't need to do much. For example, if you just want the "outer" key to be an index, the "inner" key to be columns and the values to be cell values, this would do the trick:

df = pd.DataFrame.from_dict(user_dict, orient='index')
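This form fits a dict with only two levels. Here is a small sketch with a hypothetical two-level dict (flat_users is made up for illustration and is not the three-level user_dict from the question, where the cells would presumably end up holding the innermost dicts).

```python
import pandas as pd

# Hypothetical two-level dict: outer key = user id, inner key = attribute.
flat_users = {12: {'att_1': 1, 'att_2': 'whatever'},
              15: {'att_1': 10, 'att_2': 'foo'}}

# Outer keys become the index, inner keys become the columns.
df = pd.DataFrame.from_dict(flat_users, orient='index')
print(df)
```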


Building on the verified answer, for me this worked best:

ab = pd.concat({k: pd.DataFrame(v).T for k, v in data.items()}, axis=0)
ab.T

This solution should work for arbitrary depth by flattening dictionary keys to a tuple chain:

import pandas as pd


def flatten_dict(nested_dict):
    res = {}
    if isinstance(nested_dict, dict):
        for k in nested_dict:
            flattened_dict = flatten_dict(nested_dict[k])
            for key, val in flattened_dict.items():
                key = list(key)
                key.insert(0, k)
                res[tuple(key)] = val
    else:
        res[()] = nested_dict
    return res


def nested_dict_to_df(values_dict):
    flat_dict = flatten_dict(values_dict)
    df = pd.DataFrame.from_dict(flat_dict, orient="index")
    df.index = pd.MultiIndex.from_tuples(df.index)
    df = df.unstack(level=-1)
    df.columns = df.columns.map("{0[1]}".format)
    return df
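To make the flattening step concrete, here is a compact generator-based sketch of the same idea applied to the question's user_dict. It is a variant written for illustration, not the code above: iter_paths is a hypothetical helper, and it relies on pd.Series building a MultiIndex from tuple keys.

```python
import pandas as pd


def iter_paths(node, prefix=()):
    """Yield (key_path, leaf) pairs for every leaf of a nested dict."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from iter_paths(child, prefix + (key,))
    else:
        yield prefix, node


user_dict = {12: {'Category 1': {'att_1': 1, 'att_2': 'whatever'},
                  'Category 2': {'att_1': 23, 'att_2': 'another'}},
             15: {'Category 1': {'att_1': 10, 'att_2': 'foo'},
                  'Category 2': {'att_1': 30, 'att_2': 'bar'}}}

# {(12, 'Category 1', 'att_1'): 1, ...}: one entry per leaf.
leaves = dict(iter_paths(user_dict))

# Tuple keys become a three-level MultiIndex; unstacking the last
# level moves the attribute names into the columns.
df = pd.Series(leaves).unstack(level=-1)
print(df)
```

Both versions assume every leaf sits at the same depth; with ragged nesting the key tuples would have different lengths and building the MultiIndex would not work as written.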
