
Expand Pandas DataFrame Column with JSON Object

I'm looking for a clean, fast way to expand a pandas dataframe column which contains a JSON object (essentially a dict of nested dicts), so that I could have one column for each element of the JSON column in json-normalized form; however, this needs to retain all of the original dataframe columns as well. In some instances, this dict might have a common identifier I could use to merge with the original dataframe, but not always. For example:

import pandas as pd
import numpy as np
df = pd.DataFrame([
    {
        'col1': 'a',
        'col2': {'col2.1': 'a1', 'col2.2': {'col2.2.1': 'a2.1', 'col2.2.2': 'a2.2'}},
        'col3': '3a'
    },
    {
        'col1': 'b',
        'col2': np.nan,
        'col3': '3b'
    },
    {
        'col1': 'c',
        'col2': {'col2.1': 'c1', 'col2.2': {'col2.2.1': np.nan, 'col2.2.2': 'c2.2'}},
        'col3': '3c'
    }
])

Here is a sample dataframe. As you can see, col2 is in each case either a dict with another nested dict inside it, or a null value, containing nested elements I would like to be able to access. (For the nulls, I would want to be able to handle them at any level, whether the entire element in the dataframe is null or just specific elements within the row.) In this case, they have no ID that could link up to the original dataframe. My end goal would essentially be to have this:

final = pd.DataFrame([
    {
        'col1': 'a',
        'col2.1': 'a1',
        'col2.2.col2.2.1': 'a2.1',
        'col2.2.col2.2.2': 'a2.2',
        'col3': '3a'
    },
    {
        'col1': 'b',
        'col2.1': np.nan,
        'col2.2.col2.2.1': np.nan,
        'col2.2.col2.2.2': np.nan,
        'col3': '3b'
    },
    {
        'col1': 'c',
        'col2.1': 'c1',
        'col2.2.col2.2.1': np.nan,
        'col2.2.col2.2.2': 'c2.2',
        'col3': '3c'
    }
])

In my case, the dict could have up to 50 nested key-value pairs, and I might only need to access a few of them. Additionally, I have about 50-100 other columns of data I need to preserve alongside these new columns (so an end goal of around 100-150 columns). So I suppose there might be two methods I'd be looking for: getting a column for each value in the dict, or getting columns for just a select few. For the former option I haven't yet found a good workaround; I've looked at some prior answers but found them rather confusing, and most threw errors. This seems especially difficult when there are dicts nested inside the column. To attempt the second approach, I tried the following code:

def get_val_from_dict(row, col, label):
    # treat a missing dict as missing at every level
    if pd.isnull(row[col]):
        return np.nan

    # flatten the nested dict for this single row
    norm = pd.json_normalize(row[col])

    try:
        # json_normalize returns a one-row frame; pull out the scalar
        return norm.at[0, label]
    except KeyError:
        return np.nan


needed_cols = ['col2.1', 'col2.2.col2.2.1', 'col2.2.col2.2.2']


for label in needed_cols:
    df[label] = df.apply(get_val_from_dict, args = ('col2', label), axis = 1)

This seemed to work for this example, and I'm perfectly happy with the output, but for my actual dataframe, which has substantially more data, it was a bit slow and, I would imagine, not a great or scalable solution. Would anyone be able to offer an alternative to this sluggish approach?

(Also, apologies about the massive amount of nesting in my naming here. If helpful, I am adding several images of the dataframes below: the original, then the target, and then the current output.)

[Images: original df, target df, current output]

Instead of using apply or pd.json_normalize on the column that holds the dictionaries, convert the whole data frame to a dictionary, use pd.json_normalize on that, and finally pick the fields you wish to keep. This works because, while the individual column for any given row may be null, the entire row will not be.

Example:

# note that this method also prefixes an extra `col2.`
# to the names of the de-nested data, which is not present
# in the example output; the rename below restores your
# desired column names.
import re
final_cols = ['col1', 'col2.col2.1', 'col2.col2.2.col2.2.1', 'col2.col2.2.col2.2.2', 'col3']
out = pd.json_normalize(df.to_dict(orient='records'))[final_cols]
out.rename(columns=lambda x: re.sub(r'^col2\.', '', x), inplace=True)
out
# out:
  col1 col2.1 col2.2.col2.2.1 col2.2.col2.2.2 col3
0    a     a1            a2.1            a2.2   3a
1    b    NaN             NaN             NaN   3b
2    c     c1             NaN            c2.2   3c
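Since you mention having 50-100 other columns to preserve, a variant that avoids enumerating every column by hand is to select every non-nested column from the flattened frame plus just the few flattened fields you need. A minimal sketch (here `needed` is a hypothetical list of the flattened names you actually care about, not something from your example):

flat = pd.json_normalize(df.to_dict(orient='records'))

# hypothetical: the few flattened fields you actually need
needed = ['col2.col2.1', 'col2.col2.2.col2.2.2']

# keep every original column except the nested one, plus `needed`
other = [c for c in df.columns if c != 'col2']
subset = flat[other + needed]

The same rename as above can then strip the leading `col2.` prefix from the selected fields.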

but for my actual dataframe which had substantially more data, this was quite slow

Right now I have 1,000 rows of data, each row has about 100 columns, and the column I want to expand has about 50 nested key/value pairs in it. I would expect the data to scale up to 100k rows with the same number of columns over the next year or so, so I'm hoping to have a scalable process ready to go at that point.

pd.json_normalize should be faster than your attempt, but it is not faster than doing the flattening in pure Python, so you might get more performance by writing a custom transform function and constructing the dataframe as below.

out = pd.DataFrame(transform(x) for x in df.to_dict(orient='records'))
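The transform function is left unspecified above; here is one possible pure-Python implementation, a sketch that recursively flattens nested dicts into dot-separated keys (the function name and the sep parameter are assumptions, not part of the original answer):

def transform(record, sep='.'):
    # recursively flatten nested dicts into dot-separated keys;
    # scalar placeholders such as a NaN `col2` pass through as-is
    # (just as with pd.json_normalize) and can be dropped afterwards
    flat = {}
    def _flatten(obj, prefix):
        for key, value in obj.items():
            name = f'{prefix}{sep}{key}' if prefix else key
            if isinstance(value, dict):
                _flatten(value, name)
            else:
                flat[name] = value
    _flatten(record, '')
    return flat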
