简体   繁体   中英

How do I extract data in JSON in a Pandas column and structure it by stacking it?

Suppose I have the following DataFame:

df = pd.DataFrame(
    {
        'id':[112, 130, 200],
        'name':['Lara', 'Kasia', 'Rocco'],
        'level/income':[{1:400, 2:800}, {1:100}, {1:1000, 2:2000, 3:5000}]
    }
)

df

I would like this to be the end result:

df2

I tried to do this on my own but I failed spectacularly. I was trying to recreate the original DataFrame by finding which id and name would appear multiple times but this proved too much.

len_need = []
id2 = []
name2 = []
level2 = []
income2 = []

for index,row in df.iterrows():
    len_need.append(len(row['level/income']))

I'm sure there's a better way to do this.

try this:

(df.set_index(['id', 'name'])
 .squeeze()
 .apply(pd.Series)
 .rename_axis('level', axis=1)
 .stack()
 .reset_index(name='income'))

>>>
    id  name    level   income
0   112 Lara    1       400.0
1   112 Lara    2       800.0
2   130 Kasia   1       100.0
3   200 Rocco   1       1000.0
4   200 Rocco   2       2000.0
5   200 Rocco   3       5000.0

Variation using json_normalize :

from pandas import DataFrame, json_normalize

df = DataFrame(
    {
        "id": [112, 130, 200],
        "name": ["Lara", "Kasia", "Rocco"],
        "level/income": [{1: 400, 2: 800}, {1: 100}, {1: 1000, 2: 2000, 3: 5000}],
    }
)

income_cols = list(range(1, 3 + 1))
id_cols = ["id", "name"]

final = (
    df["level/income"]
    .pipe(json_normalize)
    .assign(id=df["id"], name=df["name"])
    .melt(
        id_vars=id_cols, value_vars=income_cols, var_name="level", value_name="income"
    )
    .dropna()
)

print(final)
#     id   name level  income
# 0  112   Lara     0   400.0
# 1  130  Kasia     0   100.0
# 2  200  Rocco     0  1000.0
# 3  112   Lara     1   800.0
# 5  200  Rocco     1  2000.0
# 8  200  Rocco     2  5000.0

I would extract the level/income column, explode it, then join back:

# this remove the `level/income` column from the dataframe
s = df.pop('level/income')

df.join(pd.DataFrame(s.tolist(), index=s.index)
            .rename_axis(columns='level')
            .stack()
            .reset_index(level='level', name='income')
       )

Output:

    id   name  level  income
0  112   Lara      1   400.0
0  112   Lara      2   800.0
1  130  Kasia      1   100.0
2  200  Rocco      1  1000.0
2  200  Rocco      2  2000.0
2  200  Rocco      3  5000.0

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM