简体   繁体   中英

pandas join not working in for loop after first iteration

I am trying to explode a nested list inside a dict by the help of pandas. In a loop I join the list with every row. Strangely in the second iteration the join seems not to work properly. Perhaps there is something fundamental I do not understand about pandas, but I can't figure out why the join in the iteration only works on the first iteration and in the following ones the join does not work, the end the result looks like this:

     key  amount   id  key_r   code   name  key_l
0.0    0    12.0  1.0    0.0    NaN    NaN    NaN
1.0    0    23.0  NaN    0.0    NaN    NaN    NaN
NaN    1     NaN  NaN    NaN  test2  test2    0.0

instead of from the first iteration (strangely the key becomes 1 in the first operation):

   key  amount   id  key_r   code   name  key_l
0    0      12  1.0      0  test1  test1      0
1    0      23  NaN      0  test1  test1      0

Code:

data = [
{
    "code": "test1",
    "name": "test1",
    "sub_list": [
        {"amount": 10, "id": 2},
        {"amount": 20},
    ],
},
{
    "code": "test2",
    "name": "test2",
    "sub_list": [
        {"amount": 12, "id": 1},
        {"amount": 23},
    ],
}

]
data_df = pd.DataFrame(data)
for ix, row in data_df.iterrows():
    sub_list_df = pd.DataFrame(row['sub_list'])
    row_df = row.to_frame().transpose()
    main_df = row_df.loc[:, row_df.columns != 'sub_list']
    main_df['key'] = 0
    sub_list_df['key'] = 0
    print(main_df)
    print(sub_list_df)
    tmp_df = sub_list_df.join(main_df, on=['key'], how="outer", lsuffix="_r", rsuffix="_l")
    print(tmp_df)

Any advice?

Here's a simpler way to do without using explicit for loop:

# explode the dict
f = data_df.explode('sub_list')

# convert exploded dict into separate columns
f = pd.concat([f, f['sub_list'].apply(pd.Series)], axis=1).drop('sub_list', axis=1)

print(f)

    code   name  amount   id
0  test1  test1    10.0  2.0
0  test1  test1    20.0  NaN
1  test2  test2    12.0  1.0
1  test2  test2    23.0  NaN

The problem is that value 1 is not coming from your dataframe column "key". Instead it takes the index values which are 0 and 1 since you have two rows. one of the solutions to solve this is to set the key column as index in both dataframes. for example:

main_df['_key'] = 0
sub_list_df['_key'] = 0
tmp_df = sub_list_df.set_index('_key').join(main_df.set_index('_key'), on='_key', how="outer")

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM