简体   繁体   中英

pandas: dataframe from dict with comma separated values

I'm trying to create a DataFrame from a nested dictionary, where the values are in comma separated strings.

Each value is nested in a dict, such as:

dict = {"1":{
             "event":"A, B, C"},
        "2":{
             "event":"D, B, A, C"},
        "3":{
             "event":"D, B, C"}
        }

My desired output is:

    A   B   C   D
0   A   B   C   NaN
1   A   B   C   D
2   NaN B   C   D

All I have so far is converting the dict to dataframe and splitting the items in each list. But I'm not sure this is getting me any closer to my objective.

df = pd.DataFrame(dict)
Out[439]: 
           1           2        3
event  A, B, C  D, B, A, C  D, B, C

In [441]: df.loc['event'].str.split(',').apply(pd.Series)                                                                                                                                                                                 
Out[441]: 
   0   1   2    3
1  A   B   C  NaN
2  D   B   A    C
3  D   B   C  NaN

Any help is appreciated. Thanks

You can use a couple of comprehensions to massage the nested dict into a better format for creation of a DataFrame that flags if an entry for the column exists or not:

the_dict = {"1":{
             "event":"A, B, C"},
        "2":{
             "event":"D, B, A, C"},
        "3":{
             "event":"D, B, C"}
        }

df = pd.DataFrame([[{z:1 for z in y.split(', ')} for y in x.values()][0] for x in the_dict.values()])
>>> df

     A  B  C    D
0  1.0  1  1  NaN
1  1.0  1  1  1.0
2  NaN  1  1  1.0

Once you've made the DataFrame you can simply loop through the columns and convert the values that flagged the existence of the letter into a letter using the where method(below this does where NaN leave as NaN, otherwise it inserts the letter for the column):

for col in df.columns:
    df_mask = df[col].isnull()
    df[col]=df[col].where(df_mask,col)
>>> df

     A  B  C    D
0    A  B  C  NaN
1    A  B  C    D
2  NaN  B  C    D

Based on @merlin's suggestion you can go straight to the answer within the comprehension:

df = pd.DataFrame([[{z:z for z in y.split(', ')} for y in x.values()][0] for x in the_dict.values()])
>>> df
     A  B  C    D
0    A  B  C  NaN
1    A  B  C    D
2  NaN  B  C    D

From what you have(modified the split a little to strip the extra spaces) df1 , you can probably just stack the result and use pd.crosstab() on the index and value column:

df1 = df.loc['event'].str.split('\s*,\s*').apply(pd.Series) 

df2 = df1.stack().rename('value').reset_index()
pd.crosstab(df2.level_0, df2.value)

#   value   A   B   C   D
# level_0               
#       1   1   1   1   0
#       2   1   1   1   1
#       3   0   1   1   1

This is not exactly as you asked for, but I imagine you may prefer this to your desired output.

To get exactly what you are looking for, you can add an extra column which is equal to the value column above and then unstack the index that contains the values:

df2 = df1.stack().rename('value').reset_index()
df2['value2'] = df2.value
df2.set_index(['level_0', 'value']).drop('level_1', axis = 1).unstack(level = 1)

#         value2
#   value   A     B     C     D
# level_0               
#       1   A     B     C  None
#       2   A     B     C     D
#       3   None  B     C     D

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM