简体   繁体   中英

How to convert long data format to wide data format in pandas

I have data in the long format and would like to convert as wide format with sparse data:

df = pd.DataFrame({'id':[1000,1001,1000,1002,1003,1000],
                   'Date':[26018,26017,26016,26010,26009,26009]})

输入数据

And it is to be converted as wide format by assigning ones:

输出数据

Idea is add helper column and pivoting by DataFrame.pivot , add missing values by DataFrame.reindex and last convert misisng values to 0 with casting to integers:

df1 = (df.assign(new = 1)
         .pivot('id','Date','new')
         .reindex(range(df['Date'].min(), df['Date'].max() + 1), axis=1)
         .fillna(0)
         .astype(int))
print (df1)
Date  26009  26010  26011  26012  26013  26014  26015  26016  26017  26018
id                                                                        
1000      1      0      0      0      0      0      0      1      0      1
1001      0      0      0      0      0      0      0      0      1      0
1002      0      1      0      0      0      0      0      0      0      0
1003      1      0      0      0      0      0      0      0      0      0

If get:

ValueError: Index contains duplicate entries, cannot reshape

it means there are duplicates like in last row of changed sample data. Then add DataFrame.drop_duplicates :

df = pd.DataFrame({'id':[1000,1001,1000,1002,1003,1000, 1000],
                   'Date':[26018,26017,26016,26010,26009,26009, 26009]})


df1 = (df.assign(new = 1)
         .drop_duplicates(subset=['id','Date'])
         .pivot('id','Date','new')
         .reindex(range(df['Date'].min(), df['Date'].max() + 1), axis=1)
         .fillna(0)
         .astype(int))
print (df1)
Date  26009  26010  26011  26012  26013  26014  26015  26016  26017  26018
id                                                                        
1000      1      0      0      0      0      0      0      1      0      1
1001      0      0      0      0      0      0      0      0      1      0
1002      0      1      0      0      0      0      0      0      0      0
1003      1      0      0      0      0      0      0      0      0      0

Notice: My solution not counts rows, only return 1 if exist intersection id and Date else 0 .

You can use crosstab in case you are counting the frequency also

df1 = pd.crosstab(df['id'], df['Date']).reindex(range(df['Date'].min(), df['Date'].max() + 1), axis=1).fillna(0).astype(int) 

Output of df1:

Date  26009  26010  26011  26012  26013  26014  26015  26016  26017  26018
id                                                                        
1000      1      0      0      0      0      0      0      1      0      1
1001      0      0      0      0      0      0      0      0      1      0
1002      0      1      0      0      0      0      0      0      0      0
1003      1      0      0      0      0      0      0      0      0      0

You may take advantage of groupby.size and unstack . Finally, reindex as other solutions

ix = range(df.Date.min(), df.Date.max()+1)
df_final = (df.groupby(['id', 'Date']).size()
              .unstack(fill_value=0)
              .reindex(ix, axis=1, fill_value=0))

Out[205]:
Date  26009  26010  26011  26012  26013  26014  26015  26016  26017  26018
id
1000      1      0      0      0      0      0      0      1      0      1
1001      0      0      0      0      0      0      0      0      1      0
1002      0      1      0      0      0      0      0      0      0      0
1003      1      0      0      0      0      0      0      0      0      0

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM