Clean way of slicing + stacking pandas dataframe

I have a pandas DataFrame, say df, which is 1099 rows by 33 columns. The original file has to be processed by another piece of software, but it is not in the proper format, so I'm trying to get it into the right format with pandas.

The problem is very simple: df consists of identifier columns (7 in the real case, only 3 in the following example), followed by the corresponding results for each month. To be clear, it looks like this:

A    B    C    date1result  date2result  date3result
a1   b1   c1       12           15           17
a2   b2   c3        5            8            3

But to be processed, it needs to have one line per result, with an added column for the date. For the given example, that would be

A    B    C      result       date  
a1   b1   c1       12         date1 
a1   b1   c1       15         date2
a1   b1   c1       17         date3
a2   b2   c3        5         date1
a2   b2   c3        8         date2
a2   b2   c3        3         date3

To be more precise, I have manually edited all the date column names. After read_excel they looked like '01/01/2015 0:00:00' or something like that, and I was unable to access them. (As a secondary question, does anyone know how to access columns imported from a date field in an .xlsx?) The date column names are now 2015_01, 2015_02, ..., 2015_12, 2016_01, ..., 2016_12, and the first 5 columns are 'Account', 'Customer Name', 'Postcode', 'segment' and 'Rep'. So I tried the following code:

core = df.loc[:,('Account','Customer Name','Postcode','segment','Rep')]

df_final=pd.Series([])
for year in [2015,2016]:
    for month in range(1, 13):
        label = "%i_%02i" % (year,month)
        date = []
        for i in range(core.shape[0]):
            date.append("01/%02i/%i"%(month,year))  
        df_date=pd.Series(date) # I don't know how to create this n x 1 df
        df_final = df_final.append(pd.concat([core, df[label], df_date], axis=1))

That roughly works, but the result is very messy: I get a df_final of shape (26376, 30), with the dates in the first column, then the results (but of course under the column name '2015_01'), then all the '2015_02' through '2016_12' columns filled with NaN, and finally my 'Account', 'Customer Name', 'Postcode', 'segment' and 'Rep' columns. Does anyone know how I could do such a "slicing + stacking" in a clean way?

Thank you very much.

Edit: it is roughly the reverse of this question: Stacking and shaping slices of DataFrame (pandas) without looping

I think you need melt:

df = pd.melt(df, id_vars=['A','B','C'], value_name='result', var_name='date')
print (df)
    A   B   C         date  result
0  a1  b1  c1  date1result      12
1  a2  b2  c3  date1result       5
2  a1  b1  c1  date2result      15
3  a2  b2  c3  date2result       8
4  a1  b1  c1  date3result      17
5  a2  b2  c3  date3result       3
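
To get the exact layout asked for in the question (rows grouped by identifier, date labels without the 'result' suffix), the melted frame can be cleaned up a little further. This is only a minimal sketch that rebuilds the sample data by hand; the names wide and long are illustrative and not from the question.

```python
import pandas as pd

# Sample data from the question, rebuilt by hand
wide = pd.DataFrame({
    'A': ['a1', 'a2'],
    'B': ['b1', 'b2'],
    'C': ['c1', 'c3'],
    'date1result': [12, 5],
    'date2result': [15, 8],
    'date3result': [17, 3],
})

long = pd.melt(wide, id_vars=['A', 'B', 'C'], value_name='result', var_name='date')

# Drop the 'result' suffix so the labels read date1, date2, date3
long['date'] = long['date'].str.replace('result', '', regex=False)

# Group the rows of each identifier together, in date order,
# and put the columns in the order shown in the question
long = (long.sort_values(['A', 'B', 'C', 'date'])
            .reset_index(drop=True)[['A', 'B', 'C', 'result', 'date']])
print(long)
```

Note that the sort on date is a plain string sort here, so a label like date10 would come before date2; converting the labels to real dates first avoids that.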

And then convert the date column with to_datetime:

print (df)
    A   B   C  2015_01  2016_10  2016_12
0  a1  b1  c1       12       15       17
1  a2  b2  c3        5        8        3

df = pd.melt(df, id_vars=['A','B','C'], value_name='result', var_name='date')
df.date = pd.to_datetime(df.date, format='%Y_%m')
print (df)
    A   B   C       date  result
0  a1  b1  c1 2015-01-01      12
1  a2  b2  c3 2015-01-01       5
2  a1  b1  c1 2016-10-01      15
3  a2  b2  c3 2016-10-01       8
4  a1  b1  c1 2016-12-01      17
5  a2  b2  c3 2016-12-01       3
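
Applied to the real column names from the question, the same two steps could look like the sketch below. The frame here is a small stand-in with made-up values and only two month columns, and it assumes the five identifier columns named in the question; any column not listed in id_vars is treated as a monthly result column. The date_text column is an optional extra, not something the question asks for.

```python
import pandas as pd

# Stand-in for the real data: made-up values, only two month columns
df = pd.DataFrame({
    'Account': [1001, 1002],
    'Customer Name': ['Alpha', 'Beta'],
    'Postcode': ['AB1', 'CD2'],
    'segment': ['s1', 's2'],
    'Rep': ['r1', 'r2'],
    '2015_01': [12, 5],
    '2015_02': [15, 8],
})

id_cols = ['Account', 'Customer Name', 'Postcode', 'segment', 'Rep']

# Every column not in id_cols becomes one (date, result) row per record
df_final = pd.melt(df, id_vars=id_cols, var_name='date', value_name='result')
df_final['date'] = pd.to_datetime(df_final['date'], format='%Y_%m')

# Optional: also keep the dates as text in the 01/MM/YYYY form
# that the loop in the question was building
df_final['date_text'] = df_final['date'].dt.strftime('01/%m/%Y')
print(df_final)
```

With the 24 month columns and 1099 rows described in the question, this should give 26376 rows, the same row count as the loop, but without the NaN-filled month columns.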
