简体   繁体   中英

Looking for an efficient way to reorganize a pandas Dataframe

I have data in a Dataframe structured like so...

       D           K         Factor        p
0   -0.483128   -1.240024  -1.214765   -1.002418
1   -0.692334   -1.632132   1.562630    0.997304
2   -1.189383   -1.632132   1.562630    0.997304
3   -1.691841   -1.632132   1.562630    0.997304
4   -2.084926   -1.632132   1.562630    0.997304

I'm trying to reorganize the data into a new structure where each row contains 'period' number of rows from the existing data. Moves ahead one row in existing and the stacks the next 'period lines'.

the function so far:

def prepData(seq, period):
    newStacks = pd.DataFrame()
    for pos in range(0, len(seq) - (period+1), 1):
        chunk = (seq[pos:pos + period])
        stack = []
        for sliver in range(0, len(chunk), 1):
            piece = (chunk.iloc[sliver:])
            print(piece)
            stack.append(piece) 
        
        newStacks.append(chunk)

return newStacks

this obviously is not efficient and aside doesn't produce the desire structure. The aim is to get a structure like below considering period = 3

0   -0.483128   -1.240024   -1.214765   -1.002418   -0.692334   -1.632132   1.562630    0.997304 -1.189383  -1.632132   1.562630    0.997304
1   -0.692334   -1.632132   1.562630    0.997304    -1.189383   -1.632132   1.562630    0.997304 -1.691841  -1.632132   1.562630    0.997304
2   -1.189383   -1.632132   1.562630    0.997304    -1.691841   -1.632132   1.562630    0.997304 -2.084926  -1.632132   1.562630    0.997304

AA Simple way to accomplish this would be appreciated.

I am not sure if you want to create a dataframe or a list. Here's the code to get you both.

import pandas as pd
c = ['D','K','Factor','p']
d = [[-0.483128,   -1.240024,  -1.214765,   -1.002418],
[-0.692334,   -1.632132,   1.562630,    0.997304],
[-1.189383,   -1.632132,   1.562630,    0.997304],
[-1.691841,   -1.632132,   1.562630,    0.997304],
[-2.084926,   -1.632132,   1.562630,    0.997304]]
df = pd.DataFrame(d,columns=c)
print (df)

p = 3 #this is the period you wanted. I set it to 3
stack_list = []   #this will store the final stacked list
                  #note: don't use stack, its used by pandas to stack

for i in range(len(df)-p+1):  #iterating thru the dataframe
    # convert p rows to a list after you stack them
    chunk = df.loc[i:i+p-1].stack().reset_index(level=1,drop=True).tolist()

    stack_list.append(chunk)   #store chunk to stack_list


df1 = pd.DataFrame(stack_list)   #creating a dataframe as per your request

#printing both stack_list and dataframe

print (stack_list)
print (df1)

The output of this will be:

The original dataframe is:

          D         K    Factor         p
0 -0.483128 -1.240024 -1.214765 -1.002418
1 -0.692334 -1.632132  1.562630  0.997304
2 -1.189383 -1.632132  1.562630  0.997304
3 -1.691841 -1.632132  1.562630  0.997304
4 -2.084926 -1.632132  1.562630  0.997304

The stacked list is:

[[-0.483128, -1.240024, -1.214765, -1.002418, -0.692334, -1.632132, 1.56263, 0.997304, -1.189383, -1.632132, 1.56263, 0.997304], 
 [-0.692334, -1.632132, 1.56263, 0.997304, -1.189383, -1.632132, 1.56263, 0.997304, -1.691841, -1.632132, 1.56263, 0.997304], 
 [-1.189383, -1.632132, 1.56263, 0.997304, -1.691841, -1.632132, 1.56263, 0.997304, -2.084926, -1.632132, 1.56263, 0.997304]]

The new dataframe that you wanted to get created is:

         0         1         2   ...        9        10        11
0 -0.483128 -1.240024 -1.214765  ... -1.632132  1.56263  0.997304
1 -0.692334 -1.632132  1.562630  ... -1.632132  1.56263  0.997304
2 -1.189383 -1.632132  1.562630  ... -1.632132  1.56263  0.997304

If there are all floats columns for improve performance use strides in numpy with reshape 3d array to 2d and pass to DataFrame constructor:

#https://stackoverflow.com/a/44306231/2901002 a bit changed
def strided_lastaxis(a, L):
    s0,s1 = a.strides
    m,n = a.shape
    return np.lib.stride_tricks.as_strided(a, shape=(m-L+1,L,n), strides=(s0,s0,s1))

a = strided_lastaxis(df.to_numpy(), 3)

df1 = pd.DataFrame(a.reshape(a.shape[0], -1))
print (df1)
         0         1         2         3         4         5        6   \
0 -0.483128 -1.240024 -1.214765 -1.002418 -0.692334 -1.632132  1.56263   
1 -0.692334 -1.632132  1.562630  0.997304 -1.189383 -1.632132  1.56263   
2 -1.189383 -1.632132  1.562630  0.997304 -1.691841 -1.632132  1.56263   

         7         8         9        10        11  
0  0.997304 -1.189383 -1.632132  1.56263  0.997304  
1  0.997304 -1.691841 -1.632132  1.56263  0.997304  
2  0.997304 -2.084926 -1.632132  1.56263  0.997304  

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM