简体   繁体   中英

Reshape Pandas dataframe columns by block of N columns

I have 1 dataframe where blocks of columns need to be reshaped to rows. I tried to use stack() and melt() but could not manage to find the right way.

Here is an example of what I expect:

data = {'id':['a1', 'a2', 'a3', 'a4'], 
        'year':[20, 20, 19, 18],
        'b_A': [1, 2, 3, 4],
        'b_B': [5, 6, 7, 8],
        'b_C': [9, 10, 11, 12],
        'c_A': [13, 14, 15, 16],
        'c_B': [17, 18, 19, 20],
        'c_C': [21, 22, 23, 24],
        'd_A': [25, 26, 27, 28],
        'd_B': [29, 30, 31, 32],
        'd_C': [33, 34, 35, 36],
        } 
  
df = pd.DataFrame(data) 
    id  year   b_A  b_B b_C c_A c_B c_C d_A d_B d_C
0   a1  20     1    5   9   13  17  21  25  29  33
1   a2  20     2    6   10  14  18  22  26  30  34
2   a3  19     3    7   11  15  19  23  27  31  35
3   a4  18     4    8   12  16  20  24  28  32  36

The expected result should be:

    id  year    origin  A   B   C
0   a1  20      b       1   5   9
1   a1  20      c       13  17  21
2   a1  20      d       25  29  33
3   a2  20      b       2   6   10
4   a2  20      c       14  18  22
5   a2  20      d       26  30  34
6   a3  19      b       3   7   11
7   a3  19      c       15  19  23
8   a3  19      d       27  31  35
9   a4  18      b       4   8   12
10  a4  18      c       16  20  24
11  a4  18      d       28  32  36

Thanks for your time and help.

You can convert non columns names with _ to index by DataFrame.set_index , then splitting columns by Series.str.split and reshape by DataFrame.stack :

df1 = df.set_index(['id','year'])
df1.columns = df1.columns.str.split('_', expand=True)
df1 = df1.stack(level=0).reset_index()
print (df1)
    id  year level_2   A   B   C
0   a1    20       b   1   5   9
1   a1    20       c  13  17  21
2   a1    20       d  25  29  33
3   a2    20       b   2   6  10
4   a2    20       c  14  18  22
5   a2    20       d  26  30  34
6   a3    19       b   3   7  11
7   a3    19       c  15  19  23
8   a3    19       d  27  31  35
9   a4    18       b   4   8  12
10  a4    18       c  16  20  24
11  a4    18       d  28  32  36

If need also set column origin is possible use DataFrame.rename_axis :

df1 = df.set_index(['id','year'])
df1.columns = df1.columns.str.split('_', expand=True)
df1 = df1.rename_axis(['origin',None], axis=1).stack(0).reset_index()
print (df1)
    id  year origin   A   B   C
0   a1    20      b   1   5   9
1   a1    20      c  13  17  21
2   a1    20      d  25  29  33
3   a2    20      b   2   6  10
4   a2    20      c  14  18  22
5   a2    20      d  26  30  34
6   a3    19      b   3   7  11
7   a3    19      c  15  19  23
8   a3    19      d  27  31  35
9   a4    18      b   4   8  12
10  a4    18      c  16  20  24
11  a4    18      d  28  32  36

Or use wide_to_long with change order of values with _ like A_b to b_A :

df.columns = [f'{"_".join(x[::-1])}' for x in df.columns.str.split('_')]
df1 = pd.wide_to_long(df, 
                      stubnames=['A','B','C'],
                      i=['id','year'], 
                      j='origin', 
                      sep='_',
                      suffix=r'\w+').reset_index()
print (df1)
    id  year origin   A   B   C
0   a1    20      b   1   5   9
1   a1    20      c  13  17  21
2   a1    20      d  25  29  33
3   a2    20      b   2   6  10
4   a2    20      c  14  18  22
5   a2    20      d  26  30  34
6   a3    19      b   3   7  11
7   a3    19      c  15  19  23
8   a3    19      d  27  31  35
9   a4    18      b   4   8  12
10  a4    18      c  16  20  24
11  a4    18      d  28  32  36

You could also use pivot_longer function from pyjanitor ; at the moment you have to install the latest development version from github :

 # install latest dev version
# pip install git+https://github.com/ericmjl/pyjanitor.git
 import janitor

df.pivot_longer(index=["id", "year"], 
                names_to=("origin", ".value"), 
                names_sep="_")

    id  year    origin  A   B   C
0   a1  20  b   1   5   9
1   a2  20  b   2   6   10
2   a3  19  b   3   7   11
3   a4  18  b   4   8   12
4   a1  20  c   13  17  21
5   a2  20  c   14  18  22
6   a3  19  c   15  19  23
7   a4  18  c   16  20  24
8   a1  20  d   25  29  33
9   a2  20  d   26  30  34
10  a3  19  d   27  31  35
11  a4  18  d   28  32  36

The names_sep value splits the columns; the split values that pair with .value remain as column headers, while the other values are lumped underneath the origin column.

if you want the data in order of appearance, you can use the sort_by_appearance parameter:

df.pivot_longer(
    index=["id", "year"],
    names_to=("origin", ".value"),
    names_sep="_",
    sort_by_appearance=True,
)


    id  year    origin  A   B   C
0   a1  20  b   1   5   9
1   a1  20  c   13  17  21
2   a1  20  d   25  29  33
3   a2  20  b   2   6   10
4   a2  20  c   14  18  22
5   a2  20  d   26  30  34
6   a3  19  b   3   7   11
7   a3  19  c   15  19  23
8   a3  19  d   27  31  35
9   a4  18  b   4   8   12
10  a4  18  c   16  20  24
11  a4  18  d   28  32  36

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM