Python datatable/pandas reshaping problem

Question

I need to reshape my df.

This is my input df:

import pandas as pd
import datatable as dt

DF_in = dt.Frame(name=['name1', 'name1', 'name1', 'name1', 'name2', 'name2', 'name2', 'name2'],
             date=['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08'],
             type=['a', 'b', 'a', 'b', 'b', 'a', 'b', 'a'],
             value=[1, 2, 3, 4, 5, 6, 7, 8])

   | name   date        type  value
-- + -----  ----------  ----  -----
 0 | name1  2021-01-01  a         1
 1 | name1  2021-01-02  b         2
 2 | name1  2021-01-03  a         3
 3 | name1  2021-01-04  b         4
 4 | name2  2021-01-05  b         5
 5 | name2  2021-01-06  a         6
 6 | name2  2021-01-07  b         7
 7 | name2  2021-01-08  a         8

This is the desired output df:

DF_out = dt.Frame(name=['name1', 'name1', 'name2', 'name2'],
              date_a=['2021-01-01', '2021-01-03', '2021-01-06', '2021-01-08'],
              date_b=['2021-01-02', '2021-01-04', '2021-01-07', None],
              value_a=[1, 3, 6, 8],
              value_b=[2, 4, 7, None])

   | name   date_a      date_b      value_a  value_b
-- + -----  ----------  ----------  -------  -------
 0 | name1  2021-01-01  2021-01-02        1        2
 1 | name1  2021-01-03  2021-01-04        3        4
 2 | name2  2021-01-06  2021-01-07        6        7
 3 | name2  2021-01-08  NA                8       NA

If necessary the datatable Frames can be converted into a pandas DataFrame:

DF_in = DF_in.to_pandas()

Transformation:

This is a grouped transformation. The grouping column is 'name'.
The df is already sorted.
The number of rows differs between groups and can be even or odd.
If the first row in a group has a 'b' in the column 'type', it has to be removed (example: row 4 in DF_in).
It is also possible that the last row in a group has an 'a' in the column 'type'; this row should not get lost (example: row 7 in DF_in).
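
The rules above can be sketched in plain Python, independent of either library. This is only an illustration of the intended pairing, not a proposed solution: the function name `pair_rows` and the tuple layout are assumptions made for the example, and it assumes that types strictly alternate within a group once a leading 'b' has been dropped, as in the sample data.

```python
from itertools import groupby

# Rows as (name, date, type, value), already sorted as in DF_in.
rows = [
    ("name1", "2021-01-01", "a", 1),
    ("name1", "2021-01-02", "b", 2),
    ("name1", "2021-01-03", "a", 3),
    ("name1", "2021-01-04", "b", 4),
    ("name2", "2021-01-05", "b", 5),
    ("name2", "2021-01-06", "a", 6),
    ("name2", "2021-01-07", "b", 7),
    ("name2", "2021-01-08", "a", 8),
]


def pair_rows(rows):
    """Pair each 'a' row with the 'b' row that follows it, per name group."""
    out = []
    for name, grp in groupby(rows, key=lambda r: r[0]):
        grp = list(grp)
        # A leading 'b' row in a group is dropped.
        if grp and grp[0][2] == "b":
            grp = grp[1:]
        # Walk the group two rows at a time; a trailing 'a' gets None on the 'b' side.
        for i in range(0, len(grp), 2):
            a = grp[i]
            b = grp[i + 1] if i + 1 < len(grp) else None
            out.append((name, a[1], b[1] if b else None, a[3], b[3] if b else None))
    return out


result = pair_rows(rows)
for row in result:
    print(row)
```

Each tuple in `result` is (name, date_a, date_b, value_a, value_b), matching the columns of DF_out.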

I hope this explanation is understandable.

Thank you in advance

Answer 1

Let us work with pandas DataFrames, so load the data first:

df = pd.DataFrame(dict(name=['name1', 'name1', 'name1', 'name1', 'name2', 'name2', 'name2', 'name2'],
             date=['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08'],
             type=['a', 'b', 'a', 'b', 'b', 'a', 'b', 'a'],
             value=[1, 2, 3, 4, 5, 6, 7, 8]))

Then we do the following steps:

drop any 'b' row that directly follows another 'b' row
assign the group number in column 'g'
pivot the table via set_index + unstack
rename the columns to the desired format
drop unneeded columns

import numpy as np

df1 = df[~((df['type'] == 'b') & (df['type'].shift() == 'b'))].copy()
df1['g'] = np.arange(len(df1))//2
df2 = df1.set_index(['g','type']).unstack(level=1)
df2.columns = ['_'.join(tup).rstrip('_') for tup in df2.columns.values]
df2.drop(columns = 'name_b').rename(columns = {'name_a':'name'})

output

    name    date_a      date_b      value_a value_b
g                   
0   name1   2021-01-01  2021-01-02  1.0     2.0
1   name1   2021-01-03  2021-01-04  3.0     4.0
2   name2   2021-01-06  2021-01-07  6.0     7.0
3   name2   2021-01-08  NaN         8.0     NaN
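
Note that the filter above removes row 4 only because the previous group happens to end in 'b', and `np.arange(len(df1))//2` pairs rows correctly only while every group except the last has an even number of rows. A per-group variant might look like the sketch below. It assumes that types alternate within each group once a leading 'b' is dropped; the names `pos`, `kept` and `wide` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(dict(
    name=['name1'] * 4 + ['name2'] * 4,
    date=['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
          '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08'],
    type=['a', 'b', 'a', 'b', 'b', 'a', 'b', 'a'],
    value=[1, 2, 3, 4, 5, 6, 7, 8]))

# Position of each row inside its own 'name' group (0, 1, 2, ...).
pos = df.groupby('name').cumcount()

# Drop a 'b' only when it is the first row of its group.
kept = df[~((df['type'] == 'b') & (pos == 0))].copy()

# Pair id counted per group, so a pair never spans two names.
kept['g'] = kept.groupby('name').cumcount() // 2

# Pivot 'type' into columns, then flatten the two-level column names.
wide = kept.set_index(['name', 'g', 'type'])[['date', 'value']].unstack('type')
wide.columns = ['_'.join(col) for col in wide.columns]
wide = wide.reset_index().drop(columns='g')
print(wide)
```

Keeping 'name' in the index also avoids the separate name_a/name_b columns that have to be cleaned up afterwards.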

Answer 2

datatable does not have reshaping functions that allow flipping between vertical and horizontal positions; as such, pandas is your best bet.

Below is my attempt at your challenge:

    from datatable import dt
    import pandas as pd

    df = DF_in.to_pandas()

    (df
     .assign(temp = df.index, # needed for ranking
             b_first = lambda df: df.groupby('name')['type'].transform('first'))
     .assign(temp = lambda df: df.groupby('name')['temp'].rank())
      # get rid of rows in groups where b is first
     .query('~(temp==1 and b_first=="b")')
      # needed to get unique values in index when pivoting
     .assign(temp = lambda df: df.groupby(['name','type']).cumcount())
     .pivot(index=['name','temp'], columns=['type'], values=['date','value'])
     .pipe(lambda df: df.set_axis(df.columns.to_flat_index(), axis='columns')
     .rename(columns = lambda df: "_".join(df)))
     .droplevel('temp')
     .reset_index()
      )

    name      date_a      date_b value_a value_b
0  name1  2021-01-01  2021-01-02       1       2
1  name1  2021-01-03  2021-01-04       3       4
2  name2  2021-01-06  2021-01-07       6       7
3  name2  2021-01-08         NaN       8     NaN

Summary:

Filter out the rows where 'b' is the first entry in the group.
To avoid an error from duplicate index entries when pivoting (reindexing), create a temporary cumcount column.

The rest relies on pivot and some name editing (the set_axis and rename functions). You can abstract a bit further with the pivot_wider function from pyjanitor:

    # pip install pyjanitor
    import janitor

    (df
     .assign(temp = df.index,
             b_first = lambda df: df.groupby('name')['type'].transform('first'))
     .assign(temp = lambda df: df.groupby('name')['temp'].rank())
     .query('~(temp==1 and b_first=="b")')
     .assign(temp = lambda df: df.groupby(['name','type']).cumcount())
     .pivot_wider(index=['name', 'temp'],
                  names_from=['type'],
                  values_from=['date','value'],
                  names_sep="_",
                  names_from_position='last')
     .drop(columns='temp')
    )
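
To see why the temporary cumcount column matters, here is a small sketch on the already-filtered sample rows. The frame `filtered` is typed in by hand for illustration and only carries the 'value' column. Pivoting on 'name' alone would fail because each name has several 'a' rows and several 'b' rows; with the counter, each ('name', 'temp') pair identifies exactly one output row.

```python
import pandas as pd

# Sample data after the leading 'b' of name2 has been removed.
filtered = pd.DataFrame({
    'name': ['name1'] * 4 + ['name2'] * 3,
    'type': ['a', 'b', 'a', 'b', 'a', 'b', 'a'],
    'value': [1, 2, 3, 4, 6, 7, 8],
})

# Number the occurrences of each type within each name: 0 for the first, 1 for the second, ...
filtered['temp'] = filtered.groupby(['name', 'type']).cumcount()
print(filtered)

# ('name', 'temp') is now unique per type, so pivot can place every value.
wide = filtered.pivot(index=['name', 'temp'], columns='type', values='value')
print(wide)
```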

Answer 3

Thank you all very much for your answers. In the meantime I developed a solution that uses only the datatable package, with some workarounds for its current limitations:

define a function to create id for adjacent rows: 1,1,2,2,...
create column id that contains row index
get id of rows to be deleted as list
subtract row id's to be deleted from all row id's
subset the Frame based on the remaining row id's
get number of rows per group
use the function for each group and use the number of rows as input, create a list with all results (same length as Frame after subset). Bind this to the Frame
create two subset Frames based on column type ('a' or 'b')
join df2 on df1

code:

import math
from datatable import dt, f, by, update, join

DF_in = dt.Frame(name=['name1', 'name1', 'name1', 'name1', 'name2', 'name2', 'name2', 'name2'],
                 date=['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08'],
                 type=['a', 'b', 'a', 'b', 'b', 'a', 'b', 'a'],
                 value=[1, 2, 3, 4, 5, 6, 7, 8])



def group_id(n):
    l = [x for x in range(0, math.floor(n / 2))]
    l = sorted(l * 2)
    if n % 2 != 0:
        try:
            l.append(l[-1] + 1)
        except IndexError:
            l.append(0)
    return l


DF_in['id'] = range(DF_in.nrows)
first_row = f.id==dt.min(f.id)
row_eq_b = dt.first(f.type)=="b"
remove_rows = first_row & row_eq_b
DF_in[:, update(remove_rows = ~remove_rows), by('name')]
DF_in = DF_in[f[-1]==1, :-1]
group_count = DF_in[:, {"Count": dt.count()}, by('name')][:, 'Count'].to_list()[0]
group_id_column = []

for x in group_count:
    group_id_column = group_id_column + group_id(x)

DF_in['group_id'] = dt.Frame(group_id_column)
df1 = DF_in[f.type == 'a', ['name', 'date', 'value', 'group_id']]
df2 = DF_in[f.type == 'b', ['name', 'date', 'value', 'group_id']]

df2.key = ['name', 'group_id']
DF_out = df1[:, :, join(df2)]
DF_out.names = {'date': 'date_a', 'value': 'value_a', 'date.0': 'date_b', 'value.0': 'value_b'}

DF_out[:, ['name', 'date_a', 'date_b', 'value_a', 'value_b']]

   | name   date_a      date_b      value_a  value_b
-- + -----  ----------  ----------  -------  -------
 0 | name1  2021-01-01  2021-01-02        1        2
 1 | name1  2021-01-03  2021-01-04        3        4
 2 | name2  2021-01-06  2021-01-07        6        7
 3 | name2  2021-01-08  NA                8       NA
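
As a side note, the `group_id` helper produces the sequence 0, 0, 1, 1, 2, ... cut off after `n` entries. Assuming that is all it is meant to do, integer division gives the same list more directly. `group_id_simple` below is a hypothetical name used only for this comparison.

```python
import math


def group_id(n):
    # Helper as given in the answer above, repeated so this snippet runs on its own.
    l = [x for x in range(0, math.floor(n / 2))]
    l = sorted(l * 2)
    if n % 2 != 0:
        try:
            l.append(l[-1] + 1)
        except IndexError:
            l.append(0)
    return l


def group_id_simple(n):
    # Row i belongs to pair number i // 2.
    return [i // 2 for i in range(n)]


# Compare both helpers for a handful of group sizes.
same = all(group_id(n) == group_id_simple(n) for n in range(20))
print(same)
print(group_id_simple(3))
```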

3 answers

solution1
1 2021-04-03 17:41:31

solution2
1 2021-04-03 23:41:05

solution3
1 ACCEPTED 2021-04-04 11:49:34
