
Python datatable/pandas reshaping problem

I need to reshape my df.

This is my input df:

import pandas as pd
import datatable as dt

DF_in = dt.Frame(name=['name1', 'name1', 'name1', 'name1', 'name2', 'name2', 'name2', 'name2'],
             date=['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08'],
             type=['a', 'b', 'a', 'b', 'b', 'a', 'b', 'a'],
             value=[1, 2, 3, 4, 5, 6, 7, 8])

   | name   date        type  value
-- + -----  ----------  ----  -----
 0 | name1  2021-01-01  a         1
 1 | name1  2021-01-02  b         2
 2 | name1  2021-01-03  a         3
 3 | name1  2021-01-04  b         4
 4 | name2  2021-01-05  b         5
 5 | name2  2021-01-06  a         6
 6 | name2  2021-01-07  b         7
 7 | name2  2021-01-08  a         8

This is the desired output df:

DF_out = dt.Frame(name=['name1', 'name1', 'name2', 'name2'],
              date_a=['2021-01-01', '2021-01-03', '2021-01-06', '2021-01-08'],
              date_b=['2021-01-02', '2021-01-04', '2021-01-07', None],
              value_a=[1, 3, 6, 8],
              value_b=[2, 4, 7, None])

   | name   date_a      date_b      value_a  value_b
-- + -----  ----------  ----------  -------  -------
 0 | name1  2021-01-01  2021-01-02        1        2
 1 | name1  2021-01-03  2021-01-04        3        4
 2 | name2  2021-01-06  2021-01-07        6        7
 3 | name2  2021-01-08  NA                8       NA

If necessary the datatable Frames can be converted into a pandas DataFrame:

DF_in = DF_in.to_pandas()

Transformation:

  • This is a grouped transformation. The grouping column is 'name'.
  • The df is already sorted
  • The number of rows differs between groups and can be even or odd
  • If the first row in a group has a 'b' in the column 'type' it has to be removed (example: row 4 in DF_in)
  • It is also possible that the last row in a group has an 'a' in the column 'type', this row should not get lost (example: row 7 in DF_in)

I hope this explanation is understandable.

Thank you in advance

Let us work with dataframes, so load the data first

df = pd.DataFrame(dict(name=['name1', 'name1', 'name1', 'name1', 'name2', 'name2', 'name2', 'name2'],
             date=['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08'],
             type=['a', 'b', 'a', 'b', 'b', 'a', 'b', 'a'],
             value=[1, 2, 3, 4, 5, 6, 7, 8]))

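Then we do the following steps:

  • drop every 'b' row that directly follows another 'b' row (in this data that is the leading 'b' of name2, because name1 ends with a 'b')
  • assign a pair number in column 'g', so that two consecutive rows share one number
  • pivot the table via set_index + unstack
  • rename the columns to the desired format
  • drop unneeded columns

import numpy as np

# drop a 'b' row whose predecessor is also a 'b'
df1 = df[~((df['type'] == 'b') & (df['type'].shift() == 'b'))].copy()
# pair number: rows 0,1 -> 0, rows 2,3 -> 1, ...
df1['g'] = np.arange(len(df1))//2
# one row per pair, one column block per type
df2 = df1.set_index(['g','type']).unstack(level=1)
# flatten the MultiIndex columns: ('date', 'a') -> 'date_a'
df2.columns = ['_'.join(tup).rstrip('_') for tup in df2.columns.values]
df2.drop(columns = 'name_b').rename(columns = {'name_a':'name'})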

output

    name    date_a      date_b      value_a value_b
g                   
0   name1   2021-01-01  2021-01-02  1.0     2.0
1   name1   2021-01-03  2021-01-04  3.0     4.0
2   name2   2021-01-06  2021-01-07  6.0     7.0
3   name2   2021-01-08  NaN         8.0     NaN
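
The shift-based filter compares each row with the one before it regardless of group, so it removes the leading 'b' of name2 only because name1 happens to end with a 'b'. Below is a minimal group-aware sketch of the same idea. It assumes that 'a' and 'b' rows alternate inside each group once a leading 'b' is dropped, and the names pos, kept, pair, wide and result are illustrative, not part of the answer:

```python
import pandas as pd

df = pd.DataFrame(dict(
    name=['name1', 'name1', 'name1', 'name1', 'name2', 'name2', 'name2', 'name2'],
    date=['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
          '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08'],
    type=['a', 'b', 'a', 'b', 'b', 'a', 'b', 'a'],
    value=[1, 2, 3, 4, 5, 6, 7, 8]))

# Position of each row inside its 'name' group: 0, 1, 2, ...
pos = df.groupby('name').cumcount()

# Drop a leading 'b' per group, independent of how the previous group ended.
kept = df[~((pos == 0) & (df['type'] == 'b'))].copy()

# Assumption: the n-th 'a' of a group belongs with the n-th 'b' of that group.
kept['pair'] = kept.groupby(['name', 'type']).cumcount()

# Long to wide: one row per (name, pair), one column per (field, type).
wide = kept.pivot(index=['name', 'pair'], columns='type', values=['date', 'value'])
wide.columns = ['_'.join(col) for col in wide.columns]  # ('date', 'a') -> 'date_a'

result = wide.reset_index().drop(columns='pair')
print(result)
```

A trailing 'a' without a partner keeps its row, and its date_b and value_b are filled with NaN.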

datatable does not have reshaping functions that allow flipping between vertical and horizontal positions; as such, pandas is your best bet.

Below is my attempt at your challenge:

    from datatable import dt
    import pandas as pd

    df = DF_in.to_pandas()

    (df
     .assign(temp = df.index, # needed for ranking
             b_first = lambda df: df.groupby('name')['type'].transform('first'))
     .assign(temp = lambda df: df.groupby('name')['temp'].rank())
      # get rid of rows in groups where b is first
     .query('~(temp==1 and b_first=="b")')
      # needed to get unique values in index when pivoting
     .assign(temp = lambda df: df.groupby(['name','type']).cumcount())
     .pivot(index=['name','temp'], columns='type', values=['date','value'])
     .pipe(lambda df: df.set_axis(df.columns.to_flat_index(), axis='columns')
     .rename(columns = lambda df: "_".join(df)))
     .droplevel('temp')
     .reset_index()
      )

    name      date_a      date_b value_a value_b
0  name1  2021-01-01  2021-01-02       1       2
1  name1  2021-01-03  2021-01-04       3       4
2  name2  2021-01-06  2021-01-07       6       7
3  name2  2021-01-08         NaN       8     NaN

Summary:

  • Filter out the rows where 'b' is the first entry in the group

  • To avoid an error caused by duplicate index entries when pivoting (reindexing), create a temporary cumcount column

  • The rest relies on pivot and some name editing (the set_axis and rename functions). You can abstract a bit further with the pivot_wider function from pyjanitor:

    # pip install pyjanitor
    import janitor

    (df
     .assign(temp = df.index,
             b_first = lambda df: df.groupby('name')['type'].transform('first'))
     .assign(temp = lambda df: df.groupby('name')['temp'].rank())
     .query('~(temp==1 and b_first=="b")')
     .assign(temp = lambda df: df.groupby(['name','type']).cumcount())
     .pivot_wider(index=['name', 'temp'],
                  names_from=['type'],
                  values_from=['date','value'],
                  names_sep="_",
                  names_from_position='last')
     .drop(columns='temp')
    )

Thank you all very much for your answers. In the meantime I developed a solution that uses only the datatable package and relies on some workarounds for its current limitations:

  1. define a function to create an id for adjacent rows: 0,0,1,1,...
  2. create column id that contains row index
  3. get id of rows to be deleted as list
  4. subtract row id's to be deleted from all row id's
  5. subset the Frame based on the remaining row id's
  6. get number of rows per group
  7. use the function for each group and use the number of rows as input, create a list with all results (same length as Frame after subset). Bind this to the Frame
  8. create two subset Frames based on column type ('a' or 'b')
  9. join df2 on df1

code:

import math
from datatable import dt, f, by, update, join

DF_in = dt.Frame(name=['name1', 'name1', 'name1', 'name1', 'name2', 'name2', 'name2', 'name2'],
                 date=['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08'],
                 type=['a', 'b', 'a', 'b', 'b', 'a', 'b', 'a'],
                 value=[1, 2, 3, 4, 5, 6, 7, 8])



def group_id(n):
    l = [x for x in range(0, math.floor(n / 2))]
    l = sorted(l * 2)
    if n % 2 != 0:
        try:
            l.append(l[-1] + 1)
        except IndexError:
            l.append(0)
    return l


DF_in['id'] = range(DF_in.nrows)
first_row = f.id==dt.min(f.id)
row_eq_b = dt.first(f.type)=="b"
remove_rows = first_row & row_eq_b
DF_in[:, update(remove_rows = ~remove_rows), 'name']
DF_in = DF_in[f[-1]==1, :-1]
group_count = DF_in[:, {"Count": dt.count()}, by('name')][:, 'Count'].to_list()[0]
group_id_column = []

for x in group_count:
    group_id_column = group_id_column + group_id(x)

DF_in['group_id'] = dt.Frame(group_id_column)
df1 = DF_in[f.type == 'a', ['name', 'date', 'value', 'group_id']]
df2 = DF_in[f.type == 'b', ['name', 'date', 'value', 'group_id']]

df2.key = ['name', 'group_id']
DF_out = df1[:, :, join(df2)]
DF_out.names = {'date': 'date_a', 'value': 'value_a', 'date.0': 'date_b', 'value.0': 'value_b'}

DF_out[:, ['name', 'date_a', 'date_b', 'value_a', 'value_b']]

   | name   date_a      date_b      value_a  value_b
-- + -----  ----------  ----------  -------  -------
 0 | name1  2021-01-01  2021-01-02        1        2
 1 | name1  2021-01-03  2021-01-04        3        4
 2 | name2  2021-01-06  2021-01-07        6        7
 3 | name2  2021-01-08  NA                8       NA
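
The group_id helper in the code above returns 0, 0, 1, 1, ... for n rows, with a single trailing id when n is odd. As a small sketch, the same list can be produced with integer division. The name group_id_short is hypothetical and only used here for comparison:

```python
import math


def group_id(n):
    # Helper exactly as posted in the answer above.
    l = [x for x in range(0, math.floor(n / 2))]
    l = sorted(l * 2)
    if n % 2 != 0:
        try:
            l.append(l[-1] + 1)
        except IndexError:
            l.append(0)
    return l


def group_id_short(n):
    # Rows 0 and 1 get id 0, rows 2 and 3 get id 1, and so on.
    return [i // 2 for i in range(n)]


# Both helpers agree for small even and odd group sizes, including 0 and 1.
for n in range(10):
    assert group_id(n) == group_id_short(n)

print(group_id_short(5))  # [0, 0, 1, 1, 2]
```

With this variant the try/except branch for a one-row group is no longer needed.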
