Reformat a pandas DataFrame

Question

I've been working on this for a while. I have a dataframe that looks like this:

tables      columns
tab1        col001
tab1        col002    
tab1        col003 
tab2        col01 
tab2        col02  
tab2        col03

The real one has 1,500 tables in total, some column names are duplicated, and the entire thing is 80,000 rows by 2 columns. I am trying to get it formatted like this:

tables      columns
tab1        col001,col002,col003
tab2        col01,col02,col03

I tried a crosstab like so

cross_table = pd.crosstab(df['tables'], 
                      df['columns']).fillna('n/a')

But that's not what I am going for: it ends up with every column filled with 1s and 0s, and the result is a large sparse matrix.

I also tried this, but the error about allocating 2 GiB makes me think it is the wrong approach:

df.pivot(columns=['tables', 'columns'], values=['columns'])

I also tried pandas `melt`, but that doesn't seem right either.

Then I tried collecting the column names into a list and matching against them, like so:

cols = list(df['columns'].unique())

df['cols'] = df['columns'].str.findall(f'({"|".join(cols)})')

I tried that because it worked before for extracting text in a different context, but as written it just splits each column name into individual characters.

Answer 1

SETUP:

import pandas as pd

# Sample data matching the example in the question
df = pd.DataFrame({
    'tables': ['tab1', 'tab1', 'tab1', 'tab2', 'tab2', 'tab2'],
    'columns': ['col001', 'col002', 'col003', 'col01', 'col02', 'col03'],
})

1. via `groupby` :

df = df.groupby('tables').agg(', '.join).reset_index()  # nearly the same as @Psidom's suggestion in the comments on the question

2. via `pivot_table` :

df = df.pivot_table(index = 'tables', values = 'columns', aggfunc = ', '.join).reset_index()

3. via `list comprehension` :

df = pd.DataFrame([(i, ', '.join(df[df['tables'] == i]['columns']))
                   for i in df['tables'].unique()], columns=df.columns)

4. via `set_index`/`unstack` :

df = df.set_index('tables', append = True).unstack(0).apply(lambda x: ', '.join(x.dropna()), 1).reset_index(name = 'columns')

5. via `pd.get_dummies`

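# rename needs columns= here; without it, rename targets the row labels and the 'index' column keeps its name
df = pd.get_dummies(df.tables).mul(df['columns'], 0).agg(', '.join).str.strip(
    ', ').reset_index(name='columns').rename(columns={'index': 'tables'})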

OUTPUT:

  tables                 columns
0   tab1  col001, col002, col003
1   tab2     col01, col02, col03
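
The question asks for `col001,col002,col003` with no space after the comma, and mentions that some column names are duplicated. A minimal sketch of the `groupby` option adjusted for both points follows. The sample data (with one repeated pair) is made up for illustration, and it assumes that duplicates should be dropped and that `columns` contains no missing values:

```python
import pandas as pd

# Hypothetical sample data: tab1 lists col002 twice
df = pd.DataFrame({
    'tables': ['tab1', 'tab1', 'tab1', 'tab1', 'tab2', 'tab2', 'tab2'],
    'columns': ['col001', 'col002', 'col002', 'col003', 'col01', 'col02', 'col03'],
})

result = (
    df.drop_duplicates()                        # drop repeated (table, column) pairs
      .groupby('tables', sort=False)['columns'] # keep tables in order of first appearance
      .agg(','.join)                            # join with a bare comma, as in the question
      .reset_index()
)
print(result)
```

If the repeats should be kept, leaving out `drop_duplicates()` is enough. Selecting `['columns']` before `agg` keeps the join restricted to that one column, which matters if the real frame ever gains extra columns.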

solution1
3 ACCEPTED 2021-06-03 20:17:11
