
reformat a pandas dataframe

I've been working on this for a while. I have a dataframe that looks like this:

tables      columns
tab1        col001
tab1        col002    
tab1        col003 
tab2        col01 
tab2        col02  
tab2        col03 

The real one has 1,500 tables in total, some column names are duplicated, and the entire thing is 80,000 rows by 2 columns. I am trying to get it formatted like this:

tables      columns
tab1        col001,col002,col003
tab2        col01,col02,col03 

I tried a crosstab like so

cross_table = pd.crosstab(df['tables'], 
                      df['columns']).fillna('n/a')

But that's not what I am going for: every column ends up holding 1s and 0s, and the result is a large sparse matrix.
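To illustrate why crosstab does not fit here, the following is a minimal sketch on the six-row sample from this question (the variable name `counts` is only illustrative). It shows that crosstab builds one output column per distinct column name and fills it with counts, which is why the real data turns into a wide, mostly-zero matrix.

```python
import pandas as pd

# Six-row sample mirroring the question's data
df = pd.DataFrame({
    "tables": ["tab1", "tab1", "tab1", "tab2", "tab2", "tab2"],
    "columns": ["col001", "col002", "col003", "col01", "col02", "col03"],
})

# crosstab counts co-occurrences: one row per table,
# one output column per distinct value of df["columns"]
counts = pd.crosstab(df["tables"], df["columns"])

print(counts.shape)  # (2, 6)
print(counts)
```

With 1,500 tables and many distinct column names, the same call would produce a frame with one column per distinct name, almost all of it zeros.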

I also tried this, but the error about allocating 2 GiB makes me think it is incorrect:

df.pivot(columns=['tables', 'columns'], values=['columns'])

I also tried pandas melt but that doesn't seem right either

then I tried to cast the columns to a list like so

cols = list(df['columns'].unique())

df['cols'] = df['columns'].str.findall(f'({"|".join(cols)})')

I tried that because it had worked before for extracting text in a different context, but as written it just splits each column name into individual characters.

SETUP:

import pandas as pd

df = pd.DataFrame({'tables': {0: 'tab1', 1: 'tab1', 2: 'tab1', 3: 'tab2', 4: 'tab2', 5: 'tab2'},
 'columns': {0: 'col001',
  1: 'col002',
  2: 'col003',
  3: 'col01',
  4: 'col02',
  5: 'col03'}})

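Each option below is an alternative that starts from the original df in SETUP (every one of them overwrites df).

1. via groupby :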

df = df.groupby('tables').agg(', '.join).reset_index() # Almost same as the answer in the post's comment section via @Psidom 

2. via pivot_table :

df = df.pivot_table(index = 'tables', values = 'columns', aggfunc = ', '.join).reset_index()

3. via list comprehension :

df = pd.DataFrame([(i, ', '.join(df[df['tables'] == i]['columns']))
                   for i in df['tables'].unique()], columns=df.columns)

4. via set_index/unstack :

df = df.set_index('tables', append = True).unstack(0).apply(lambda x: ', '.join(x.dropna()), 1).reset_index(name = 'columns')

5. via pd.get_dummies :

df = pd.get_dummies(df.tables).mul(df['columns'], 0).agg(', '.join).str.strip(
    ', ').reset_index(name='columns').rename(columns={'index': 'tables'})  # columns= is needed; a bare dict would rename row labels instead

OUTPUT:

  tables                 columns
0   tab1  col001, col002, col003
1   tab2     col01, col02, col03
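The question mentions duplicated column names and shows the target output joined with a bare comma. Below is a hedged sketch of option 1 adapted to that, assuming duplicates should be collapsed within each table. The repeated `col002` row is made-up sample data and the name `result` is only illustrative.

```python
import pandas as pd

# Sample data with one hypothetical duplicate row (tab1, col002)
df = pd.DataFrame({
    "tables": ["tab1", "tab1", "tab1", "tab1", "tab2", "tab2", "tab2"],
    "columns": ["col001", "col002", "col002", "col003", "col01", "col02", "col03"],
})

result = (
    df.drop_duplicates()                 # drop repeated (table, column) pairs
      .groupby("tables", sort=False)["columns"]
      .agg(",".join)                     # join without a space, as in the question
      .reset_index()
)

print(result)
```

If duplicates should instead be kept, dropping the `drop_duplicates()` call leaves the rest unchanged.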


 