
Reformat a pandas dataframe

I've been working on this for a while. I have a dataframe that looks like this:

tables      columns
tab1        col001
tab1        col002    
tab1        col003 
tab2        col01 
tab2        col02  
tab2        col03 

The real one has 1,500 tables in total, some column names are duplicated, and the whole thing is 80,000 rows by 2 columns. I am trying to get it formatted like this:

tables      columns
tab1        col001,col002,col003
tab2        col01,col02,col03 

I tried a crosstab like so:

cross_table = pd.crosstab(df['tables'], 
                      df['columns']).fillna('n/a')

But that's not what I am going for: it ends up with every column name as its own column of 1's and 0's, which is a large sparse matrix.
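To make the crosstab behaviour concrete, here is a minimal sketch on the six-row sample from the question (the frame is rebuilt here so the snippet runs on its own). It only illustrates the shape problem; it is not a fix.

```python
import pandas as pd

# Six-row sample taken from the question
df = pd.DataFrame({
    'tables':  ['tab1', 'tab1', 'tab1', 'tab2', 'tab2', 'tab2'],
    'columns': ['col001', 'col002', 'col003', 'col01', 'col02', 'col03'],
})

# crosstab counts co-occurrences: one row per table and
# one output column per distinct value in df['columns']
cross_table = pd.crosstab(df['tables'], df['columns'])

print(cross_table.shape)  # (number of tables, number of distinct column names)
print(cross_table)
```

On the real data this would produce one output column for every distinct column name, almost all of it zeros, which is why the result is so large and sparse.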

I also tried this, but it fails with an error about allocating 2 GiB, which makes me think it is the wrong approach:

df.pivot(columns=['tables', 'columns'], values=['columns'])

I also tried pandas melt, but that doesn't seem right either.

Then I tried to cast the columns to a list like so:

cols = list(df['columns'].unique())

df['cols'] = df['columns'].str.findall(f'({"|".join(cols)})')

I tried that because it had worked before for extracting text in a different context, but as written it just splits each column name into individual characters.

SETUP:

import pandas as pd

df = pd.DataFrame({'tables': {0: 'tab1', 1: 'tab1', 2: 'tab1', 3: 'tab2', 4: 'tab2', 5: 'tab2'},
 'columns': {0: 'col001',
  1: 'col002',
  2: 'col003',
  3: 'col01',
  4: 'col02',
  5: 'col03'}})

1. via groupby:

df = df.groupby('tables').agg(', '.join).reset_index() # Almost same as the answer in the post's comment section via @Psidom 

2. via pivot_table:

df = df.pivot_table(index = 'tables', values = 'columns', aggfunc = ', '.join).reset_index()

3. via list comprehension:

df = pd.DataFrame([(i, ', '.join(df[df['tables'] == i]['columns']))
                   for i in df['tables'].unique()], columns=df.columns)

4. via set_index/unstack:

df = df.set_index('tables', append=True).unstack(0).apply(lambda x: ', '.join(x.dropna()), axis=1).reset_index(name='columns')

5. via pd.get_dummies:

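# dtype=int keeps the 0/1 indicators numeric so that multiplying by a string
# yields either the string itself or ''; rename needs columns= to rename the
# 'index' column (without it, row labels are renamed instead)
df = pd.get_dummies(df.tables, dtype=int).mul(df['columns'], axis=0).agg(', '.join).str.strip(
    ', ').reset_index(name='columns').rename(columns={'index': 'tables'})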

OUTPUT:

  tables                 columns
0   tab1  col001, col002, col003
1   tab2     col01, col02, col03
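The question asks for a comma separator without a space and mentions duplicated column names, without saying whether a name can repeat within a single table. A hedged sketch of the groupby route under that assumption (the extra `col002` row in the sample and the `deduped` and `result` names are illustrative, not from the original post):

```python
import pandas as pd

# Sample from the question, plus one repeated (tab1, col002) row
# added here purely to illustrate duplicate handling
df = pd.DataFrame({
    'tables':  ['tab1', 'tab1', 'tab1', 'tab1', 'tab2', 'tab2', 'tab2'],
    'columns': ['col001', 'col002', 'col002', 'col003', 'col01', 'col02', 'col03'],
})

# Assumption: a (table, column) pair that appears twice should be listed once.
# Names shared by different tables are kept, because the pair differs.
deduped = df.drop_duplicates()

result = (
    deduped.groupby('tables', sort=False)['columns']  # sort=False keeps first-seen table order
           .agg(','.join)                             # join each table's names with a bare comma
           .reset_index()
)
print(result)
```

Selecting `['columns']` before `agg` makes it explicit which column is being joined, and one pass over the grouped data should be comfortable at 80,000 rows. If repeated names within a table are meaningful, drop the `drop_duplicates()` step.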
