Reformat a pandas dataframe
I've been working on this for a while. I have a dataframe that looks like this:
tables columns
tab1 col001
tab1 col002
tab1 col003
tab2 col01
tab2 col02
tab2 col03
The real one has 1,500 tables in total, some column names are duplicated, and the whole thing is 80,000 rows by 2 columns. I am trying to get it formatted like this:
tables columns
tab1 col001,col002,col003
tab2 col01,col02,col03
I tried a crosstab like so:
cross_table = pd.crosstab(df['tables'],
df['columns']).fillna('n/a')
But that's not what I am going for: it ends up with every column as 1s and 0s and is a large sparse matrix.
I also tried this, but the error about allocating 2 GiB makes me think it is incorrect:
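A minimal sketch of why the crosstab comes out as a wide 0/1 matrix, using the six-row sample from this post (the variable names are only illustrative). `pd.crosstab` counts how often each pair occurs, so it produces one row per table and one column per distinct column name rather than one joined string per table:

```python
import pandas as pd

# Six-row sample mirroring the data shown in the question.
df = pd.DataFrame({
    "tables": ["tab1", "tab1", "tab1", "tab2", "tab2", "tab2"],
    "columns": ["col001", "col002", "col003", "col01", "col02", "col03"],
})

# One row per table, one column per distinct column name, cells are counts.
cross_table = pd.crosstab(df["tables"], df["columns"])
print(cross_table.shape)
print(cross_table)
```

On the real data the same call would give a frame with one row per table and one column per unique column name, with most cells equal to 0.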
df.pivot(columns=['tables', 'columns'], values=['columns'])
I also tried pandas melt, but that doesn't seem right either.
Then I tried to cast the columns to a list like so:
cols = list(df['columns'].unique())
df['cols'] = df['columns'].str.findall(f'({"|".join(cols)})')
I tried that because it worked before for extracting text in a different context, but as written it just splits each column name into individual characters.
df = pd.DataFrame({'tables': {0: 'tab1', 1: 'tab1', 2: 'tab1', 3: 'tab2', 4: 'tab2', 5: 'tab2'},
'columns': {0: 'col001',
1: 'col002',
2: 'col003',
3: 'col01',
4: 'col02',
5: 'col03'}})
groupby:
df = df.groupby('tables').agg(', '.join).reset_index()  # almost the same as the answer in the post's comment section via @Psidom
pivot_table:
df = df.pivot_table(index='tables', values='columns', aggfunc=', '.join).reset_index()
list comprehension:
df = pd.DataFrame([(i, ', '.join(df[df['tables'] == i]['columns']))
                   for i in df['tables'].unique()], columns=df.columns)
set_index/unstack:
df = df.set_index('tables', append=True).unstack(0).apply(lambda x: ', '.join(x.dropna()), axis=1).reset_index(name='columns')
pd.get_dummies:
df = pd.get_dummies(df.tables).mul(df['columns'], 0).agg(', '.join).str.strip(
    ', ').reset_index(name='columns').rename(columns={'index': 'tables'})  # columns= is needed; a bare mapping renames index labels, not the column
tables columns
0 tab1 col001, col002, col003
1 tab2 col01, col02, col03
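For reference, here is a self-contained sketch of the groupby option that joins with a bare comma, matching the exact format asked for in the question. The names `result`, `with_repeat`, and `deduped` are illustrative, and the de-duplicating variant assumes that "duplicated column names" means the same name can appear more than once within one table, which the question does not state explicitly:

```python
import pandas as pd

# Six-row sample mirroring the data shown in the question.
df = pd.DataFrame({
    "tables": ["tab1", "tab1", "tab1", "tab2", "tab2", "tab2"],
    "columns": ["col001", "col002", "col003", "col01", "col02", "col03"],
})

# sort=False keeps the tables in order of first appearance.
result = (
    df.groupby("tables", sort=False)["columns"]
      .agg(",".join)
      .reset_index()
)
print(result)

# Assumed variant: drop repeated names within a table, keeping first-seen order.
with_repeat = pd.concat([df, df.iloc[[0]]], ignore_index=True)
deduped = (
    with_repeat.groupby("tables", sort=False)["columns"]
               .agg(lambda s: ",".join(dict.fromkeys(s)))
               .reset_index()
)
print(deduped)
```

Selecting the `columns` Series before aggregating means only that one column is joined, so any other columns in the real frame are left out of the result.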