Pandas - unstack column values into new columns

I have a large dataframe that stores a lot of redundant values, which makes the data hard to handle. The dataframe has the form:

import pandas as pd

df = pd.DataFrame([["a","g","n1","y1"], ["a","g","n2","y2"], ["b","h","n1","y3"], ["b","h","n2","y4"]], columns=["meta1", "meta2", "name", "data"])

>>> df

  meta1 meta2 name data
    a     g   n1   y1
    a     g   n2   y2
    b     h   n1   y3
    b     h   n2   y4

where the name column holds the names of the new columns I would like, and the data column holds the respective values.

I would like to produce a dataframe of the form:

df = pd.DataFrame([["a","g","y1","y2"], ["b","h","y3","y4"]], columns=["meta1", "meta2", "n1", "n2"])

>>> df

meta1 meta2  n1  n2
  a     g  y1  y2
  b     h  y3  y4

The meta columns stand in for 15+ other columns that contain most of the data, and I don't think they are particularly well suited for indexing. The idea is that I currently store a lot of repeated/redundant data in the meta columns, and I would like to produce the more compact dataframe shown above.

I have found some similar questions but can't pinpoint which operations I need: pivot, re-index, stack or unstack, etc.?

PS - the original index values are unimportant for my purposes.

Any help would be much appreciated.

Question I think is related:

I think the following question is related to what I am trying to do, but I can't see how to apply it, as I don't want to produce more indexes.

If you collect the names of your meta columns in a list, you can do this:

metas = ['meta1', 'meta2']

new_df = df.set_index(['name'] + metas).unstack('name')
print(new_df)

            data    
name          n1  n2
meta1 meta2         
a     g       y1  y2
b     h       y3  y4

That gets you most of the way there. Selecting the data level, dropping the columns' axis name and resetting the index gets you the rest of the way:

print(new_df['data'].rename_axis(None, axis=1).reset_index())

  meta1 meta2  n1  n2
0     a     g  y1  y2
1     b     h  y3  y4
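
For Python 3 and a recent pandas, the two steps above can be folded into one helper. This is only a minimal sketch and not part of the original answer: the function name unstack_names and its parameters are made up for illustration. Selecting the data column before unstacking avoids the extra column level, so nothing has to be sliced off afterwards.

```python
import pandas as pd

df = pd.DataFrame(
    [["a", "g", "n1", "y1"], ["a", "g", "n2", "y2"],
     ["b", "h", "n1", "y3"], ["b", "h", "n2", "y4"]],
    columns=["meta1", "meta2", "name", "data"],
)


def unstack_names(frame, metas, name_col="name", data_col="data"):
    """Hypothetical helper: spread name_col values into columns filled from data_col."""
    # Index by the meta columns plus the name column, keep only the data Series,
    # then move the name level from the index into the columns.
    wide = frame.set_index(metas + [name_col])[data_col].unstack(name_col)
    # Drop the leftover "name" label on the columns axis and restore a flat index.
    return wide.rename_axis(None, axis=1).reset_index()


result = unstack_names(df, ["meta1", "meta2"])
print(result)
```

Keep in mind that unstack raises a ValueError when the same combination of meta values and name occurs more than once, so this route only fits data where each combination is unique.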

You can use pivot_table with reset_index and rename_axis (new in pandas 0.18.0):

print (df.pivot_table(index=['meta1','meta2'], 
                      columns='name', 
                      values='data', 
                      aggfunc='first')
         .reset_index()
         .rename_axis(None, axis=1))

  meta1 meta2  n1  n2
0     a     g  y1  y2
1     b     h  y3  y4
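
If every combination of the meta columns and name is known to be unique, plain pivot may be enough, since it needs no aggfunc at all. The sketch below is an addition rather than part of the original answer, and it assumes pandas 1.1 or newer, where pivot accepts a list for index.

```python
import pandas as pd

df = pd.DataFrame(
    [["a", "g", "n1", "y1"], ["a", "g", "n2", "y2"],
     ["b", "h", "n1", "y3"], ["b", "h", "n2", "y4"]],
    columns=["meta1", "meta2", "name", "data"],
)

# Assumption: pandas >= 1.1, which allows a list of columns as the pivot index.
wide = (df.pivot(index=["meta1", "meta2"], columns="name", values="data")
          .reset_index()                  # turn meta1/meta2 back into columns
          .rename_axis(None, axis=1))     # drop the "name" label on the columns axis
print(wide)
```

Unlike pivot_table, pivot does not aggregate: it raises a ValueError on duplicate entries instead of silently picking or combining values.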

For pivot_table, though, it is generally better to pass ', '.join as aggfunc rather than 'first':

print (df.pivot_table(index=['meta1','meta2'], 
                      columns='name', 
                      values='data', 
                      aggfunc=', '.join)
         .reset_index()
         .rename_axis(None, axis=1))

  meta1 meta2  n1  n2
0     a     g  y1  y2
1     b     h  y3  y4

Explanation of why join is generally better than first:

If you use first, you can lose every value that is not the first one in its group (rows sharing the same index values and name), whereas join concatenates them:

import pandas as pd

df = pd.DataFrame([["a","g","n1","y1"], 
                   ["a","g","n2","y2"], 
                   ["a","g","n1","y3"], 
                   ["b","h","n2","y4"]], columns=["meta1", "meta2", "name", "data"])

print (df)
  meta1 meta2 name data
0     a     g   n1   y1
1     a     g   n2   y2
2     a     g   n1   y3
3     b     h   n2   y4

print (df.pivot_table(index=['meta1','meta2'], 
                      columns='name', 
                      values='data', 
                      aggfunc='first')
         .reset_index()
         .rename_axis(None, axis=1))
  meta1 meta2    n1  n2
0     a     g    y1  y2
1     b     h  None  y4

print (df.pivot_table(index=['meta1','meta2'], 
                      columns='name', 
                      values='data', 
                      aggfunc=', '.join)
         .reset_index()
         .rename_axis(None, axis=1))

  meta1 meta2      n1  n2
0     a     g  y1, y3  y2
1     b     h    None  y4 
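
Before choosing between first and join, it can help to check whether the data contains such repeated combinations at all. The following is a minimal sketch of one way to do that with DataFrame.duplicated; the variable names keys and clashes are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [["a", "g", "n1", "y1"], ["a", "g", "n2", "y2"],
     ["a", "g", "n1", "y3"], ["b", "h", "n2", "y4"]],
    columns=["meta1", "meta2", "name", "data"],
)

# The columns that together identify one cell of the reshaped frame.
keys = ["meta1", "meta2", "name"]

# keep=False flags every row of a repeated key, not only the later repeats.
clashes = df[df.duplicated(subset=keys, keep=False)]
print(clashes)
```

If clashes is empty, first and join give the same result; otherwise the listed rows are the ones that first would partly discard.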
