
How to concatenate pandas.DataFrames columns

I have a DataFrame called raw_df:

import pandas as pd

columns = ['force0', 'distance0', 'force1', 'distance1']

raw_data = [{'force0': 1.2, 'distance0': 0.0, 'force1': 0.5, 'distance1': 0.0},
            {'force0': 1.3, 'distance0': 0.1, 'force1': 0.6, 'distance1': 0.0},
            {'force0': 1.4, 'distance0': 0.2, 'force1': 0.7, 'distance1': 0.3},
            {'force0': 1.5, 'distance0': 0.5, 'force1': 0.8, 'distance1': 0.6}]

raw_df = pd.DataFrame(raw_data, columns=columns)

raw_df looks like this:

   force0  distance0  force1  distance1
0     1.2        0.0     0.5        0.0
1     1.3        0.1     0.6        0.0
2     1.4        0.2     0.7        0.3
3     1.5        0.5     0.8        0.6

At the moment there is no index, but I would like the distance columns to be combined into a single index, so that the result looks like this:

          force0  force1
distance                
0.0          1.2     0.5
0.0          NaN     0.6
0.1          1.3     NaN
0.2          1.4     NaN
0.3          NaN     0.7
0.5          1.5     NaN
0.6          NaN     0.8

Note that there were 2 entries in force1 for distance1 = 0.0.

The index (distances) should NOT be sorted: the distances increase then decrease variably, and the original order within each test is important.

Stefan posted an amazing answer to my poorly-described question, but it seemed to fill in any missing forces with other numbers (which would be misleading, because there were no force measurements for those distances in those tests). I have used np.nan for missing values as I think this is what pandas does.

I think that merge or join might do what I need, but I couldn't understand the docs.

Perhaps pandas.DataFrame was not designed for such data, and I should use numpy.genfromtxt instead and just select the columns I need on the fly: I don't see any advantage to using a pandas.DataFrame if I'm selecting columns on the fly (because I'm not using an index in that case).

Thanks for any help.

If I'm understanding correctly, you are starting from a situation similar to this:

import numpy as np
import pandas as pd

columns = list(sum(list(zip(['Forces{}'.format(i) for i in range(4)], ['Distances{}'.format(i) for i in range(4)])), ()))
df = pd.DataFrame(np.random.randint(1, 11, size=(100, 8)), columns=columns)

   Forces0  Distances0  Forces1  Distances1  Forces2  Distances2  Forces3  \
0        3           5        8           3        7           4        2   
1        1           4       10           9        9           3        6   
2       10           3        1           3        3           7        8   
3        2           1        3           6       10          10       10   
4        4           2        9           1        3          10        8   

   Distances3  
0           8  
1           5  
2           3  
3           8  
4           8  

and you are aiming to have the various Distance columns form a single index while the respective Force columns remain in place. You could stack the frame like so:

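# Move the Force columns into the index so that only the Distance columns remain.
df.set_index([c for c in df.columns if c.startswith('Force')], inplace=True)
# Stack the Distance columns into one column, then restore the Force columns.
df = df.stack().reset_index(level=-1, drop=True).reset_index().rename(columns={0: 'Distance'})
df.set_index(['Distance'], inplace=True)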

to get:

          Forces0  Forces1  Forces2  Forces3
Distance                                    
9               7        4        6        7
9               7        4        6        7
1               7        4        6        7
6               7        4        6        7
5               1        2        3        1
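Note that stacking this way repeats each row's forces for every distance in that row. For the raw_df in the question, a minimal sketch that instead leaves unmeasured forces as NaN is to index each test by its own distance column and stack the pieces with pd.concat. The names pieces and combined are only illustrative, and this sketch rests on one assumption: it keeps each test's rows in their original order and does not pair up equal distances from different tests on the same row, so it differs from the desired table in that respect.

```python
import pandas as pd

raw_df = pd.DataFrame({
    'force0': [1.2, 1.3, 1.4, 1.5],
    'distance0': [0.0, 0.1, 0.2, 0.5],
    'force1': [0.5, 0.6, 0.7, 0.8],
    'distance1': [0.0, 0.0, 0.3, 0.6],
})

# Build one small frame per test, indexed by that test's own distances.
pieces = []
for i in range(2):
    piece = (
        raw_df[[f'distance{i}', f'force{i}']]
        .rename(columns={f'distance{i}': 'distance'})
        .set_index('distance')
    )
    pieces.append(piece)

# Stack the pieces row-wise. Each piece has only its own force column,
# so the other force column is NaN for those rows; nothing is filled in
# and nothing is sorted.
combined = pd.concat(pieces)
print(combined)
```

Because the index is left unsorted and may contain duplicates, selecting by label (for example combined.loc[0.0]) can return several rows.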

I solved the problem using a MultiIndex DataFrame:

  1. Read each test into a separate DataFrame using pd.read_csv()
  2. Combined the DataFrames into one using df = pd.concat(frame_list, keys=test_names)

Rather than write a long description here, I wrote a Jupyter notebook on the subject, comparing the MultiIndex method against just keeping a standard Python list of DataFrames.
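The two steps above can be sketched roughly as follows. The CSV contents, the test names and the column names distance and force are made up for illustration (the real files are not shown here), and io.StringIO stands in for file paths so that the snippet runs on its own.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for the per-test files.
csv_texts = {
    'test0': "distance,force\n0.0,1.2\n0.1,1.3\n0.2,1.4\n0.5,1.5\n",
    'test1': "distance,force\n0.0,0.5\n0.0,0.6\n0.3,0.7\n0.6,0.8\n",
}

test_names = list(csv_texts)

# Step 1: read each test into its own DataFrame.
frame_list = [pd.read_csv(io.StringIO(csv_texts[name])) for name in test_names]

# Step 2: combine them; keys become the outer level of a MultiIndex.
df = pd.concat(frame_list, keys=test_names, names=['test', 'row'])

# Select one test by its key; row order within the test is preserved.
test1 = df.loc['test1']
print(test1)
```

Each test keeps its own distances and row order, so no force values have to be invented for distances that a test never measured.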
