
pandas: concatenate dataframes, forward-fill and multiindex on column data

I have 2 csv files with the same column names, but different values.

The first column is the index (time) and one of the data columns is a unique identifier (id).

The index (time) is different for each csv file.

I have read the data into 2 dataframes using read_csv, giving me the following:

        +-------+------+-------+
        | id    | size | price |
+-------+-------+------+-------+
| time  |       |      |       |
+-------+-------+------+-------+
| t0    | ID1   | 10   | 110   |
| t2    | ID1   | 12   | 109   |
| t6    | ID1   | 20   | 108   |
+-------+-------+------+-------+

        +-------+------+-------+
        | id    | size | price |
+-------+-------+------+-------+
| time  |       |      |       |
+-------+-------+------+-------+
| t1    | ID2   |  9   |  97   |
| t3    | ID2   | 15   |  94   |
| t5    | ID2   | 13   | 100   |
+-------+-------+------+-------+
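
For reference, frames like the two above can be reproduced with a minimal sketch such as the following. The inline CSV text is a stand-in for the real files, and passing index_col="time" is an assumption about how read_csv was called.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for the two files.
CSV1 = """time,id,size,price
t0,ID1,10,110
t2,ID1,12,109
t6,ID1,20,108
"""
CSV2 = """time,id,size,price
t1,ID2,9,97
t3,ID2,15,94
t5,ID2,13,100
"""

# index_col="time" makes the time column the index, as in the tables above.
df1 = pd.read_csv(io.StringIO(CSV1), index_col="time")
df2 = pd.read_csv(io.StringIO(CSV2), index_col="time")

print(df1)
print(df2)
```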

I would like to create a single large dataframe with entries for both, and use ffill to forward-fill values from the previous time-step.

I am able to achieve this using a combination of concat, sort and ffill.

However, it requires renaming the columns of one of the dataframes first, so that there aren't name clashes:

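# rename df2's value columns so they don't clash with df1's
df2.columns = ['id', 'id2_size', 'id2_price']
# DataFrame.sort() was removed from pandas; sort_index() orders the rows by time
df = pd.concat([df1, df2]).sort_index().ffill()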

This results in the following dataframe:

        +------+------+-------+----------+-----------+
        | id   | size | price | id2_size | id2_price |
+-------+------+------+-------+----------+-----------+
| time  |      |      |       |          |           |
+-------+------+------+-------+----------+-----------+
| t0    | ID1  | 10   | 110   |     nan  |     nan   |
| t1    | ID2  | 10   | 110   |      9   |      97   |
| t2    | ID1  | 12   | 109   |      9   |      97   |
| t3    | ID2  | 12   | 109   |     15   |      94   |
| t5    | ID2  | 12   | 109   |     13   |     100   |
| t6    | ID1  | 20   | 108   |     13   |     100   |
+-------+------+------+-------+----------+-----------+
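
The rename-then-concat approach can be checked end to end with a small sketch like this one. The make_frame helper is hypothetical and only rebuilds the sample data; rename(columns=...) is used in place of assigning to df2.columns, and sort_index() is assumed as the current spelling of the index sort.

```python
import pandas as pd


def make_frame(times, ident, sizes, prices):
    """Hypothetical helper: build one frame shaped like the sample tables."""
    return pd.DataFrame(
        {"id": ident, "size": sizes, "price": prices},
        index=pd.Index(times, name="time"),
    )


df1 = make_frame(["t0", "t2", "t6"], "ID1", [10, 12, 20], [110, 109, 108])
df2 = make_frame(["t1", "t3", "t5"], "ID2", [9, 15, 13], [97, 94, 100])

# Rename only the value columns of df2 so they do not collide with df1's.
df2 = df2.rename(columns={"size": "id2_size", "price": "id2_price"})

# Stack the rows, order them by time, then carry the last seen value forward.
df = pd.concat([df1, df2]).sort_index().ffill()
print(df)
```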

My current method is fairly clunky in that I have to rename the columns of one of the dataframes.

I believe a better way to represent the data would be to use a multiindex, with the 2nd dimension's value coming from the id column.

The resulting dataframe would look like this:

        +--------------+--------------+
        | ID1          | ID2          |
        +------+-------+------+-------+
        | size | price | size | price |
+-------+------+-------+------+-------+
| time  |      |       |      |       |
+-------+------+-------+------+-------+
| t0    | 10   | 110   | nan  | nan   |
| t1    | 10   | 110   |  9   |  97   |
| t2    | 12   | 109   |  9   |  97   |
| t3    | 12   | 109   | 15   |  94   |
| t5    | 12   | 109   | 13   | 100   |
| t6    | 20   | 108   | 13   | 100   |
+-------+------+-------+------+-------+

Is this possible?
If so, what steps would be required to go from the 2 dataframes read from csv, to the final merged multiindexed dataframe?

Here's a one-liner that does what you ask, although it's a bit convoluted in terms of stacking/unstacking:

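# assumes time is a regular column (call reset_index() first if it is the index); uses pd.concat and sort_index because DataFrame.append and DataFrame.sort were removed from pandas
pd.concat([df1, df2]).set_index(['time','id']).sort_index().stack().unstack(level=[1,2]).ffill()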

id    ID1        ID2      
     size price size price
time                      
t0     10   110  NaN   NaN
t1     10   110    9    97
t2     12   109    9    97
t3     12   109   15    94
t5     12   109   13   100
t6     20   108   13   100
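
A self-contained sketch of this stack/unstack route follows. It assumes time is already the index (as in the question), so id is appended to the index instead of calling set_index on both names; make_frame is a hypothetical helper that only rebuilds the sample data.

```python
import pandas as pd


def make_frame(times, ident, sizes, prices):
    """Hypothetical helper: build one frame shaped like the sample tables."""
    return pd.DataFrame(
        {"id": ident, "size": sizes, "price": prices},
        index=pd.Index(times, name="time"),
    )


df1 = make_frame(["t0", "t2", "t6"], "ID1", [10, 12, 20], [110, 109, 108])
df2 = make_frame(["t1", "t3", "t5"], "ID2", [9, 15, 13], [97, 94, 100])

wide = (
    pd.concat([df1, df2])
    .set_index("id", append=True)  # index becomes (time, id)
    .sort_index()
    .stack()                       # long Series indexed by (time, id, field)
    .unstack(level=[1, 2])         # move id and field into the columns
    .ffill()                       # carry each id's last values forward
)
print(wide)
```

Individual cells can then be addressed with a (id, field) tuple, for example wide.loc["t3", ("ID2", "size")].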

FWIW, my default approach would have been something like the following, which is a little more straightforward (less stacking/unstacking) and would give you the same basic results, but with a different column organization:

pd.concat([df1, df2]).set_index(['time','id']).sort_index().unstack().ffill()

     size     price     
id    ID1 ID2   ID1  ID2
time                    
t0     10 NaN   110  NaN
t1     10   9   110   97
t2     12   9   109   97
t3     12  15   109   94
t5     12  13   109  100
t6     20  13   108  100
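
The same idea as a runnable sketch, again assuming time is already the index and using a hypothetical make_frame helper for the sample data. With no arguments, unstack() moves the innermost index level (here id) into the columns.

```python
import pandas as pd


def make_frame(times, ident, sizes, prices):
    """Hypothetical helper: build one frame shaped like the sample tables."""
    return pd.DataFrame(
        {"id": ident, "size": sizes, "price": prices},
        index=pd.Index(times, name="time"),
    )


df1 = make_frame(["t0", "t2", "t6"], "ID1", [10, 12, 20], [110, 109, 108])
df2 = make_frame(["t1", "t3", "t5"], "ID2", [9, 15, 13], [97, 94, 100])

by_field = (
    pd.concat([df1, df2])
    .set_index("id", append=True)  # index becomes (time, id)
    .sort_index()
    .unstack()                     # columns become (field, id)
    .ffill()
)
print(by_field)
```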

And along those lines, you could then add swaplevel and sort the columns to get them reorganized by id, as in the first approach:

pd.concat([df1, df2]).set_index(['time','id']).sort_index().unstack().ffill().swaplevel(0,1,axis=1).sort_index(axis=1)
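
A short sketch of the column reshuffle on its own, under the same assumptions as before (time is the index, make_frame is a hypothetical helper). One detail worth knowing: sorting the columns orders them lexicographically, so within each id the price column ends up before size.

```python
import pandas as pd


def make_frame(times, ident, sizes, prices):
    """Hypothetical helper: build one frame shaped like the sample tables."""
    return pd.DataFrame(
        {"id": ident, "size": sizes, "price": prices},
        index=pd.Index(times, name="time"),
    )


df1 = make_frame(["t0", "t2", "t6"], "ID1", [10, 12, 20], [110, 109, 108])
df2 = make_frame(["t1", "t3", "t5"], "ID2", [9, 15, 13], [97, 94, 100])

by_field = (
    pd.concat([df1, df2])
    .set_index("id", append=True)
    .sort_index()
    .unstack()
    .ffill()
)

# Swap the two column levels so id comes first, then sort the columns
# so that everything belonging to one id sits together.
by_id = by_field.swaplevel(0, 1, axis=1).sort_index(axis=1)
print(by_id)
```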
