pandas: concatenate dataframes, forward-fill and multiindex on column data
I have 2 csv files with the same column names, but different values.
The first column is the index (time) and one of the data columns is a unique identifier (id).
The index (time) is different for each csv file.
I have read the data into 2 dataframes using read_csv, giving me the following:
+-------+-----+------+-------+
|       | id  | size | price |
+-------+-----+------+-------+
| time  |     |      |       |
+-------+-----+------+-------+
| t0    | ID1 | 10   | 110   |
| t2    | ID1 | 12   | 109   |
| t6    | ID1 | 20   | 108   |
+-------+-----+------+-------+
+-------+-----+------+-------+
|       | id  | size | price |
+-------+-----+------+-------+
| time  |     |      |       |
+-------+-----+------+-------+
| t1    | ID2 | 9    | 97    |
| t3    | ID2 | 15   | 94    |
| t5    | ID2 | 13   | 100   |
+-------+-----+------+-------+
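For anyone who wants to reproduce the setup, here is a minimal sketch of how the two dataframes above could be loaded. The CSV text is an assumption reconstructed from the tables (the real files are not shown), and io.StringIO stands in for the file paths.

```python
import io

import pandas as pd

# Hypothetical CSV contents reconstructed from the tables above;
# in practice these would be two files on disk.
csv1 = "time,id,size,price\nt0,ID1,10,110\nt2,ID1,12,109\nt6,ID1,20,108\n"
csv2 = "time,id,size,price\nt1,ID2,9,97\nt3,ID2,15,94\nt5,ID2,13,100\n"

# index_col="time" makes the first column the index, as described.
df1 = pd.read_csv(io.StringIO(csv1), index_col="time")
df2 = pd.read_csv(io.StringIO(csv2), index_col="time")

print(df1)
print(df2)
```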
I would like to create a single large dataframe with entries for both, and use ffill to forward fill values from the previous time-step.
I am able to achieve this using a combination of concat, sort_index (sort in older pandas) and ffill.
However, it requires renaming the columns of one of the dataframes first, so that there aren't name clashes:
# rename the data columns of df2 so they do not collide with those of df1
df2.columns = ['id', 'id2_size', 'id2_price']
# stack the rows, order them by time, then carry the last known values forward
df = pd.concat([df1, df2]).sort_index().ffill()
This results in the following dataframe:
+-------+-----+------+-------+----------+-----------+
|       | id  | size | price | id2_size | id2_price |
+-------+-----+------+-------+----------+-----------+
| time  |     |      |       |          |           |
+-------+-----+------+-------+----------+-----------+
| t0    | ID1 | 10   | 110   | nan      | nan       |
| t1    | ID2 | 10   | 110   | 9        | 97        |
| t2    | ID1 | 12   | 109   | 9        | 97        |
| t3    | ID2 | 12   | 109   | 15       | 94        |
| t5    | ID2 | 12   | 109   | 13       | 100       |
| t6    | ID1 | 20   | 108   | 13       | 100       |
+-------+-----+------+-------+----------+-----------+
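The rename-then-concatenate approach can be sketched as a self-contained script for current pandas, where DataFrame.sort is no longer available and sort_index takes its place. The frames are built inline from the tables above rather than read from CSV, and rename is used instead of assigning to columns; both are choices made for this sketch only.

```python
import pandas as pd

# Sample frames built inline from the tables above, indexed by time.
df1 = pd.DataFrame(
    {"id": ["ID1"] * 3, "size": [10, 12, 20], "price": [110, 109, 108]},
    index=pd.Index(["t0", "t2", "t6"], name="time"),
)
df2 = pd.DataFrame(
    {"id": ["ID2"] * 3, "size": [9, 15, 13], "price": [97, 94, 100]},
    index=pd.Index(["t1", "t3", "t5"], name="time"),
)

# Rename df2's data columns so they do not clash with df1's.
df2 = df2.rename(columns={"size": "id2_size", "price": "id2_price"})

# Stack the rows, order them by time, then forward-fill the gaps.
df = pd.concat([df1, df2]).sort_index().ffill()
print(df)
```

The first row keeps NaN in the id2_ columns because there is no earlier ID2 row to fill from.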
My current method is fairly clunky in that I have to rename the columns of one of the dataframes.
I believe a better way to represent the data would be to use a multiindex on the columns, with the values of the outer level coming from the id column.
The resulting dataframe would look like this:
        +--------------+--------------+
        |     ID1      |     ID2      |
        +------+-------+------+-------+
        | size | price | size | price |
+-------+------+-------+------+-------+
| time  |      |       |      |       |
+-------+------+-------+------+-------+
| t0    | 10   | 110   | nan  | nan   |
| t1    | 10   | 110   | 9    | 97    |
| t2    | 12   | 109   | 9    | 97    |
| t3    | 12   | 109   | 15   | 94    |
| t5    | 12   | 109   | 13   | 100   |
| t6    | 20   | 108   | 13   | 100   |
+-------+------+-------+------+-------+
Is this possible?
If so, what steps would be required to go from the 2 dataframes read from csv to the final merged, multiindexed dataframe?
Here's a one-liner that does what you ask, although it's a bit convoluted in terms of stacking/unstacking:
# assumes time is a regular column; call reset_index() first if it is the index
pd.concat([df1, df2]).set_index(['time','id']).sort_index().stack().unstack(level=[1,2]).ffill()
id     ID1          ID2
      size  price  size  price
time
t0      10    110   NaN    NaN
t1      10    110     9     97
t2      12    109     9     97
t3      12    109    15     94
t5      12    109    13    100
t6      20    108    13    100
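Since DataFrame.append and DataFrame.sort are no longer available in current pandas, here is a sketch of the same stack/unstack chain using pd.concat and sort_index. It assumes, as the one-liner does, that time is a regular column rather than the index; the sample frames are built inline from the question's tables.

```python
import pandas as pd

# Sample frames from the question, with time as a regular column.
df1 = pd.DataFrame(
    {"time": ["t0", "t2", "t6"], "id": "ID1",
     "size": [10, 12, 20], "price": [110, 109, 108]}
)
df2 = pd.DataFrame(
    {"time": ["t1", "t3", "t5"], "id": "ID2",
     "size": [9, 15, 13], "price": [97, 94, 100]}
)

wide = (
    pd.concat([df1, df2])          # all rows in one frame
    .set_index(["time", "id"])     # rows keyed by (time, id)
    .sort_index()
    .stack()                       # size/price become a third index level
    .unstack(level=[1, 2])         # move id and size/price to the columns
    .ffill()                       # carry the last known values forward
)
print(wide)
```

Individual series can then be selected with a tuple key such as wide[("ID1", "size")]. The order of size and price inside each id may vary between pandas versions, so selecting by label is safer than by position.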
FWIW, my default approach would have been something like the following, which is a little more straightforward (less stacking/unstacking) and gives the same basic results, but with a different column organization:
# same assumption as above: time is a regular column
pd.concat([df1, df2]).set_index(['time','id']).sort_index().unstack().ffill()
      size       price
id     ID1  ID2    ID1  ID2
time
t0      10  NaN    110  NaN
t1      10    9    110   97
t2      12    9    109   97
t3      12   15    109   94
t5      12   13    109  100
t6      20   13    108  100
And along those lines, you could then add swaplevel and sort_index (sort in older pandas) to get the columns grouped by id, as in the first approach:
pd.concat([df1, df2]).set_index(['time','id']).sort_index().unstack().ffill().swaplevel(0,1,axis=1).sort_index(axis=1)
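The unstack, swaplevel and column-sort steps can be sketched end to end for current pandas, where pd.concat and sort_index stand in for the older append and sort methods. The sample frames are built inline from the question's tables and assume time is a regular column.

```python
import pandas as pd

# Sample frames from the question, with time as a regular column.
df1 = pd.DataFrame(
    {"time": ["t0", "t2", "t6"], "id": "ID1",
     "size": [10, 12, 20], "price": [110, 109, 108]}
)
df2 = pd.DataFrame(
    {"time": ["t1", "t3", "t5"], "id": "ID2",
     "size": [9, 15, 13], "price": [97, 94, 100]}
)

# Columns come out as (size|price, id) after unstack.
by_field = (
    pd.concat([df1, df2])
    .set_index(["time", "id"])
    .sort_index()
    .unstack()
    .ffill()
)

# Swap the two column levels and sort so that columns are grouped by id.
by_id = by_field.swaplevel(0, 1, axis=1).sort_index(axis=1)

print(by_field)
print(by_id)
```

Sorting the columns groups them by id; the order of size and price within each id depends on how the pandas version in use sorts the inner level, so it may not match the first approach exactly.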