Pandas MultiIndex from a series of dataframes

I have a series of dataframes with identical structure that represent the results of a simulation for each hour of the year. Each simulation contains results for a series of coordinates (x, y).

Each dataframe is imported from a CSV file that carries the time information only in its file name. Example:

results_YYMMDDHH.csv

contains data such as:

   x   y         a         b
 0.0 0.0  0.318705 -0.871259
 0.1 0.0 -0.937012  0.704270
 0.1 0.1 -0.032225 -1.939544
 0.0 0.1 -1.874781 -0.033073

I would like to create a single MultiIndexed DataFrame (level 0 is time and level 1 is (x, y)) that would allow me to perform various operations such as averages, sums, max, etc. across these dataframes using the resampling or groupby methods.

For each time step, the resulting dataframe should look something like this:

                       x   y         a         b
2010-01-01 10:00     0.0 0.0  0.318705 -0.871259
                     0.1 0.0 -0.934512  0.745270
                     0.1 0.1 -0.0334525 -1.963544
                     0.0 0.1 -1.835781 -0.067573

2010-01-01 11:00     0.0 0.0  0.318705 -0.871259
                     0.1 0.0 -0.923012  0.745670
                     0.1 0.1 -0.035225 -1.963544
                     0.0 0.1 -1.835781 -0.067573
.................
.................
2010-12-01 10:00     0.0 0.0  0.318705 -0.871259
                     0.1 0.0 -0.923012  0.723270
                     0.1 0.1 -0.034225 -1.963234
                     0.0 0.1 -1.835781 -0.067233

You can imagine this for each hour of the year. I would now like to be able to calculate, for example, the average for the whole year or the average for June, as well as any other function, such as the number of hours above a certain threshold or between a min and a max value. Please bear in mind that the result of any of these operations should be a DataFrame. For example, the monthly averages should look like:

              x   y     a     b
2010-01     0.0 0.0  0.45 -0.13
2010-02     0.1 0.0  0.55 -0.87
2010-03     0.1 0.1  0.24 -0.83
2010-04     0.0 0.1  0.11 -0.87

How do I build this MultiIndexed dataframe? I picture it as a time series of dataframes.

I would make a Panel, then convert it into a multiindexed DataFrame using to_frame():

In [29]: df1 = pd.DataFrame(dict(a=[0.318705,-0.937012,-0.032225,-1.874781], b=[-0.871259,0.704270,-1.939544,-0.033073]))

In [30]: df2 = pd.DataFrame(dict(a=[0.318705,-0.937012,-0.032225,-1.874781], b=[-0.871259,0.704270,-1.939544,-0.033073]))

In [31]: df1
Out[31]:
          a         b
0  0.318705 -0.871259
1 -0.937012  0.704270
2 -0.032225 -1.939544
3 -1.874781 -0.033073

In [32]: data = {datetime.datetime(2010,6,21,10,0,0): df1, datetime.datetime(2010,6,22,10,0,0): df2}

In [33]: p = pd.Panel(data)

In [34]: p.to_frame()
Out[34]:
             2010-06-21 10:00:00  2010-06-22 10:00:00
major minor
0     a                 0.318705             0.318705
      b                -0.871259            -0.871259
1     a                -0.937012            -0.937012
      b                 0.704270             0.704270
2     a                -0.032225            -0.032225
      b                -1.939544            -1.939544
3     a                -1.874781            -1.874781
      b                -0.033073            -0.033073

Depending on how you want to look at your data, you can use swapaxes to rearrange it:

In [35]: p.swapaxes("major", "items").to_frame()
Out[35]:
                                  0         1         2         3
major               minor
2010-06-21 10:00:00 a      0.318705 -0.937012 -0.032225 -1.874781
                    b     -0.871259  0.704270 -1.939544 -0.033073
2010-06-22 10:00:00 a      0.318705 -0.937012 -0.032225 -1.874781
                    b     -0.871259  0.704270 -1.939544 -0.033073
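A caveat on the session above: pd.Panel has since been removed from pandas (as far as I know, it was deprecated in 0.20 and dropped in 0.25), so it only runs on older versions. A rough equivalent on current pandas, assuming the same dict of timestamp to DataFrame, is to hand the dict to pd.concat, which uses the dict keys as the outer index level. The level names "datetime" and "row" below are illustrative choices, not something from the original answer.

```python
import datetime

import pandas as pd

df1 = pd.DataFrame(dict(a=[0.318705, -0.937012, -0.032225, -1.874781],
                        b=[-0.871259, 0.704270, -1.939544, -0.033073]))
df2 = df1.copy()

data = {datetime.datetime(2010, 6, 21, 10): df1,
        datetime.datetime(2010, 6, 22, 10): df2}

# The dict keys become the outer index level; each frame's own row
# labels become the inner level. "datetime" and "row" are made-up names.
stacked = pd.concat(data, names=["datetime", "row"])

print(stacked)
```

Unlike the Panel output, this keeps a and b as columns and puts the timestamp in the index, which is closer to the layout asked for in the question.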

Here is a different answer from my earlier one, in light of the more fully explained question. Iterate through the files and read them into pandas, parse the date from the file name and add it to the dataframe, then use set_index to create your MultiIndex. Once you've got all your dataframes, use pd.concat to combine them:

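import datetime
import pandas as pd

dataframes = []
for filename in filenames:  # bare names such as "results_2010010110.csv"
    df = pd.read_csv(filename)
    # filename[8:18] is the 10-character timestamp after "results_" (YYYYMMDDHH)
    df["datetime"] = datetime.datetime.strptime(filename[8:18], "%Y%m%d%H")
    dataframes.append(df.set_index(["datetime", "x", "y"]))

combined_df = pd.concat(dataframes)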
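Once combined_df exists, the operations from the question can be sketched with groupby on the index levels. The example below is only a sketch: it builds a small made-up stand-in for combined_df (two coordinates, four time steps, invented values), and the 0.5 threshold is arbitrary. Each result is itself a DataFrame, as the question requires.

```python
import pandas as pd

# Made-up stand-in for combined_df: 2 coordinates x 4 time steps in two months.
times = pd.to_datetime(["2010-01-01 10:00", "2010-01-01 11:00",
                        "2010-02-01 10:00", "2010-02-01 11:00"])
coords = [(0.0, 0.0), (0.1, 0.0)]
rows = [(t, x, y, float(i), float(-i))
        for i, t in enumerate(times) for (x, y) in coords]
combined_df = (pd.DataFrame(rows, columns=["datetime", "x", "y", "a", "b"])
               .set_index(["datetime", "x", "y"]))

# Monthly mean per coordinate: bin the datetime level by month start ("MS")
# and keep x and y as grouping levels.
monthly = combined_df.groupby([pd.Grouper(level="datetime", freq="MS"),
                               pd.Grouper(level="x"),
                               pd.Grouper(level="y")]).mean()

# Mean over the whole period per coordinate.
overall = combined_df.groupby(level=["x", "y"]).mean()

# Number of time steps per coordinate where each column exceeds a threshold.
hours_above = (combined_df > 0.5).groupby(level=["x", "y"]).sum()

print(monthly)
print(overall)
print(hours_above)
```

A count of hours between a min and a max value follows the same pattern, with the boolean mask built from two comparisons combined with &.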
