“Parallel” indexing in pandas (not hierarchical)

Short version: I have two TimeSeries (recording start and recording end) I would like to use as indices for data in a Panel (or DataFrame). Not hierarchical, but parallel. I am uncertain how to do this.

Long version:

I am constructing a pandas Panel with some data akin to temperature and density at certain distances from an antenna. As I see it, the most natural structure is to have e.g. temp and dens as items (i.e. sub-DataFrames of the Panel), recording time as the major axis (index), and distance from the antenna as the minor axis (columns).

My problem is this: For each recording, the instrument averages/integrates over some amount of time. Thus, for each data dump, two timestamps are saved: start recording and end recording. I need both of those. Thus, I would need something which might be called "parallel indexing", where two different TimeSeries (startRec and endRec) work as indices, and I can get whichever I prefer for a certain data point. Of course, I don't really need to index by both, but both need to be naturally available in the data structure. For example, for any given temperature or density recording, I need to be able to get both the start and end time of the recording.

I could of course keep the two TimeSeries in a separate DataFrame, but with the main point of pandas being automatic data alignment, this is not really ideal.

How can I best achieve this?

Example data

Sample Panel with three recordings at two distances from the antenna:

import pandas as pd
import numpy as np

data = pd.Panel(data={'temp': np.array([[21, 20],
                                        [19, 17],
                                        [15, 14]]),
                      'dens': np.array([[1001, 1002],
                                        [1000, 998],
                                        [997, 995]])},
                minor_axis=['1m', '3m'])

Output of data:

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 2 (minor_axis)
Items axis: dens to temp
Major_axis axis: 0 to 2
Minor_axis axis: 1m to 3m

Here, the major axis is currently only an integer-based index (0 to 2). The minor axis is the two measurement distances from the antenna.

I have two TimeSeries I'd like to use as indices:

from datetime import datetime
startRec = pd.TimeSeries([datetime(2013, 11, 11, 15, 00, 00),
                          datetime(2013, 11, 12, 15, 00, 00),
                          datetime(2013, 11, 13, 15, 00, 00)])

endRec = pd.TimeSeries([datetime(2013, 11, 11, 15, 00, 10),
                        datetime(2013, 11, 12, 15, 00, 10),
                        datetime(2013, 11, 13, 15, 00, 10)])

Output of startRec:

0   2013-11-11 15:00:00
1   2013-11-12 15:00:00
2   2013-11-13 15:00:00
dtype: datetime64[ns]

Being in a Panel makes this a little trickier. I typically stick with DataFrames.

But how does this look:

import numpy as np
import pandas as pd
from datetime import datetime
startRec = pd.TimeSeries([datetime(2013, 11, 11, 15, 0, 0),
                          datetime(2013, 11, 12, 15, 0, 0),
                          datetime(2013, 11, 13, 15, 0, 0)])

endRec = pd.TimeSeries([datetime(2013, 11, 11, 15, 0, 10),
                        datetime(2013, 11, 12, 15, 0, 10),
                        datetime(2013, 11, 13, 15, 0, 10)])
_data1m = pd.DataFrame(data={
                          'temp': np.array([21, 19, 15]),
                          'dens': np.array([1001, 1000, 997]),
                          'start': startRec,
                          'end': endRec
                          }
                    )

_data3m = pd.DataFrame(data={
                          'temp': np.array([20, 17, 14]),
                          'dens': np.array([1002, 998, 995]),
                          'start': startRec,
                          'end': endRec
                          }
                    )


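# Use both timestamps as a two-level row index so each row carries its start and end.
_data1m.set_index(['start', 'end'], inplace=True)
_data3m.set_index(['start', 'end'], inplace=True)

# One item per distance; the (start, end) pairs become the major axis.
data = pd.Panel(data={'1m': _data1m, '3m': _data3m})
# Each row label is a (start, end) tuple: row[0] is the start, row[1] the end.
data.loc['3m'].select(lambda row: row[0] < pd.Timestamp('2013-11-12') or
                                  row[1] < pd.Timestamp('2013-11-13'))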

and that outputs:

                                         dens  temp
start               end                            
2013-11-11 15:00:00 2013-11-11 15:00:10  1002    20
2013-11-12 15:00:00 2013-11-12 15:00:10   998    17
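
Note that Panel, TimeSeries, and select() are no longer available in recent pandas releases, so the snippet above only runs on older versions. Here is a minimal sketch of the same idea with a single DataFrame, assuming a (start, end) MultiIndex on the rows and a (quantity, distance) MultiIndex on the columns in place of the Panel's items and minor axis. The names start_rec, end_rec, rows, cols, and subset_3m are illustrative and not from the original post.

```python
import numpy as np
import pandas as pd

# Start and end timestamps of the three recordings (each lasts 10 seconds).
start_rec = pd.to_datetime(["2013-11-11 15:00:00",
                            "2013-11-12 15:00:00",
                            "2013-11-13 15:00:00"])
end_rec = start_rec + pd.Timedelta(seconds=10)

# Rows: one (start, end) pair per recording.
rows = pd.MultiIndex.from_arrays([start_rec, end_rec], names=["start", "end"])
# Columns: (quantity, distance) pairs, standing in for the Panel's items and minor axis.
cols = pd.MultiIndex.from_product([["temp", "dens"], ["1m", "3m"]],
                                  names=["quantity", "distance"])

# Column order follows cols: temp/1m, temp/3m, dens/1m, dens/3m.
values = np.array([[21, 20, 1001, 1002],
                   [19, 17, 1000, 998],
                   [15, 14, 997, 995]])
data = pd.DataFrame(values, index=rows, columns=cols)

# Both timestamps stay attached to every row and can be read back by level name.
starts = data.index.get_level_values("start")
ends = data.index.get_level_values("end")

# Same condition as the select() call, written as a boolean mask.
mask = (starts < pd.Timestamp("2013-11-12")) | (ends < pd.Timestamp("2013-11-13"))

# Keep the matching recordings, then take the 3 m cross-section of the columns.
subset_3m = data.loc[mask].xs("3m", axis=1, level="distance")
print(subset_3m)
```

Here subset_3m holds the first two recordings at 3 m, the same rows as in the output above. Because start and end are both index levels, they stay aligned with the data through slicing and filtering, and either one can be pulled out with get_level_values whenever it is needed.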
