
Extracting data from Pandas dataframe as dataframe

One of the biggest problems I have experienced with Python's Pandas is that results keep defaulting to the pandas.core.series.Series type. For example:

import numpy as np
import pandas as pd

a = pd.DataFrame( np.random.randn(5,5),columns=list('ABCDE') )
b = a.mean(axis=0)

>>> b
    A    0.399677
    B    0.080594
    C    0.060423
    D   -1.206630
    E    0.153359
    dtype: float64

>>> type(b)
<class 'pandas.core.series.Series'>

So, if I try to insert the result into a new data frame, I get all sorts of errors (e.g. dimension mismatch). It seems to me that when I perform an operation on a data frame, the output should be a data frame, not a Series. Does anyone have a recommendation on how to use, e.g., df.mean() and have a data frame returned?
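A minimal sketch of two ways to keep the result of a reduction two-dimensional, assuming that a one-row frame with the original columns is what is wanted. The names `row` and `row_agg` are only illustrative and do not come from the question.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.random.randn(5, 5), columns=list("ABCDE"))

# Option 1: turn the Series into a one-column frame, then transpose it
# so that A..E are columns again and the single row is labelled "mean".
row = a.mean(axis=0).to_frame(name="mean").T

# Option 2: aggregating with a *list* of function names returns a DataFrame
# with one row per function instead of a Series.
row_agg = a.agg(["mean"])

print(type(row).__name__, row.shape)
print(type(row_agg).__name__, row_agg.shape)
```

Either result can be passed to pd.concat along axis=0 with other frames that share the same columns.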

BEGIN EDIT: Sorry, I should have given more detail.
I want to selectively average slices of my original data frame and insert these averaged values into a separate data frame.

# This is how I've been trying to do it
# Using <a> from above
b = pd.DataFrame()

# Select out data from original data frame
tmp = a[a.A > 5].mean() # Just an example, this is not really my selection criteria

# Now I want to store these averaged values in my aggregated data frame.  
b = pd.concat( [b,tmp] )

I guess my real question is: how can I average data in one data frame and pass it into another for storage? END EDIT
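One possible pattern for this, sketched under assumptions: collect each averaged Series in a dict keyed by a row label, then build the storage frame once at the end rather than concatenating inside a loop. The selection criteria and labels below are hypothetical, and the input values are made up so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in data: values -12 .. 12 laid out row by row.
a = pd.DataFrame(np.arange(25.0).reshape(5, 5) - 12, columns=list("ABCDE"))

# Hypothetical selection criteria, keyed by the label each averaged row gets.
selections = {
    "A_positive": a["A"] > 0,
    "B_negative": a["B"] < 0,
}

rows = {}
for label, mask in selections.items():
    subset = a[mask]
    if not subset.empty:             # skip selections that match nothing
        rows[label] = subset.mean()  # a Series indexed by column name

# Each Series becomes one row; the dict keys become the row index.
b = pd.DataFrame.from_dict(rows, orient="index")
print(b)
```

Building the frame once avoids repeatedly copying `b`, and the Series returned by mean() never has to be reshaped by hand.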

EDIT, take 2: I have two data sets (both stored as data frames), both of which are time series. Both time series have irregular time stamps: one has a time stamp every ~90 s (between the hours of 0700 and 2000), the other has one or two time stamps per day (satellite overpass data). None of the time stamps are regular (i.e. the two series rarely share a time stamp, and the stamps are very rarely centered on the hour, half hour, etc.). My goal is to take my high-frequency data, average it in a window centered on each satellite time stamp (+/- 30 min), and then store the averaged data in a new data frame. Here is the actual code I have written so far:

# OMI is the satellite data, ~daily resolution
# Pan is surface data, with 90s resolution

# Example data: 
>>> pan.head()
                        hcho     h2o      so2      o3       no2
2010-06-24 14:01:20  0.87784  2.9947      NaN     NaN  0.671104
2010-06-24 14:03:52  0.68877  3.0102      NaN     NaN  0.684615
2010-06-24 14:04:35      NaN     NaN  0.58119  285.76       NaN
2010-06-24 14:05:19  0.75813  3.0218      NaN     NaN  0.693880
2010-06-24 14:06:02      NaN     NaN  0.40973  286.00       NaN

>>> omi.head()
                    ctp  dist           no2        no2std     cf  
2010-06-24 17:51:43    7  23.8  5.179200e+15  1.034600e+15  0.001   
2010-06-26 17:39:34    3   7.0  7.355800e+15  1.158100e+15  0.113   
2010-07-01 17:57:40    9   8.4  5.348300e+15  9.286100e+14  0.040   
2010-07-03 17:45:30    5  32.2  5.285300e+15  8.877800e+14  0.000   

# Code
import datetime as dt  # needed for dt.timedelta below
out = pd.DataFrame()

width = 30 # Defined earlier, input of function
for r in omi.index:
    # Define datetime limits
    d1 = r - dt.timedelta(minutes=width)
    d2 = r + dt.timedelta(minutes=width)
    tmp = pan.truncate(d1,d2).mean(axis=0,skipna=True)

    if tmp.nunique() != 0: # Ensuring there is something in <tmp>
        tmp = pd.DataFrame(tmp,index=[r],columns=pan.columns)
        out = pd.concat([out,tmp],axis=0,ignore_index=False)
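A hedged sketch of the windowed averaging described above, not a drop-in replacement for the code in the question. The `pan` and `omi` frames here are small synthetic stand-ins with made-up values and a regular 90 s spacing. The sketch assumes `pan` has a sorted DatetimeIndex, which is what makes label-based slicing with `.loc` valid for time stamps that are not themselves in the index.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the real data (values are invented for illustration).
pan_index = pd.date_range(
    "2010-06-24 07:00", "2010-06-24 20:00", freq=pd.Timedelta(seconds=90)
)
pan = pd.DataFrame(
    {
        "hcho": np.linspace(0.5, 1.0, len(pan_index)),
        "no2": np.linspace(0.6, 0.7, len(pan_index)),
    },
    index=pan_index,
)
omi = pd.DataFrame(
    {"no2": [5.1792e15, 7.3558e15]},
    index=pd.to_datetime(["2010-06-24 17:51:43", "2010-06-26 17:39:34"]),
)

width = pd.Timedelta(minutes=30)

rows = {}
for stamp in omi.index:
    # Slicing a sorted DatetimeIndex by label includes both endpoints.
    window = pan.loc[stamp - width : stamp + width]
    if not window.empty:  # skip overpasses with no surface data nearby
        rows[stamp] = window.mean(axis=0, skipna=True)

# One row per overpass that had data, indexed by the overpass time stamp.
out = pd.DataFrame.from_dict(rows, orient="index")
print(out)
```

With this synthetic input only the first overpass has surface data inside its window, so `out` ends up with a single row.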

You can easily construct a DataFrame from the Series like so:

c = pd.DataFrame(a.mean(axis=0), columns=['mean'])
c

Out[91]:
       mean
A -0.210582
B -0.742551
C  0.347408
D  0.276034
E  0.399468

Still, I don't see what this really achieves for you that is better than the Series that was originally returned.
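For completeness, a short sketch of the Series method that produces a one-column frame of the same shape, assuming the column should again be called 'mean'. The data is random, so the printed values will differ from those shown above.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.random.randn(5, 5), columns=list("ABCDE"))

# Series.to_frame wraps the Series in a one-column DataFrame;
# the name argument sets the column label.
c = a.mean(axis=0).to_frame(name="mean")
print(c)
```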
