pandas dataframe - remove values from a group with less than X rows

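I need to compute a standardized mean from a time series (monthly frequency), but I also need to exclude "incomplete" years (fewer than 12 months) from the calculation.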

NumPy / SciPy "working" version:

import numpy as np
import scipy.stats as sts

url='http://www.cpc.ncep.noaa.gov/data/indices/sstoi.indices'
npdata = np.genfromtxt(url, skip_header=1)
# Sort the years: iterating over a set does not guarantee chronological order.
unique_enso_year = sorted(int(value) for value in set(npdata[:, 0]))
nin34 = np.zeros(len(unique_enso_year))
for ind, year in enumerate(unique_enso_year):
    indexes = np.flatnonzero(npdata[:, 0]==year)
    if len(indexes) == 12:
        nin34[ind] = np.mean(npdata[indexes, 9])
    else:
        nin34[ind] = np.nan

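# scipy.stats.nanmean/nanstd were removed from later SciPy releases;
# np.nanstd with ddof=1 matches their unbiased default.
nin34x = (nin34 - np.nanmean(nin34)) / np.nanstd(nin34, ddof=1)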

array([[  1.02250000e+00,   5.15000000e-01,  -6.73333333e-01,
     -7.02500000e-01,   1.16666667e-01,   1.32916667e+00,
     -1.10333333e+00,  -8.11666667e-01,   1.51666667e-01,
      6.42500000e-01,   6.49166667e-01,   3.71666667e-01,
      4.05000000e-01,  -1.98333333e-01,  -4.79166667e-01,
      1.24666667e+00,  -1.44166667e-01,  -1.18166667e+00,
     -8.89166667e-01,  -2.51666667e-01,   7.36666667e-01,
      3.02500000e-01,   3.83333333e-01,   1.19166667e-01,
      1.70833333e-01,  -5.25000000e-01,  -7.35000000e-01,
      3.75000000e-01,  -4.50833333e-01,  -8.30000000e-01,
     -1.41666667e-02,              nan]])
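The standardization step has to skip the NaN that marks the incomplete year. A minimal sketch of that step on made-up annual means (the values below are illustrative, not taken from the NOAA file), assuming NumPy's NaN-aware functions stand in for the scipy.stats helpers used above:

```python
import numpy as np

# Made-up annual means; the last year is incomplete and marked NaN.
annual = np.array([1.0, 0.5, -0.7, -0.8, np.nan])

# NaN-aware mean and sample standard deviation (ddof=1, i.e. unbiased).
mean = np.nanmean(annual)
std = np.nanstd(annual, ddof=1)

# The NaN propagates, so the incomplete year stays NaN in the result.
standardized = (annual - mean) / std
print(standardized)
```

The incomplete year keeps its slot in the result as NaN, so the output stays aligned with the list of years.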

Pandas attempt:

import pandas as pd
from datetime import datetime

def parse(yr, mon):
    date = datetime(year=int(yr), day=2, month=int(mon))
    return date


url='http://www.cpc.ncep.noaa.gov/data/indices/sstoi.indices'
data = pd.read_table(url, sep=' ', header=0, skiprows=0, parse_dates = [['YR', 'MON']], skipinitialspace=True, index_col=0, date_parser=parse)                     
grouped = data.groupby(lambda x: x.year)

zscore = lambda x: (x - x.mean()) / x.std()
transformed = grouped.transform(zscore)
print(transformed['ANOM.3'])

YR_MON
1982-01-02   -0.986922
1982-02-02   -1.179216
1982-03-02   -1.179216
1982-04-02   -0.885119
1982-05-02   -0.376105
1982-06-02    0.087664
1982-07-02   -0.161188
1982-08-02    0.098975
1982-09-02    0.415695
1982-10-02    1.049134
1982-11-02    1.286674
1982-12-02    1.829622
1983-01-02    1.715072
1983-02-02    1.428598
1983-03-02    0.976272
...
2012-03-02   -0.999284
2012-04-02   -0.663736
2012-05-02   -0.063283
2012-06-02    0.572491
2012-07-02    0.961020
2012-08-02    1.314227
2012-09-02    0.925699
2012-10-02    0.537170
2012-11-02    0.660793
2012-12-02   -0.169245
2013-01-02   -1.001483
2013-02-02   -0.924445
2013-03-02    0.462223
2013-04-02    1.386668
2013-05-02    0.077037
Name: ANOM.3, Length: 377, dtype: float64

This is not what I want, because it also includes 2013 (which has only 5 months).

To extract what I want, I need to do something like:

(grouped.mean()['ANOM.3'][:-1] - sts.nanmean(grouped.mean()['ANOM.3'][:-1])) / sts.nanstd(grouped.mean()['ANOM.3'][:-1])

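But this assumes I already know that the last year is incomplete, and then I lose the np.NAN where I should have the 2013 value.

So now I am trying to do a query in pandas, like: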

grouped2 = data.groupby(lambda x: x.year).apply(lambda sdf: sdf if len(sdf) > 11 else None).reset_index(drop=True)

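This gives me the "right values", but it produces a new DataFrame "without the timestamp index". I'm sure there is a simple and beautiful way to do it. Thanks for any help!

I found this way: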

import pandas as pd

url='http://www.cpc.ncep.noaa.gov/data/indices/sstoi.indices'

ts_raw = pd.read_table(url, 
                        sep=' ', 
                        header=0, 
                        skiprows=0, 
                        parse_dates = [['YR', 'MON']], 
                        skipinitialspace=True, 
                        index_col=0, 
                        date_parser=parse)                     
ts_year_group = ts_raw.groupby(lambda x: x.year).apply(lambda sdf: sdf if len(sdf) > 11 else None) 
ts_range = pd.date_range(ts_year_group.index[0][1], 
                         ts_year_group.index[-1][1]+pd.DateOffset(months=1), 
                         freq="M")
ts = pd.DataFrame(ts_year_group.values, 
                  index=ts_range, 
                  columns=ts_year_group.keys())
ts_fullyears_group = ts.groupby(lambda x: x.year)
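# Use the full-years grouping built above (not the earlier `grouped`, which still contains 2013).
annual_mean = ts_fullyears_group.mean()['ANOM.3']
nin_anomalies = (annual_mean - sts.nanmean(annual_mean)) / sts.nanstd(annual_mean)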

nin_anomalies

1982    1.527215
1983    0.779877
1984   -0.970047
1985   -1.012997
1986    0.193297
1987    1.978809
1988   -1.603259
1989   -1.173755
1990    0.244837
1991    0.967632
1992    0.977449
1993    0.568807
1994    0.617893
1995   -0.270568
1996   -0.684120
1997    1.857320
1998   -0.190803
1999   -1.718612
2000   -1.287880
2001   -0.349106
2002    1.106301
2003    0.466953
2004    0.585987
2005    0.196978
2006    0.273062
2007   -0.751613
2008   -1.060856
2009    0.573715
2010   -0.642396
2011   -1.200752
2012    0.000633
Name: ANOM.3, dtype: float64

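I'm sure there is a better way to do the same thing :/

Here is a solution. It is a bit nasty in places because your dates fall on the 2nd of each month.

Starting with: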

In [205]: import pandas as pd

In [206]: from datetime import datetime

In [207]: from datetime import timedelta

In [208]: def parse(yr, mon):
   .....:         date = datetime(year=int(yr), day=2, month=int(mon))
   .....:         return date
   .....: 

In [209]: url='http://www.cpc.ncep.noaa.gov/data/indices/sstoi.indices'

In [210]: data = pd.read_table(url, sep=' ', header=0, skiprows=0, parse_dates = [['YR', 'MON']], skipinitialspace=True, index_col=0, date_parser=parse)                     

In [211]: grouped = data.groupby(lambda x: x.year)

Get the full years:

In [212]: full_year = grouped['NINO1+2'].count() == 12

In [213]: full_year
Out[213]: 
1982     True
1983     True
1984     True
1985     True
1986     True
1987     True
1988     True
1989     True
1990     True
1991     True
1992     True
1993     True
1994     True
1995     True
1996     True
1997     True
1998     True
1999     True
2000     True
2001     True
2002     True
2003     True
2004     True
2005     True
2006     True
2007     True
2008     True
2009     True
2010     True
2011     True
2012     True
2013    False
dtype: bool
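A per-year mask like this can also be applied directly to the annual means, which keeps a NaN entry for the incomplete year instead of dropping it. A minimal sketch on a synthetic monthly series (the data and the names `values`, `by_year`, `annual` are illustrative assumptions, not from the original post):

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: two full years plus 5 months of a third year.
idx = pd.date_range("2011-01-01", periods=29, freq="MS")
values = pd.Series(np.arange(29, dtype=float), index=idx)

by_year = values.groupby(values.index.year)
full_year = by_year.count() == 12          # boolean Series indexed by year
annual = by_year.mean().where(full_year)   # NaN for incomplete years

# Series.mean() and Series.std() skip NaN by default; std() uses ddof=1.
anomalies = (annual - annual.mean()) / annual.std()
print(anomalies)
```

Because both `full_year` and the annual means are indexed by year, `where` aligns them without any date arithmetic.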

Now we deal with getting the index into the right dtype and aligned with the data. This could probably be simplified a bit:

In [214]: strt = data.index[0] - timedelta(1)
In [215]: idx = pd.DatetimeIndex(start=strt, periods=len(full_year), freq='BA-JAN')

In [216]: idx = idx + timedelta(1)  # Get to 2nd of each month

In [232]: idx
Out[232]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[1982-01-02 00:00:00, ..., 2013-01-02 00:00:00]
Length: 32, Freq: None, Timezone: None

In [233]: full_year.index = idx

This is the key step:

In [234]: full_year = full_year.reindex_like(data, method='ffill')
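The forward fill broadcasts each year's True/False flag to every row of that year. The same broadcast can be sketched without rebuilding the index, assuming a pandas version where `transform('count')` is available on a grouped column; the frame below is synthetic and the names `rows_in_year` and `full_rows` are illustrative:

```python
import numpy as np
import pandas as pd

# Synthetic monthly frame: two full years plus 5 months of a third year.
idx = pd.date_range("2011-01-01", periods=29, freq="MS")
df = pd.DataFrame({"ANOM.3": np.linspace(-1.0, 1.0, 29)}, index=idx)

# transform('count') returns one value per row: the row count of that row's year.
rows_in_year = df.groupby(df.index.year)["ANOM.3"].transform("count")

# Boolean row mask aligned with df; the DatetimeIndex is preserved.
full_rows = df[rows_in_year == 12]
print(full_rows.tail())
```

Since the mask already shares the frame's index, no alignment on the day of month is needed.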

Hopefully this is right:

In [235]: data.ix[full_year].tail()
Out[235]: 
            NINO1+2  ANOM  NINO3  ANOM.1  NINO4  ANOM.2  NINO3.4  ANOM.3  \
YR_MON                                                                     
2012-08-02    20.99  0.35  25.72    0.73  29.10    0.42    27.55    0.73   
2012-09-02    20.83  0.49  25.28    0.43  29.12    0.43    27.24    0.51   
2012-10-02    20.68 -0.11  24.93    0.01  29.16    0.50    26.98    0.29   
2012-11-02    21.21 -0.38  25.11    0.14  29.17    0.54    27.01    0.36   
2012-12-02    22.13 -0.68  24.91   -0.23  28.71    0.23    26.46   -0.11   

            Unnamed: 10  
YR_MON                   
2012-08-02          NaN  
2012-09-02          NaN  
2012-10-02          NaN  
2012-11-02          NaN  
2012-12-02          NaN  

Just work with data.ix[full_year] and you should be good to go.
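Note that `.ix` has been removed from later pandas releases. As a possible alternative there, `groupby(...).filter` drops whole groups by a predicate while keeping the original timestamp index. A minimal sketch on synthetic data (the frame and the names `complete`, `annual`, `anomalies` are assumptions for illustration, not part of the answer above):

```python
import numpy as np
import pandas as pd

# Synthetic monthly frame: two full years plus 5 months of a third year.
idx = pd.date_range("2011-01-01", periods=29, freq="MS")
df = pd.DataFrame({"ANOM.3": np.linspace(-1.0, 1.0, 29)}, index=idx)

# Keep only the years that have all 12 monthly rows.
complete = df.groupby(df.index.year).filter(lambda g: len(g) == 12)

# Annual means of the complete years, then standardize across years.
annual = complete["ANOM.3"].groupby(complete.index.year).mean()
anomalies = (annual - annual.mean()) / annual.std()
print(anomalies)
```

Unlike masking the annual means, this variant removes the incomplete year entirely, so the result has no NaN placeholder for it.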
