
Pandas: Outer Join on Non-Unique Index

I have a DataFrame with a MultiIndex, as shown below:

>>> dfNew.head()
                 status  shopping        TUFNWGTP
state date                                       
6     2003-01-03    emp         0  8155462.672158
      2003-01-03    emp         0  8155462.672158
      2003-01-03    emp         0  8155462.672158
      2003-01-04    emp         0  1735322.527819
      2003-01-04    emp         0  1735322.527819

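You can't see it here, but status can take three values: emp, unemp, and NaN. This is data at the state-date level. I want to join in new state-date data that has a different frequency, and then aggregate/group over time.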

>>> test['foo'].head()
state  date      
1      2004-01-01     1985886
2      2004-01-01      301172
4      2004-01-01     2614525
5      2004-01-01     1180409
6      2004-01-01    16098932

Join without specifying how

Here is what I do:

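# no how= argument is given, so join's default applies
dfNew = dfNew.join(test['foo'])
dfNew.reset_index(level=0, inplace=True)
doWhat = {'shopping' : np.sum, 'TUFNWGTP': np.sum, 'foo' : np.mean}
# pd.TimeGrouper is the API of the pandas version used here; newer versions use pd.Grouper(freq=...)
aggASS = dfNew.groupby(['state', pd.TimeGrouper("2AS", label='left'), 'status']).agg(doWhat)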

This should:

  • For each state-date combination, join in foo from the other data set and build values on a 2-year basis.

But what I get is:

>>> aggASS.head()
                                    foo      shopping      TUFNWGTP
state date       status                                            
1     2003-01-01 emp     2007116.941176  2.910812e+12  4.500711e+09
                 unemp              NaN  7.836728e+11  5.590089e+08
      2005-01-01 emp     2062059.100000  2.026485e+12  4.440291e+09
                 unemp   2078869.000000  7.543956e+10  2.638597e+08

Observe how, for the same state and date, foo has a value for status=emp but none for status=unemp.

Join with how='outer'

join defaults to how='left' rather than an outer join, so this seems to be the problem. However, if I do

>>> dfNew = dfNew.join(test['foo'], how='outer')
NotImplementedError: Index._join_level on non-unique index is not implemented

Yes, state-date is not unique here. But as far as I can tell, what I want still makes sense (doesn't it?). What would be a valid way to do this?
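One way to sidestep the index-level join, sketched here under assumptions: move the index levels into ordinary columns and use merge, which accepts repeated keys. The toy frames left and right below are hypothetical stand-ins for dfNew and test['foo'], not the real data.

```python
import pandas as pd

# Toy stand-in for dfNew: (state, date) pairs repeat, so the key is non-unique.
left = pd.DataFrame(
    {
        "state": [1, 1, 1, 1],
        "date": pd.to_datetime(
            ["2004-01-01", "2004-01-01", "2004-01-03", "2004-01-03"]
        ),
        "status": ["emp", "unemp", "emp", "unemp"],
        "shopping": [0, 5, 0, 7],
    }
)

# Toy stand-in for test['foo']: one row per state and month start.
right = pd.DataFrame(
    {
        "state": [1, 1],
        "date": pd.to_datetime(["2004-01-01", "2004-02-01"]),
        "foo": [1985886, 1990082],
    }
)

# Left merge: keeps every row of `left`; foo is NaN where the exact
# (state, date) pair does not occur in `right`.
left_joined = left.merge(right, on=["state", "date"], how="left")

# Outer merge: additionally keeps rows of `right` that have no partner,
# with NaN in the columns that only exist in `left`.
outer_joined = left.merge(right, on=["state", "date"], how="outer")

print(left_joined)
print(outer_joined)
```

In the left merge, only rows dated exactly on a month start pick up a foo value. That would be consistent with foo showing up for some status groups and not for others in the aggregate above.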

Suggested solution: append as columns

One suggested solution is to append them as columns:

Align the data frames using sortlevel:

>>> dfNew.head()
                 status  shopping        TUFNWGTP
state date                                       
1     2003-01-01    emp         0  3227364.873298
      2003-01-01    NaN         0  6841114.725821
      2003-01-01    NaN         0  6841114.725821
      2003-01-01    NaN         0  6841114.725821
      2003-01-01    NaN         0  6841114.725821
>>> test['foo'].head()
state  date      
1      2004-01-01    1985886
       2004-02-01    1990082
       2004-03-01    1999936
       2004-04-01    2009556
       2004-05-01    2009573

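Then we add the second time series with dfNew.append(test['foo']). It was suggested that I use ignore_index=True, but since the index labels are correct, I don't think we need it.

However, this crashes my Python instance. Here are the sizes of the data frames: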

>>> len(test['foo'])
6864
>>> len(dfNew)
404394

Here are some steps I took. Hopefully they point you toward a solution.

I recreated the MultiIndex data frame and the time series you provided:

In [118]: newdf
Out[118]: 
                      0           1                2
state date                                          
1     2003-01-01    emp           0   3227364.873298
      2003-01-01    NaN           0   6841114.725821
      2003-01-01    NaN           0   6841114.725821
      2003-01-01    NaN           0   6841114.725821
      2003-01-01    NaN           0   6841114.725821
      2003-01-01    NaN           0   6841114.725821
      2003-01-02    NaN           0   5834127.649776
      2003-01-02    NaN           0   5834127.649776
      2003-01-04    emp  2100942000   1506051.861585
      2003-01-04    emp  2100942000   1506051.861585
      2003-01-04    emp  5412841000   1204191.605090
      2003-01-04    emp  5412841000   1204191.605090
      2003-01-04    emp  5412841000   1204191.605090
      2003-01-05    NaN           0   1765953.711812
      2003-01-05    NaN           0   1765953.711812
      2003-01-05    emp           0   1434858.212964
      2003-01-05    emp           0   1434858.212964
      2003-01-05    emp           0   1434858.212964
      2003-01-05    emp           0   1811326.258197
      2003-01-05    emp           0   1811326.258197
      2003-01-05    NaN           0   1908483.149300
      2003-01-05    NaN           0   1908483.149300
      2003-01-06    NaN  1298934000   4190110.086256
      2003-01-07    NaN           0   6241047.457860
      2003-01-07    NaN           0   6241047.457860
      2003-01-07    NaN           0   6241047.457860
      2003-01-07    NaN           0   6241047.457860
      2003-01-08    emp   715231400   4614396.137509
      2003-01-08    emp   715231400   4614396.137509
      2003-01-08    emp   715231400   4614396.137509
2     2013-08-01    emp           0  10571046.129186
      2013-08-01    emp           0  10571046.129186
      2013-08-01    emp           0  10571046.129186
      2013-08-01    emp           0  10571046.129186
      2013-08-27    NaN  6804297000   3376822.385266
      2013-08-27    NaN  6804297000   3376822.385266
      2013-09-28    NaN           0   4645591.067481
      2013-09-28    NaN           0   4645591.067481
      2013-09-28    NaN           0   4645591.067481
      2013-09-28    NaN           0   4645591.067481
      2013-09-28    NaN           0   4645591.067481
      2013-09-28    NaN           0   4645591.067481
      2013-10-18    emp           0  14402621.620998
      2013-10-18    emp           0  14402621.620998
      2013-11-02  unemp           0   7778017.482167
      2013-11-02  unemp           0   7778017.482167
      2013-11-02  unemp           0   7778017.482167
      2013-11-09    NaN           0   2164565.290873
      2013-11-09    NaN           0   2164565.290873
      2013-11-10    emp   527859500   1759531.507169
      2013-11-10    emp   527859500   1759531.507169
      2013-11-24    emp           0   3050339.003118
      2013-11-24    emp           0   3050339.003118
      2013-11-24    emp           0   3050339.003118
      2013-11-29    NaN           0  11224606.711441
      2013-11-29    NaN           0  11224606.711441
      2013-12-12    emp           0  13804339.863606
      2013-12-12    emp           0  13804339.863606
      2013-12-12    emp           0  13804339.863606
      2013-12-12    emp           0  13804339.863606

In [120]: newfoo
Out[120]: 
                      foo
state date               
1     2004-01-01  1985886
      2004-02-01  1990082
      2004-03-01  1999936
      2004-04-01  2009556
      2004-05-01  2009573
      2004-06-01  2013057
      2004-07-01  2019963
      2004-08-01  2015320
      2004-09-01  2015103
      2004-10-01  2035705
      2004-11-01  2043152
      2004-12-01  2041339
      2005-01-01  2011219
      2005-02-01  2014928
      2005-03-01  2028597
2     2013-10-01   340483
      2013-11-01   338445
      2013-12-01   336903
      2014-01-01   334565
      2014-02-01   334667
      2014-03-01   335922
      2014-04-01   337188
      2014-05-01   343958
      2014-06-01   349122
      2014-07-01   354911
      2014-08-01   350833
      2014-09-01   344849
      2014-10-01   341434
      2014-11-01   339866
      2014-12-01   339203

I flattened the data frame and the time series:

In [147]: flattenednewdf
Out[147]: 
    state       date status    shopping         TUFNWGTP
0       1 2003-01-01    emp           0   3227364.873298
1       1 2003-01-01    NaN           0   6841114.725821
2       1 2003-01-01    NaN           0   6841114.725821
3       1 2003-01-01    NaN           0   6841114.725821
4       1 2003-01-01    NaN           0   6841114.725821
5       1 2003-01-01    NaN           0   6841114.725821
6       1 2003-01-02    NaN           0   5834127.649776
7       1 2003-01-02    NaN           0   5834127.649776
8       1 2003-01-04    emp  2100942000   1506051.861585
9       1 2003-01-04    emp  2100942000   1506051.861585
10      1 2003-01-04    emp  5412841000   1204191.605090
11      1 2003-01-04    emp  5412841000   1204191.605090
12      1 2003-01-04    emp  5412841000   1204191.605090
13      1 2003-01-05    NaN           0   1765953.711812
14      1 2003-01-05    NaN           0   1765953.711812
15      1 2003-01-05    emp           0   1434858.212964
16      1 2003-01-05    emp           0   1434858.212964
17      1 2003-01-05    emp           0   1434858.212964
18      1 2003-01-05    emp           0   1811326.258197
19      1 2003-01-05    emp           0   1811326.258197
20      1 2003-01-05    NaN           0   1908483.149300
21      1 2003-01-05    NaN           0   1908483.149300
22      1 2003-01-06    NaN  1298934000   4190110.086256
23      1 2003-01-07    NaN           0   6241047.457860
24      1 2003-01-07    NaN           0   6241047.457860
25      1 2003-01-07    NaN           0   6241047.457860
26      1 2003-01-07    NaN           0   6241047.457860
27      1 2003-01-08    emp   715231400   4614396.137509
28      1 2003-01-08    emp   715231400   4614396.137509
29      1 2003-01-08    emp   715231400   4614396.137509
30      2 2013-08-01    emp           0  10571046.129186
31      2 2013-08-01    emp           0  10571046.129186
32      2 2013-08-01    emp           0  10571046.129186
33      2 2013-08-01    emp           0  10571046.129186
34      2 2013-08-27    NaN  6804297000   3376822.385266
35      2 2013-08-27    NaN  6804297000   3376822.385266
36      2 2013-09-28    NaN           0   4645591.067481
37      2 2013-09-28    NaN           0   4645591.067481
38      2 2013-09-28    NaN           0   4645591.067481
39      2 2013-09-28    NaN           0   4645591.067481
40      2 2013-09-28    NaN           0   4645591.067481
41      2 2013-09-28    NaN           0   4645591.067481
42      2 2013-10-18    emp           0  14402621.620998
43      2 2013-10-18    emp           0  14402621.620998
44      2 2013-11-02  unemp           0   7778017.482167
45      2 2013-11-02  unemp           0   7778017.482167
46      2 2013-11-02  unemp           0   7778017.482167
47      2 2013-11-09    NaN           0   2164565.290873
48      2 2013-11-09    NaN           0   2164565.290873
49      2 2013-11-10    emp   527859500   1759531.507169
50      2 2013-11-10    emp   527859500   1759531.507169
51      2 2013-11-24    emp           0   3050339.003118
52      2 2013-11-24    emp           0   3050339.003118
53      2 2013-11-24    emp           0   3050339.003118
54      2 2013-11-29    NaN           0  11224606.711441
55      2 2013-11-29    NaN           0  11224606.711441
56      2 2013-12-12    emp           0  13804339.863606
57      2 2013-12-12    emp           0  13804339.863606
58      2 2013-12-12    emp           0  13804339.863606
59      2 2013-12-12    emp           0  13804339.863606


In [143]: flattenedfoo
Out[143]: 
    state       date      foo
0       1 2004-01-01  1985886
1       1 2004-02-01  1990082
2       1 2004-03-01  1999936
3       1 2004-04-01  2009556
4       1 2004-05-01  2009573
5       1 2004-06-01  2013057
6       1 2004-07-01  2019963
7       1 2004-08-01  2015320
8       1 2004-09-01  2015103
9       1 2004-10-01  2035705
10      1 2004-11-01  2043152
11      1 2004-12-01  2041339
12      1 2005-01-01  2011219
13      1 2005-02-01  2014928
14      1 2005-03-01  2028597
15      2 2013-10-01   340483
16      2 2013-11-01   338445
17      2 2013-12-01   336903
18      2 2014-01-01   334565
19      2 2014-02-01   334667
20      2 2014-03-01   335922
21      2 2014-04-01   337188
22      2 2014-05-01   343958
23      2 2014-06-01   349122
24      2 2014-07-01   354911
25      2 2014-08-01   350833
26      2 2014-09-01   344849
27      2 2014-10-01   341434
28      2 2014-11-01   339866
29      2 2014-12-01   339203
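The flattening command itself is not shown above. A minimal sketch, assuming it was done with reset_index (the toy frame newdf_toy is hypothetical):

```python
import pandas as pd

# Toy MultiIndex frame with a repeated (state, date) pair.
idx = pd.MultiIndex.from_tuples(
    [
        (1, pd.Timestamp("2003-01-01")),
        (1, pd.Timestamp("2003-01-01")),
        (2, pd.Timestamp("2013-08-01")),
    ],
    names=["state", "date"],
)
newdf_toy = pd.DataFrame(
    {"status": ["emp", None, "emp"], "shopping": [0, 0, 0]}, index=idx
)

# reset_index turns both index levels into ordinary columns and
# replaces the index with a plain 0..n-1 range.
flattened_toy = newdf_toy.reset_index()
print(flattened_toy.columns.tolist())
```

After this step, duplicate state-date pairs are just repeated values in two columns, so the non-unique index no longer gets in the way.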

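I appended the time series to the data frame. I left the row and column counts at the bottom so that, based on the example you provided, you can verify that the data frame has the right size: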

In [149]: final_df
Out[149]: 
          TUFNWGTP       date      foo    shopping  state status
0   3227364.873298 2003-01-01      NaN           0      1    emp
1   6841114.725821 2003-01-01      NaN           0      1    NaN
2   6841114.725821 2003-01-01      NaN           0      1    NaN
3   6841114.725821 2003-01-01      NaN           0      1    NaN
4   6841114.725821 2003-01-01      NaN           0      1    NaN
5   6841114.725821 2003-01-01      NaN           0      1    NaN
6   5834127.649776 2003-01-02      NaN           0      1    NaN
7   5834127.649776 2003-01-02      NaN           0      1    NaN
8   1506051.861585 2003-01-04      NaN  2100942000      1    emp
9   1506051.861585 2003-01-04      NaN  2100942000      1    emp
10  1204191.605090 2003-01-04      NaN  5412841000      1    emp
11  1204191.605090 2003-01-04      NaN  5412841000      1    emp
12  1204191.605090 2003-01-04      NaN  5412841000      1    emp
13  1765953.711812 2003-01-05      NaN           0      1    NaN
14  1765953.711812 2003-01-05      NaN           0      1    NaN
15  1434858.212964 2003-01-05      NaN           0      1    emp
16  1434858.212964 2003-01-05      NaN           0      1    emp
17  1434858.212964 2003-01-05      NaN           0      1    emp
18  1811326.258197 2003-01-05      NaN           0      1    emp
19  1811326.258197 2003-01-05      NaN           0      1    emp
20  1908483.149300 2003-01-05      NaN           0      1    NaN
21  1908483.149300 2003-01-05      NaN           0      1    NaN
22  4190110.086256 2003-01-06      NaN  1298934000      1    NaN
23  6241047.457860 2003-01-07      NaN           0      1    NaN
24  6241047.457860 2003-01-07      NaN           0      1    NaN
25  6241047.457860 2003-01-07      NaN           0      1    NaN
26  6241047.457860 2003-01-07      NaN           0      1    NaN
27  4614396.137509 2003-01-08      NaN   715231400      1    emp
28  4614396.137509 2003-01-08      NaN   715231400      1    emp
29  4614396.137509 2003-01-08      NaN   715231400      1    emp
..             ...        ...      ...         ...    ...    ...
0              NaN 2004-01-01  1985886         NaN      1    NaN
1              NaN 2004-02-01  1990082         NaN      1    NaN
2              NaN 2004-03-01  1999936         NaN      1    NaN
3              NaN 2004-04-01  2009556         NaN      1    NaN
4              NaN 2004-05-01  2009573         NaN      1    NaN
5              NaN 2004-06-01  2013057         NaN      1    NaN
6              NaN 2004-07-01  2019963         NaN      1    NaN
7              NaN 2004-08-01  2015320         NaN      1    NaN
8              NaN 2004-09-01  2015103         NaN      1    NaN
9              NaN 2004-10-01  2035705         NaN      1    NaN
10             NaN 2004-11-01  2043152         NaN      1    NaN
11             NaN 2004-12-01  2041339         NaN      1    NaN
12             NaN 2005-01-01  2011219         NaN      1    NaN
13             NaN 2005-02-01  2014928         NaN      1    NaN
14             NaN 2005-03-01  2028597         NaN      1    NaN
15             NaN 2013-10-01   340483         NaN      2    NaN
16             NaN 2013-11-01   338445         NaN      2    NaN
17             NaN 2013-12-01   336903         NaN      2    NaN
18             NaN 2014-01-01   334565         NaN      2    NaN
19             NaN 2014-02-01   334667         NaN      2    NaN
20             NaN 2014-03-01   335922         NaN      2    NaN
21             NaN 2014-04-01   337188         NaN      2    NaN
22             NaN 2014-05-01   343958         NaN      2    NaN
23             NaN 2014-06-01   349122         NaN      2    NaN
24             NaN 2014-07-01   354911         NaN      2    NaN
25             NaN 2014-08-01   350833         NaN      2    NaN
26             NaN 2014-09-01   344849         NaN      2    NaN
27             NaN 2014-10-01   341434         NaN      2    NaN
28             NaN 2014-11-01   339866         NaN      2    NaN
29             NaN 2014-12-01   339203         NaN      2    NaN

[90 rows x 6 columns]
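The append step can be sketched on toy data as follows. This uses pd.concat, which stacks rows the same way and is what newer pandas versions offer in place of DataFrame.append. The frames flat_main and flat_foo are hypothetical.

```python
import pandas as pd

flat_main = pd.DataFrame(
    {
        "state": [1, 1],
        "date": pd.to_datetime(["2003-01-01", "2003-01-02"]),
        "status": ["emp", None],
        "shopping": [0, 0],
        "TUFNWGTP": [3227364.87, 5834127.65],
    }
)
flat_foo = pd.DataFrame(
    {
        "state": [1],
        "date": pd.to_datetime(["2004-01-01"]),
        "foo": [1985886],
    }
)

# Rows are stacked, not matched: columns missing on one side are
# filled with NaN, and the original row labels are kept.
stacked = pd.concat([flat_main, flat_foo], sort=False)
print(stacked.shape)
print(stacked)
```

This is a row-wise stack rather than a join, so no row ends up carrying both foo and the survey columns. Row labels repeat, as in the output above, unless ignore_index=True is passed.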

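Building time bins is new to me, but to use the method you provided I had to set the index back to the date column. I created a new data frame because much of this process was experimental and I didn't want to rebuild the old one: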

final_df_2 = final_df.set_index(['date'])

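From this point you should be able to do whatever calculations you need. Below I ran some code based on yours, but because the grouping is so selective, the results look odd: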

In [187]: doWhat = {'shopping' : np.sum, 'TUFNWGTP': np.sum, 'foo' : np.mean}

In [188]: aggASS = final_df_2.groupby([pd.TimeGrouper("2AS", label='left')]).agg(doWhat)
In [189]: aggASS
Out[189]: 
                       foo     shopping      TUFNWGTP
date                                                 
2003-01-01  2014889.333333  23885035200  1.139995e+08
2005-01-01  2018248.000000          NaN           NaN
2013-01-01   341489.933333  14664313000  2.237165e+08

In [190]: aggASS = final_df_2.groupby(['state', pd.TimeGrouper("2AS", label='left'), 'status']).agg(doWhat)

In [191]: aggASS
Out[191]: 
                         foo     shopping      TUFNWGTP
state date       status                                
1     2003-01-01 emp     NaN  22586101200  3.162246e+07
2     2013-01-01 emp     NaN   1055719000  1.389769e+08
                 unemp   NaN            0  2.333405e+07
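A likely reason foo is NaN throughout the last result, sketched on hypothetical toy data: the appended foo rows have no status, and groupby drops rows whose group key is NaN by default. The sketch uses pd.Grouper, the replacement for pd.TimeGrouper in newer pandas, and the "2YS" alias as an assumed equivalent of "2AS".

```python
import pandas as pd

nan = float("nan")
stacked = pd.DataFrame(
    {
        "state": [1, 1, 1],
        "date": pd.to_datetime(["2003-01-04", "2003-01-08", "2004-01-01"]),
        "status": ["emp", "emp", None],  # the appended foo row has no status
        "shopping": [2100942000.0, 715231400.0, nan],
        "foo": [nan, nan, 1985886.0],
    }
).set_index("date")

what = {"shopping": "sum", "foo": "mean"}

# Grouping by status silently drops the row whose status is missing,
# so the only surviving rows are the ones where foo is NaN.
by_status = stacked.groupby(
    ["state", pd.Grouper(freq="2YS"), "status"]
).agg(what)

# Without status as a key, the foo row stays in its 2-year bin.
without_status = stacked.groupby(["state", pd.Grouper(freq="2YS")]).agg(what)

print(by_status)
print(without_status)
```

So the stacked frame by itself cannot give a foo value per status. foo would have to be attached to each status group separately, for example after aggregating.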

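I read another post about binning with the cut method; you can find it under "Grouping data by value ranges". I think you could use datetime object operations to build the 2-year buckets.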
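A rough sketch of that idea, under assumptions: compute the start year of each 2-year bucket with plain integer arithmetic on the date's year, aggregate both flattened frames per bucket, then merge the aggregates. BASE_YEAR, bucket_start and the toy frames are hypothetical names chosen for illustration.

```python
import pandas as pd

flat_main = pd.DataFrame(
    {
        "state": [1, 1, 1, 1],
        "date": pd.to_datetime(
            ["2003-01-04", "2004-06-10", "2004-06-10", "2005-03-02"]
        ),
        "status": ["emp", "emp", "unemp", "emp"],
        "shopping": [10, 20, 5, 40],
    }
)
flat_foo = pd.DataFrame(
    {
        "state": [1, 1, 1],
        "date": pd.to_datetime(["2004-01-01", "2004-02-01", "2005-01-01"]),
        "foo": [100.0, 200.0, 400.0],
    }
)

BASE_YEAR = 2003  # assumed first year of the first bucket


def bucket_start(dates: pd.Series, base_year: int = BASE_YEAR) -> pd.Series:
    """Return the first year of the 2-year bucket each date falls into."""
    years = dates.dt.year
    return years - (years - base_year) % 2


flat_main["bucket"] = bucket_start(flat_main["date"])
flat_foo["bucket"] = bucket_start(flat_foo["date"])

# Aggregate each source at its own granularity first ...
main_agg = flat_main.groupby(
    ["state", "bucket", "status"], as_index=False
)["shopping"].sum()
foo_agg = flat_foo.groupby(["state", "bucket"], as_index=False)["foo"].mean()

# ... then attach the per-bucket foo to every status row of that bucket.
result = main_agg.merge(foo_agg, on=["state", "bucket"], how="left")
print(result)
```

Because the merge happens after aggregation, the keys in foo_agg are unique, and every status within a state and bucket receives the same foo value.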

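Here is the relevant part of @kennes913's answer, as an overview for future visitors:

import pandas as pd
from pandas.testing import assert_frame_equal

# flatten the data frames. For overview, just select one column each
df1flat = df.reset_index()[['state', 'date', 'TUFNWGTP']]
df2flat = df_emp.reset_index()[['state', 'date', 'foo']]
# the "merge": stack the rows; columns missing on one side become NaN
# (pd.concat replaces DataFrame.append, which newer pandas no longer provides)
X = pd.concat([df1flat, df2flat])
# now, recover the original data frames:
test1 = X.loc[X['foo'].notna(), ['state', 'date', 'foo']].copy()
# fix dtype which was lost in the merge
test1['state'] = test1['state'].astype(int)

test2 = X.loc[X['TUFNWGTP'].notna(), ['state', 'date', 'TUFNWGTP']]
# check that nothing was lost; assert_frame_equal returns None on success.
# If a column was integer before the merge, NaN padding upcasts it to float,
# so check_dtype=False may be needed.
assert_frame_equal(df2flat, test1)
assert_frame_equal(df1flat, test2)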
