Pandas: Outer Join on Non-Unique Index
I have a DataFrame with a MultiIndex that looks like this:
>>> dfNew.head()
status shopping TUFNWGTP
state date
6 2003-01-03 emp 0 8155462.672158
2003-01-03 emp 0 8155462.672158
2003-01-03 emp 0 8155462.672158
2003-01-04 emp 0 1735322.527819
2003-01-04 emp 0 1735322.527819
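You can't see it here, but `status` can take three values: `emp`, `unemp`, and `NaN`. This is data at the state-date level. I want to join new state-date data that has a different frequency, and then aggregate/group over time.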
>>> test['foo'].head()
state date
1 2004-01-01 1985886
2 2004-01-01 301172
4 2004-01-01 2614525
5 2004-01-01 1180409
6 2004-01-01 16098932
Here is what I do:
dfNew = dfNew.join(test['foo'], method)
dfNew.reset_index(level=0, inplace=True)
doWhat = {'shopping' : np.sum, 'TUFNWGTP': np.sum, 'foo' : np.mean}
aggASS = dfNew.groupby(['state', pd.TimeGrouper("2AS", label='left'), 'status']).agg(doWhat)
This should bring in `foo` and create values on a two-year basis. But what I get is:
>>> aggASS.head()
foo shopping TUFNWGTP
state date status
1 2003-01-01 emp 2007116.941176 2.910812e+12 4.500711e+09
unemp NaN 7.836728e+11 5.590089e+08
2005-01-01 emp 2062059.100000 2.026485e+12 4.440291e+09
unemp 2078869.000000 7.543956e+10 2.638597e+08
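Observe how, for the same `state` and `date`, `foo` has a value for `status=emp` but none for `status=unemp`.
I thought `join` defaulted to `how='inner'` (the `DataFrame.join` default is in fact `how='left'`), so this seemed to be the problem. However, if I do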
>>> dfNew = dfNew.join(test['foo'], how='outer')
NotImplementedError: Index._join_level on non-unique index is not implemented
Yes, `state`-`date` is not unique here. But as far as I can tell, what I want still makes sense (doesn't it?). What is an efficient workaround here?
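One possible workaround, sketched here under assumptions and not taken from the thread: `pd.merge` on ordinary columns accepts duplicate keys, so moving the index levels into columns with `reset_index()` makes an outer join possible. The tiny `left` and `right` objects below are made-up stand-ins for `dfNew` and `test['foo']`.

```python
import pandas as pd

# Made-up stand-in for dfNew: (state, date) pairs repeat, so the index is non-unique.
left = pd.DataFrame(
    {
        "state": [1, 1, 1],
        "date": pd.to_datetime(["2004-01-01", "2004-01-01", "2004-01-02"]),
        "status": ["emp", "unemp", "emp"],
    }
).set_index(["state", "date"])

# Made-up stand-in for test['foo']: one value per (state, date).
right = pd.Series(
    [1985886, 1990082],
    index=pd.MultiIndex.from_arrays(
        [[1, 1], pd.to_datetime(["2004-01-01", "2004-02-01"])],
        names=["state", "date"],
    ),
    name="foo",
)

# Merge on columns instead of on the index; duplicate keys are allowed here.
merged = pd.merge(
    left.reset_index(),
    right.reset_index(),
    on=["state", "date"],
    how="outer",
)
print(merged)
```

Rows that exist on only one side are kept, with `NaN` in the columns that come from the other side.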
A suggested solution was to append them as columns:
After sorting by index level to align the data frames:
>>> dfNew.head()
status shopping TUFNWGTP
state date
1 2003-01-01 emp 0 3227364.873298
2003-01-01 NaN 0 6841114.725821
2003-01-01 NaN 0 6841114.725821
2003-01-01 NaN 0 6841114.725821
2003-01-01 NaN 0 6841114.725821
>>> test['foo'].head()
state date
1 2004-01-01 1985886
2004-02-01 1990082
2004-03-01 1999936
2004-04-01 2009556
2004-05-01 2009573
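Then we add the second time series with `dfNew.append(test['foo'])`. It was suggested that I use `ignore_index=True`, but I think we don't need it because the index labels are correct.
However, this crashes my Python instance. Here are the sizes of the data frames: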
>>> len(test['foo'])
6864
>>> len(dfNew)
404394
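Some paste of `dfNew`: http: (URL truncated in source)
Some paste of `test`: http://pastebin.com/Er70XD9y
Here are some steps I took. Hopefully this leads you toward a solution.
I recreated the multi-index DataFrame and the time series you provided: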
In [118]: newdf
Out[118]:
0 1 2
state date
1 2003-01-01 emp 0 3227364.873298
2003-01-01 NaN 0 6841114.725821
2003-01-01 NaN 0 6841114.725821
2003-01-01 NaN 0 6841114.725821
2003-01-01 NaN 0 6841114.725821
2003-01-01 NaN 0 6841114.725821
2003-01-02 NaN 0 5834127.649776
2003-01-02 NaN 0 5834127.649776
2003-01-04 emp 2100942000 1506051.861585
2003-01-04 emp 2100942000 1506051.861585
2003-01-04 emp 5412841000 1204191.605090
2003-01-04 emp 5412841000 1204191.605090
2003-01-04 emp 5412841000 1204191.605090
2003-01-05 NaN 0 1765953.711812
2003-01-05 NaN 0 1765953.711812
2003-01-05 emp 0 1434858.212964
2003-01-05 emp 0 1434858.212964
2003-01-05 emp 0 1434858.212964
2003-01-05 emp 0 1811326.258197
2003-01-05 emp 0 1811326.258197
2003-01-05 NaN 0 1908483.149300
2003-01-05 NaN 0 1908483.149300
2003-01-06 NaN 1298934000 4190110.086256
2003-01-07 NaN 0 6241047.457860
2003-01-07 NaN 0 6241047.457860
2003-01-07 NaN 0 6241047.457860
2003-01-07 NaN 0 6241047.457860
2003-01-08 emp 715231400 4614396.137509
2003-01-08 emp 715231400 4614396.137509
2003-01-08 emp 715231400 4614396.137509
2 2013-08-01 emp 0 10571046.129186
2013-08-01 emp 0 10571046.129186
2013-08-01 emp 0 10571046.129186
2013-08-01 emp 0 10571046.129186
2013-08-27 NaN 6804297000 3376822.385266
2013-08-27 NaN 6804297000 3376822.385266
2013-09-28 NaN 0 4645591.067481
2013-09-28 NaN 0 4645591.067481
2013-09-28 NaN 0 4645591.067481
2013-09-28 NaN 0 4645591.067481
2013-09-28 NaN 0 4645591.067481
2013-09-28 NaN 0 4645591.067481
2013-10-18 emp 0 14402621.620998
2013-10-18 emp 0 14402621.620998
2013-11-02 unemp 0 7778017.482167
2013-11-02 unemp 0 7778017.482167
2013-11-02 unemp 0 7778017.482167
2013-11-09 NaN 0 2164565.290873
2013-11-09 NaN 0 2164565.290873
2013-11-10 emp 527859500 1759531.507169
2013-11-10 emp 527859500 1759531.507169
2013-11-24 emp 0 3050339.003118
2013-11-24 emp 0 3050339.003118
2013-11-24 emp 0 3050339.003118
2013-11-29 NaN 0 11224606.711441
2013-11-29 NaN 0 11224606.711441
2013-12-12 emp 0 13804339.863606
2013-12-12 emp 0 13804339.863606
2013-12-12 emp 0 13804339.863606
2013-12-12 emp 0 13804339.863606
In [120]: newfoo
Out[120]:
foo
state date
1 2004-01-01 1985886
2004-02-01 1990082
2004-03-01 1999936
2004-04-01 2009556
2004-05-01 2009573
2004-06-01 2013057
2004-07-01 2019963
2004-08-01 2015320
2004-09-01 2015103
2004-10-01 2035705
2004-11-01 2043152
2004-12-01 2041339
2005-01-01 2011219
2005-02-01 2014928
2005-03-01 2028597
2 2013-10-01 340483
2013-11-01 338445
2013-12-01 336903
2014-01-01 334565
2014-02-01 334667
2014-03-01 335922
2014-04-01 337188
2014-05-01 343958
2014-06-01 349122
2014-07-01 354911
2014-08-01 350833
2014-09-01 344849
2014-10-01 341434
2014-11-01 339866
2014-12-01 339203
I flattened the DataFrame and the time series:
In [147]: flattenednewdf
Out[147]:
state date status shopping TUFNWGTP
0 1 2003-01-01 emp 0 3227364.873298
1 1 2003-01-01 NaN 0 6841114.725821
2 1 2003-01-01 NaN 0 6841114.725821
3 1 2003-01-01 NaN 0 6841114.725821
4 1 2003-01-01 NaN 0 6841114.725821
5 1 2003-01-01 NaN 0 6841114.725821
6 1 2003-01-02 NaN 0 5834127.649776
7 1 2003-01-02 NaN 0 5834127.649776
8 1 2003-01-04 emp 2100942000 1506051.861585
9 1 2003-01-04 emp 2100942000 1506051.861585
10 1 2003-01-04 emp 5412841000 1204191.605090
11 1 2003-01-04 emp 5412841000 1204191.605090
12 1 2003-01-04 emp 5412841000 1204191.605090
13 1 2003-01-05 NaN 0 1765953.711812
14 1 2003-01-05 NaN 0 1765953.711812
15 1 2003-01-05 emp 0 1434858.212964
16 1 2003-01-05 emp 0 1434858.212964
17 1 2003-01-05 emp 0 1434858.212964
18 1 2003-01-05 emp 0 1811326.258197
19 1 2003-01-05 emp 0 1811326.258197
20 1 2003-01-05 NaN 0 1908483.149300
21 1 2003-01-05 NaN 0 1908483.149300
22 1 2003-01-06 NaN 1298934000 4190110.086256
23 1 2003-01-07 NaN 0 6241047.457860
24 1 2003-01-07 NaN 0 6241047.457860
25 1 2003-01-07 NaN 0 6241047.457860
26 1 2003-01-07 NaN 0 6241047.457860
27 1 2003-01-08 emp 715231400 4614396.137509
28 1 2003-01-08 emp 715231400 4614396.137509
29 1 2003-01-08 emp 715231400 4614396.137509
30 2 2013-08-01 emp 0 10571046.129186
31 2 2013-08-01 emp 0 10571046.129186
32 2 2013-08-01 emp 0 10571046.129186
33 2 2013-08-01 emp 0 10571046.129186
34 2 2013-08-27 NaN 6804297000 3376822.385266
35 2 2013-08-27 NaN 6804297000 3376822.385266
36 2 2013-09-28 NaN 0 4645591.067481
37 2 2013-09-28 NaN 0 4645591.067481
38 2 2013-09-28 NaN 0 4645591.067481
39 2 2013-09-28 NaN 0 4645591.067481
40 2 2013-09-28 NaN 0 4645591.067481
41 2 2013-09-28 NaN 0 4645591.067481
42 2 2013-10-18 emp 0 14402621.620998
43 2 2013-10-18 emp 0 14402621.620998
44 2 2013-11-02 unemp 0 7778017.482167
45 2 2013-11-02 unemp 0 7778017.482167
46 2 2013-11-02 unemp 0 7778017.482167
47 2 2013-11-09 NaN 0 2164565.290873
48 2 2013-11-09 NaN 0 2164565.290873
49 2 2013-11-10 emp 527859500 1759531.507169
50 2 2013-11-10 emp 527859500 1759531.507169
51 2 2013-11-24 emp 0 3050339.003118
52 2 2013-11-24 emp 0 3050339.003118
53 2 2013-11-24 emp 0 3050339.003118
54 2 2013-11-29 NaN 0 11224606.711441
55 2 2013-11-29 NaN 0 11224606.711441
56 2 2013-12-12 emp 0 13804339.863606
57 2 2013-12-12 emp 0 13804339.863606
58 2 2013-12-12 emp 0 13804339.863606
59 2 2013-12-12 emp 0 13804339.863606
In [143]: flattenedfoo
Out[143]:
state date foo
0 1 2004-01-01 1985886
1 1 2004-02-01 1990082
2 1 2004-03-01 1999936
3 1 2004-04-01 2009556
4 1 2004-05-01 2009573
5 1 2004-06-01 2013057
6 1 2004-07-01 2019963
7 1 2004-08-01 2015320
8 1 2004-09-01 2015103
9 1 2004-10-01 2035705
10 1 2004-11-01 2043152
11 1 2004-12-01 2041339
12 1 2005-01-01 2011219
13 1 2005-02-01 2014928
14 1 2005-03-01 2028597
15 2 2013-10-01 340483
16 2 2013-11-01 338445
17 2 2013-12-01 336903
18 2 2014-01-01 334565
19 2 2014-02-01 334667
20 2 2014-03-01 335922
21 2 2014-04-01 337188
22 2 2014-05-01 343958
23 2 2014-06-01 349122
24 2 2014-07-01 354911
25 2 2014-08-01 350833
26 2 2014-09-01 344849
27 2 2014-10-01 341434
28 2 2014-11-01 339866
29 2 2014-12-01 339203
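The answer does not show the call used for flattening. A minimal sketch, assuming it was `reset_index()`, on a small made-up frame with the same `(state, date)` layout:

```python
import pandas as pd

# Small made-up frame with a (state, date) MultiIndex, mirroring the layout above.
idx = pd.MultiIndex.from_arrays(
    [[1, 1, 2], pd.to_datetime(["2003-01-01", "2003-01-01", "2013-08-01"])],
    names=["state", "date"],
)
newdf = pd.DataFrame(
    {
        "status": ["emp", None, "emp"],
        "shopping": [0, 0, 0],
        "TUFNWGTP": [3227364.873298, 6841114.725821, 10571046.129186],
    },
    index=idx,
)

# reset_index() moves both index levels into ordinary columns
# and replaces the index with a default 0..n-1 integer index.
flattenednewdf = newdf.reset_index()
print(flattenednewdf.columns.tolist())
```

Once `state` and `date` are plain columns, the duplicated index labels no longer get in the way of combining the two objects.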
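I appended the time series to the DataFrame. I left the row and column counts at the bottom so that, based on the example you provided, you can verify whether this is the right DataFrame size: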
In [149]: final_df
Out[149]:
TUFNWGTP date foo shopping state status
0 3227364.873298 2003-01-01 NaN 0 1 emp
1 6841114.725821 2003-01-01 NaN 0 1 NaN
2 6841114.725821 2003-01-01 NaN 0 1 NaN
3 6841114.725821 2003-01-01 NaN 0 1 NaN
4 6841114.725821 2003-01-01 NaN 0 1 NaN
5 6841114.725821 2003-01-01 NaN 0 1 NaN
6 5834127.649776 2003-01-02 NaN 0 1 NaN
7 5834127.649776 2003-01-02 NaN 0 1 NaN
8 1506051.861585 2003-01-04 NaN 2100942000 1 emp
9 1506051.861585 2003-01-04 NaN 2100942000 1 emp
10 1204191.605090 2003-01-04 NaN 5412841000 1 emp
11 1204191.605090 2003-01-04 NaN 5412841000 1 emp
12 1204191.605090 2003-01-04 NaN 5412841000 1 emp
13 1765953.711812 2003-01-05 NaN 0 1 NaN
14 1765953.711812 2003-01-05 NaN 0 1 NaN
15 1434858.212964 2003-01-05 NaN 0 1 emp
16 1434858.212964 2003-01-05 NaN 0 1 emp
17 1434858.212964 2003-01-05 NaN 0 1 emp
18 1811326.258197 2003-01-05 NaN 0 1 emp
19 1811326.258197 2003-01-05 NaN 0 1 emp
20 1908483.149300 2003-01-05 NaN 0 1 NaN
21 1908483.149300 2003-01-05 NaN 0 1 NaN
22 4190110.086256 2003-01-06 NaN 1298934000 1 NaN
23 6241047.457860 2003-01-07 NaN 0 1 NaN
24 6241047.457860 2003-01-07 NaN 0 1 NaN
25 6241047.457860 2003-01-07 NaN 0 1 NaN
26 6241047.457860 2003-01-07 NaN 0 1 NaN
27 4614396.137509 2003-01-08 NaN 715231400 1 emp
28 4614396.137509 2003-01-08 NaN 715231400 1 emp
29 4614396.137509 2003-01-08 NaN 715231400 1 emp
.. ... ... ... ... ... ...
0 NaN 2004-01-01 1985886 NaN 1 NaN
1 NaN 2004-02-01 1990082 NaN 1 NaN
2 NaN 2004-03-01 1999936 NaN 1 NaN
3 NaN 2004-04-01 2009556 NaN 1 NaN
4 NaN 2004-05-01 2009573 NaN 1 NaN
5 NaN 2004-06-01 2013057 NaN 1 NaN
6 NaN 2004-07-01 2019963 NaN 1 NaN
7 NaN 2004-08-01 2015320 NaN 1 NaN
8 NaN 2004-09-01 2015103 NaN 1 NaN
9 NaN 2004-10-01 2035705 NaN 1 NaN
10 NaN 2004-11-01 2043152 NaN 1 NaN
11 NaN 2004-12-01 2041339 NaN 1 NaN
12 NaN 2005-01-01 2011219 NaN 1 NaN
13 NaN 2005-02-01 2014928 NaN 1 NaN
14 NaN 2005-03-01 2028597 NaN 1 NaN
15 NaN 2013-10-01 340483 NaN 2 NaN
16 NaN 2013-11-01 338445 NaN 2 NaN
17 NaN 2013-12-01 336903 NaN 2 NaN
18 NaN 2014-01-01 334565 NaN 2 NaN
19 NaN 2014-02-01 334667 NaN 2 NaN
20 NaN 2014-03-01 335922 NaN 2 NaN
21 NaN 2014-04-01 337188 NaN 2 NaN
22 NaN 2014-05-01 343958 NaN 2 NaN
23 NaN 2014-06-01 349122 NaN 2 NaN
24 NaN 2014-07-01 354911 NaN 2 NaN
25 NaN 2014-08-01 350833 NaN 2 NaN
26 NaN 2014-09-01 344849 NaN 2 NaN
27 NaN 2014-10-01 341434 NaN 2 NaN
28 NaN 2014-11-01 339866 NaN 2 NaN
29 NaN 2014-12-01 339203 NaN 2 NaN
[90 rows x 6 columns]
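For readers on a recent pandas: `DataFrame.append` was removed in pandas 2.0, and `pd.concat` does the same row-wise stacking. A minimal sketch with made-up miniature versions of `flattenednewdf` and `flattenedfoo`:

```python
import pandas as pd

# Made-up miniature versions of the two flattened frames.
flattenednewdf = pd.DataFrame(
    {
        "state": [1, 1],
        "date": pd.to_datetime(["2003-01-01", "2003-01-02"]),
        "status": ["emp", None],
        "shopping": [0, 0],
        "TUFNWGTP": [3227364.873298, 5834127.649776],
    }
)
flattenedfoo = pd.DataFrame(
    {"state": [1], "date": pd.to_datetime(["2004-01-01"]), "foo": [1985886]}
)

# Stack the rows; columns missing from one frame are filled with NaN.
# sort=True orders the columns alphabetically (uppercase names first).
final_df = pd.concat([flattenednewdf, flattenedfoo], sort=True)
print(final_df.shape)
```

The original row labels are kept, so the index restarts at 0 for the second frame, just as in the output above. Passing `ignore_index=True` would renumber the rows instead.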
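Building time bins is new to me, but to use the method you provided I had to set the index back to the date column. I created a new DataFrame because much of this process was experimental and I did not want to rebuild the old one: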
final_df_2 = final_df.set_index(['date'])
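From this point you should be able to do whatever calculations you need. I ran some code below based on yours, but the problem is that we are grouping very selectively, so the results look strange: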
In [187]: doWhat = {'shopping' : np.sum, 'TUFNWGTP': np.sum, 'foo' : np.mean}
In [188]: aggASS = final_df_2.groupby([pd.TimeGrouper("2AS", label='left')]).agg(doWhat)
In [189]: aggASS
Out[189]:
foo shopping TUFNWGTP
date
2003-01-01 2014889.333333 23885035200 1.139995e+08
2005-01-01 2018248.000000 NaN NaN
2013-01-01 341489.933333 14664313000 2.237165e+08
In [190]: aggASS = final_df_2.groupby(['state', pd.TimeGrouper("2AS", label='left'), 'status']).agg(doWhat)
In [191]: aggASS
Out[191]:
foo shopping TUFNWGTP
state date status
1 2003-01-01 emp NaN 22586101200 3.162246e+07
2 2013-01-01 emp NaN 1055719000 1.389769e+08
unemp NaN 0 2.333405e+07
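A likely reason `foo` is `NaN` in the last result: the appended `foo` rows have `status = NaN`, and `groupby` drops rows whose key is `NaN` by default. The sketch below is my own illustration on made-up numbers, not part of the original answer. It aggregates `foo` without `status` and merges it back afterwards. It uses `pd.Grouper`, which replaces `pd.TimeGrouper` in current pandas, and the `YS` alias in place of `AS`, which newer releases deprecate.

```python
import pandas as pd

# Made-up appended frame: the survey rows carry 'status', the foo rows do not.
final_df = pd.DataFrame(
    {
        "state": [1, 1, 1, 1],
        "date": pd.to_datetime(
            ["2003-01-04", "2003-01-05", "2004-01-01", "2004-02-01"]
        ),
        "status": ["emp", "emp", None, None],
        "shopping": [10.0, 20.0, None, None],
        "foo": [None, None, 100.0, 200.0],
    }
)

# Two-year bins anchored at the start of the year, labelled by their left edge.
two_years = pd.Grouper(key="date", freq="2YS", label="left")

# Rows whose 'status' is NaN are dropped from the groups by default,
# which removes every row that carries a 'foo' value.
by_status = final_df.groupby(["state", two_years, "status"]).agg(
    {"shopping": "sum", "foo": "mean"}
)

# Aggregate 'foo' without 'status', then attach it to each status row.
foo_mean = final_df.groupby(["state", two_years])["foo"].mean()
combined = (
    by_status.drop(columns="foo")
    .reset_index()
    .merge(foo_mean.reset_index(), on=["state", "date"], how="left")
)
print(combined)
```

With this layout every `status` row in a given state and two-year bin gets the same `foo` mean, which seems to be what the question was after.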
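I read another post about binning with the `cut` method. You can read it here: grouping data by value range. I think you could use datetime object operations to build the 2-year buckets.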
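A minimal sketch of both ideas, assuming the buckets should start in 2003 (the `start_year` value and the column names `bucket_start` and `bucket_cut` are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2003-01-01", "2004-06-01", "2005-03-01", "2013-08-01", "2014-12-01"]
        )
    }
)

start_year = 2003  # assumed anchor year for the first bucket
year = df["date"].dt.year

# Datetime arithmetic: snap each year down to the first year of its 2-year bucket.
df["bucket_start"] = year - (year - start_year) % 2

# The same buckets with pd.cut: left-closed bins [2003, 2005), [2005, 2007), ...
edges = list(range(start_year, int(year.max()) + 3, 2))
df["bucket_cut"] = pd.cut(year, bins=edges, right=False)
print(df)
```

Either column can then be used as a `groupby` key next to `state` and `status`.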
Here is the relevant part of @kennes913's answer, just as an overview for future visitors:
import pandas as pd
from pandas.testing import assert_frame_equal

# flatten the data frames. For overview, just select one column each
df1flat = df.reset_index()[['state', 'date', 'TUFNWGTP']]
df2flat = df_emp.reset_index()[['state', 'date', 'foo']]
# the "merge" (pd.concat replaces DataFrame.append, which pandas 2.0 removed)
X = pd.concat([df1flat, df2flat])
# now, recover the original data frames:
test1 = X.loc[X['foo'].notna(), ['state', 'date', 'foo']]
# fix dtype which was lost in the merge
test1['state'] = test1['state'].astype(int)
# select on TUFNWGTP, the only df1flat value column kept above
test2 = X.loc[X['TUFNWGTP'].notna(), ['state', 'date', 'TUFNWGTP']]
# check if nothing was lost; `bar` and `foo` are presumably the flattened originals
print(assert_frame_equal(bar, test1))  # output: None
print(assert_frame_equal(foo, test2))  # output: None