
Pivot tables using pandas

I have the following dataframe:

df1= df[['rsa_units','regions','ssno','veteran','pos_off_ttl','occ_ser','grade','gender','ethnicity','age','age_category','service_time','type_appt','disabled','actn_dt','nat_actn_2_3','csc_auth_12','fy']]

This produces 1.4 million records. Here are the first 12:

Eastern Region (R9),Eastern Region (R9),123456789,Non Vet,LBRER,3502,3,Male,White,43.0,Older Gen X'ers,5.0,Temporary,,2009-05-18 00:00:00,115,BDN,2009
Northern Region (R1),Northern Region (R1),234567891,Non Vet,FRSTRY TECHNCN,0462,4,Male,White,37.0,Younger Gen X'ers,7.0,Temporary,,2007-05-27 00:00:00,115,BDN,2007
Northern Region (R1),Northern Region (R1),345678912,Non Vet,FRSTRY AID,0462,3,Male,White,33.0,Younger Gen X'ers,8.0,Temporary,,2006-06-05 00:00:00,115,BDN,2006
Northern Research Station (NRS),Research & Development(RES),456789123,Non Vet,FRSTRY TECHNCN,0462,7,Male,White,37.0,Younger Gen X'ers,10.0,Term,,2006-11-26 00:00:00,702,N6M,2007
Intermountain Region (R4),Intermountain Region (R4),5678912345,Non Vet,BIOLCL SCI TECHNCN,0404,5,Male,White,45.0,Older Gen X'ers,6.0,Temporary,,2008-05-18 00:00:00,115,BWA,2008
Intermountain Region (R4),Intermountain Region (R4),678912345,Non Vet,FRSTRY AID (FIRE),0462,3,Female,White,31.0,Younger Gen X'ers,5.0,Temporary,,2009-05-10 00:00:00,115,BDN,2009
Pacific Southwest Region (R5),Pacific Southwest Region (R5),789123456,Non Vet,FRSTRY AID (FIRE),0462,3,Male,White,31.0,Younger Gen X'ers,3.0,Temporary,,2012-05-06 00:00:00,115,NAM,2012
Pacific Southwest Region (R5),Pacific Southwest Region (R5),891234567,Non Vet,FRSTRY AID (FIRE),0462,3,Male,White,31.0,Younger Gen X'ers,3.0,Temporary,,2011-06-05 00:00:00,115,BDN,2011
Intermountain Region (R4),Intermountain Region (R4),912345678,Non Vet,FRSTRY TECHNCN,0462,5,Male,White,37.0,Younger Gen X'ers,11.0,Temporary,,2006-04-30 00:00:00,115,BDN,2006
Northern Region (R1),Northern Region (R1),987654321,Non Vet,FRSTRY TECHNCN,0462,4,Male,White,37.0,Younger Gen X'ers,11.0,Temporary,,2005-04-11 00:00:00,115,BDN,2005
Southwest Region (R3),Southwest Region (R3),876543219,Non Vet,FRSTRY TECHNCN (HOTSHOT/HANDCREW),0462,4,Male,White,30.0,Gen Y Millennial,4.0,Temporary,,2013-03-24 00:00:00,115,NAM,2013
Southwest Region (R3),Southwest Region (R3),765432198,Non Vet,FRSTRY TECHNCN (RECR),0462,4,Male,White,30.0,Gen Y Millennial,5.0,Temporary,,2010-11-21 00:00:00,115,BDN,2011

I then filter on ['nat_actn_2_3'] for certain hiring codes.

h1 = df1[df1['nat_actn_2_3'].isin(['100','101','108','170','171','115','130','140','141','190','702','703'])]
h2 = h1.sort('ssno')
h3 = h2.drop_duplicates(['ssno','actn_dt'])
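
For readers on a current pandas release, here is a minimal sketch of the same filter, sort, and de-duplicate steps. DataFrame.sort() was removed in later pandas versions, so sort_values() is used instead. The small df1 below is a made-up stand-in for the real data, and hiring_codes is just a name chosen for illustration.

```python
import pandas as pd

# Toy stand-in for df1; the values are invented for illustration only.
df1 = pd.DataFrame({
    "ssno": [3, 1, 1, 2, 2],
    "actn_dt": ["2009-05-18", "2007-05-27", "2007-05-27",
                "2006-06-05", "2008-01-01"],
    "nat_actn_2_3": ["115", "115", "115", "999", "702"],
})

# Same hiring codes as in the question, kept as strings.
hiring_codes = ["100", "101", "108", "170", "171", "115",
                "130", "140", "141", "190", "702", "703"]

h1 = df1[df1["nat_actn_2_3"].isin(hiring_codes)]   # keep hiring actions only
h2 = h1.sort_values("ssno")                        # replaces the old h1.sort('ssno')
h3 = h2.drop_duplicates(["ssno", "actn_dt"])       # one row per person per action date
print(h3)
```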

I can then look at value_counts() to see total hires by region.

total_newhires = h3['regions'].value_counts()
total_newhires

which produces:

Out[38]:
Pacific Southwest Region (R5)      42255
Pacific Northwest Region (R6)      32081
Intermountain Region (R4)          24045
Northern Region (R1)               22822
Rocky Mountain Region (R2)         17481
Southwest Region (R3)              17305
Eastern Region (R9)                11034
Research & Development(RES)         7337
Southern Region (R8)                7288
Albuquerque Service Center(ASC)     7032
Washington Office(WO)               4837
Alaska Region (R10)                 4210
Job Corps(JC)                       4010
nda                                  438

I'd like to do something like a pivot table in Excel, with ['regions'] as the rows and ['fy'] as the columns, giving a count of ['ssno'] per region for each ['fy']. Eventually it would also be nice to run calculations on those numbers, such as averages and sums.

Besides looking at the examples at http://pandas.pydata.org/pandas-docs/stable/reshaping.html, I've also tried:

hirestable = pivot_table(h3, values=['ethnicity', 'veteran'], rows=['regions'], cols=['fy'])

I'm wondering if groupby may be what I'm looking for?
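
As a rough sketch of the groupby route: grouping by region and fiscal year, taking the group sizes, and unstacking the year level gives a region-by-year count table. pd.crosstab is a shorter way to get the same counts. The h3 frame below is invented toy data with shortened region names, not the real dataset.

```python
import pandas as pd

# Toy stand-in for the de-duplicated hires frame h3 (made-up values).
h3 = pd.DataFrame({
    "regions": ["R1", "R1", "R1", "R4", "R4", "R5"],
    "fy": [2006, 2006, 2007, 2008, 2009, 2009],
    "ssno": [11, 12, 13, 14, 15, 16],
})

# Count rows per (region, fiscal year), then move fy into the columns.
by_groupby = h3.groupby(["regions", "fy"]).size().unstack(fill_value=0)

# crosstab computes the same frequency table directly.
by_crosstab = pd.crosstab(h3["regions"], h3["fy"])

print(by_groupby)
```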

Any help is appreciated. I've spent 3 days on this and can't seem to put it together.

So, based on the answer below, I did a pivot using the following code:

h3.pivot_table(values=['ssno'], rows=['nat_actn_2_3'], cols=['fy'], aggfunc=len)

This produced a fairly decent result. When I used 'ethnicity' or 'veteran' as the value, though, the results came out really strange and didn't match my value_counts() numbers. I'm not sure whether the pivot eliminates duplicates or something else is going on, but it did not come out correctly.

               ssno
fy             2005   2006   2007   2008   2009   2010   2011   2012   2013   2014   2015
nat_actn_2_3
100              34     20     25     18     38     43     45     14     19     25     10
101             510    453    725    795   1029   1293    957    383    470    605    145
108             170    132    112     85    123    127     84     43     40     29     10
115            9203   8972   7946   9038  10139  10480   9211   8735  10482  11258    339
130             299    313    431    324    291    325    336    202    230    436    112
140              62     74     71     75    132    125     82     42     45     74     18
141              20     16     23     17     20     14     10      9     13     17      7
170             202    433    226    278    336    386    284    265    121    118     49
171            4771   4627   4234   4196   4470   4472   3270   3145    354    341     34
190               1      1    NaN    NaN    NaN      1    NaN    NaN    NaN    NaN    NaN
702            3141   3099   3429   3030   3758   3952   3813   2902   2329   2375    650
703            2280   2354   2225   2050   2260   2328   2172   2503   2649   2856    726
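
One possible cause of the mismatch described above, offered only as a guess: counts can differ depending on whether missing values in the chosen values column are included. In the sketch below, on invented data, aggfunc=len counts every row in a cell, while aggfunc="count" counts only the non-missing entries. It uses the newer index/columns keywords rather than rows/cols.

```python
import numpy as np
import pandas as pd

# Made-up rows; one 'veteran' entry is deliberately missing.
h3 = pd.DataFrame({
    "nat_actn_2_3": ["115", "115", "115", "702"],
    "fy": [2009, 2009, 2009, 2009],
    "ssno": [1, 2, 3, 4],
    "veteran": ["Non Vet", np.nan, "Vet", "Non Vet"],
})

# len counts all rows in each cell, missing or not.
rows_per_cell = h3.pivot_table(values="veteran", index="nat_actn_2_3",
                               columns="fy", aggfunc=len)

# "count" counts only the non-missing 'veteran' values.
non_missing = h3.pivot_table(values="veteran", index="nat_actn_2_3",
                             columns="fy", aggfunc="count")

print(rows_per_cell)
print(non_missing)
```

Counting on a column that is never missing, such as 'ssno', sidesteps the difference.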

Try it like this:

h3.pivot_table(values=['ethnicity', 'veteran'], index=['regions'], columns=['fy'], aggfunc=len, fill_value=0)

To get counts, use aggfunc=len.
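
Here is a small self-contained sketch of that call on invented data, counting 'ssno' per region and fiscal year. Since the question also asks about averages, a second pivot takes the mean of 'service_time'. The toy values and shortened region names are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for h3 (made-up values).
h3 = pd.DataFrame({
    "regions": ["R1", "R1", "R1", "R4", "R4", "R5"],
    "fy": [2006, 2006, 2007, 2008, 2009, 2009],
    "ssno": [11, 12, 13, 14, 15, 16],
    "service_time": [8.0, 11.0, 7.0, 6.0, 5.0, 3.0],
})

# Number of hires per region and fiscal year; empty cells become 0.
hires = h3.pivot_table(values="ssno", index="regions", columns="fy",
                       aggfunc=len, fill_value=0)

# Average service time per region and fiscal year; empty cells stay NaN.
avg_service = h3.pivot_table(values="service_time", index="regions",
                             columns="fy", aggfunc="mean")

print(hires)
print(avg_service)
```

Swapping aggfunc="mean" for aggfunc="sum" gives totals instead of averages.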

Also, your isin references a list of strings, but the data you provided for the 'nat_actn_2_3' column are ints.
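
To illustrate the point: if the column really is stored as integers, isin with a list of strings matches nothing, because pandas treats 115 and '115' as different values. The sketch below uses a made-up Series and shows two ways to line the types up.

```python
import pandas as pd

# Hypothetical integer-typed code column.
codes = pd.Series([115, 702, 999], name="nat_actn_2_3")

as_strings = codes.isin(["115", "702"])            # strings vs ints: no matches
as_ints = codes.isin([115, 702])                   # compare ints with ints
via_cast = codes.astype(str).isin(["115", "702"])  # or cast the column to str first

print(as_strings.tolist())
print(as_ints.tolist())
print(via_cast.tolist())
```

Checking df1['nat_actn_2_3'].dtype shows which case applies to the real data.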

If you have an older version of pandas, use rows and cols instead of index and columns:

h3.pivot_table(values=['ethnicity', 'veteran'], rows=['regions'], cols=['fy'], aggfunc=len, fill_value=0)
