简体   繁体   中英

Simple cross-tabulation in pandas

I stumbled across pandas and it looks ideal for simple calculations that I'd like to do. I have a SAS background and was thinking it'd replace proc freq -- it looks like it'll scale to what I may want to do in the future. However, I just can't seem to get my head around a simple task (I'm not sure if I'm supposed to look at pivot/crosstab/indexing - whether I should have a Panel or DataFrames etc...). Could someone give me some pointers on how to do the following:

I have two CSV files (one for year 2010, one for year 2011 - simple transactional data) - The columns are category and amount

2010:

AB,100.00
AB,200.00
AC,150.00
AD,500.00

2011:

AB,500.00
AC,250.00
AX,900.00

These are loaded into separate DataFrame objects.

What I'd like to do is get the category, the sum of the category, and the frequency of the category, eg:

2010:

AB,300.00,2
AC,150.00,1
AD,500.00,1

2011:

AB,500.00,1
AC,250.00,1
AX,900.00,1

I can't work out whether I should be using pivot/crosstab/groupby/an index etc... I can get either the sum or the frequency - I can't seem to get both... It gets a bit more complex because I would like to do it on a month by month basis, but I think if someone would be so kind to point me to the right technique/direction I'll be able to go from there.

v0.21 answer

Use pivot_table with the index parameter:

df.pivot_table(index='category', aggfunc=[len, sum])

           len   sum
         value value
category            
AB           2   300
AC           1   150
AD           1   500

<= v0.12

It is possible to do this using pivot_table for those interested:

In [8]: df
Out[8]: 
  category  value
0       AB    100
1       AB    200
2       AC    150
3       AD    500

In [9]: df.pivot_table(rows='category', aggfunc=[len, np.sum])
Out[9]: 
            len    sum
          value  value
category              
AB            2    300
AC            1    150
AD            1    500

Note that the result's columns are hierarchically indexed. If you had multiple data columns, you would get a result like this:

In [12]: df
Out[12]: 
  category  value  value2
0       AB    100       5
1       AB    200       5
2       AC    150       5
3       AD    500       5

In [13]: df.pivot_table(rows='category', aggfunc=[len, np.sum])
Out[13]: 
            len            sum        
          value  value2  value  value2
category                              
AB            2       2    300      10
AC            1       1    150       5
AD            1       1    500       5

The main reason to use __builtin__.sum vs. np.sum is that you get NA-handling from the latter. Probably could intercept the Python built-in, will make a note about that now.

Assuming that you have a file called 2010.csv with contents

category,value
AB,100.00
AB,200.00
AC,150.00
AD,500.00

Then, using the ability to apply multiple aggregation functions following a groupby , you can say:

import pandas
data_2010 = pandas.read_csv("/path/to/2010.csv")
data_2010.groupby("category").agg([len, sum])

You should get a result that looks something like

          value     
            len  sum
category            
AB            2  300
AC            1  150
AD            1  500

Note that Wes will likely come by to point out that sum is optimized and that you should probably use np.sum.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM