I stumbled across pandas and it looks ideal for the simple calculations I'd like to do. Coming from a SAS background, I was thinking it could replace proc freq, and it looks like it will scale to what I may want to do in the future. However, I just can't get my head around a simple task (I'm not sure whether I should be looking at pivot/crosstab/indexing, whether I should have a Panel or DataFrames, etc.). Could someone give me some pointers on how to do the following?
I have two CSV files of simple transactional data (one for 2010, one for 2011). The columns are category and amount.
2010:
AB,100.00
AB,200.00
AC,150.00
AD,500.00
2011:
AB,500.00
AC,250.00
AX,900.00
These are loaded into separate DataFrame objects.
What I'd like to do is get the category, the sum for the category, and the frequency of the category, e.g.:
2010:
AB,300.00,2
AC,150.00,1
AD,500.00,1
2011:
AB,500.00,1
AC,250.00,1
AX,900.00,1
I can't work out whether I should be using pivot/crosstab/groupby/an index, etc. I can get either the sum or the frequency, but I can't seem to get both. It gets a bit more complex because I would also like to do this on a month-by-month basis, but I think if someone would be so kind as to point me toward the right technique, I'll be able to take it from there.
v0.21 answer
Use pivot_table with the index parameter:
df.pivot_table(index='category', aggfunc=[len, sum])
len sum
value value
category
AB 2 300
AC 1 150
AD 1 500
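For completeness, here is a minimal end-to-end sketch of that answer. The file name and column names are assumptions taken from the question, and the string aliases 'count'/'sum' are the spelling recent pandas versions prefer over the Python built-ins:

import pandas as pd

# The question's files have no header row, so supply the column names
df = pd.read_csv("2010.csv", header=None, names=["category", "value"])

# One row per category, with both the frequency and the sum of `value`
summary = df.pivot_table(index="category", aggfunc=["count", "sum"])
print(summary)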
<= v0.12
It is possible to do this using pivot_table, for those interested:
In [8]: df
Out[8]:
category value
0 AB 100
1 AB 200
2 AC 150
3 AD 500
In [9]: df.pivot_table(rows='category', aggfunc=[len, np.sum])
Out[9]:
len sum
value value
category
AB 2 300
AC 1 150
AD 1 500
Note that the result's columns are hierarchically indexed. If you had multiple data columns, you would get a result like this:
In [12]: df
Out[12]:
category value value2
0 AB 100 5
1 AB 200 5
2 AC 150 5
3 AD 500 5
In [13]: df.pivot_table(rows='category', aggfunc=[len, np.sum])
Out[13]:
len sum
value value2 value value2
category
AB 2 2 300 10
AC 1 1 150 5
AD 1 1 500 5
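If hierarchically indexed columns are unfamiliar, here is a small self-contained illustration of selecting pieces of such a result (written against the modern index= spelling rather than the old rows= one, with made-up data matching the example above):

import pandas as pd

df = pd.DataFrame({
    "category": ["AB", "AB", "AC", "AD"],
    "value":    [100, 200, 150, 500],
    "value2":   [5, 5, 5, 5],
})

result = df.pivot_table(index="category", aggfunc=["count", "sum"])

print(result["sum"])             # all the sums: columns value and value2
print(result[("sum", "value")])  # just the sums of `value`, as a Series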
The main reason to use np.sum vs. __builtin__.sum is that you get NA-handling from the former. The Python built-in could probably be intercepted as well; I will make a note about that now.
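A quick sketch of the difference, on made-up data, just to illustrate the NA behaviour:

import numpy as np
import pandas as pd

s = pd.Series([100.0, np.nan, 200.0])

# The Python built-in propagates the missing value...
print(sum(s))     # nan

# ...while np.sum defers to the Series method, which skips NAs
print(np.sum(s))  # 300.0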
Assuming that you have a file called 2010.csv with contents
category,value
AB,100.00
AB,200.00
AC,150.00
AD,500.00
Then, using the ability to apply multiple aggregation functions following a groupby, you can say:
import pandas
data_2010 = pandas.read_csv("/path/to/2010.csv")
data_2010.groupby("category").agg([len, sum])
You should get a result that looks something like:
value
len sum
category
AB 2 300
AC 1 150
AD 1 500
Note that Wes will likely come by to point out that sum is optimized and that you should probably use np.sum.
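To extend this to the month-by-month breakdown mentioned in the question, one approach (a sketch only, since it assumes a date column that the sample data doesn't actually show) is to parse the dates and group on the month as well:

import pandas as pd

# Assumption: the file carries a `date` column alongside category/value
df = pd.read_csv("2010.csv", parse_dates=["date"])

# Group on (month, category) and compute both aggregates at once
monthly = (
    df.groupby([df["date"].dt.month.rename("month"), "category"])["value"]
      .agg(["sum", "count"])
)
print(monthly)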