Given a dataframe with different categorical variables, how do I return a cross-tabulation with percentages instead of frequencies?
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
                   'B': ['A', 'B', 'C'] * 8,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
                   'D': np.random.randn(24),
                   'E': np.random.randn(24)})
pd.crosstab(df.A,df.B)
B      A  B  C
A
one    4  4  4
three  2  2  2
two    2  2  2
Using the margins option in crosstab to compute row and column totals gets close enough that it seems this should be possible with an aggfunc or groupby, but I can't work out how. I'd like to get something like:
B        A    B    C
A
one    .33  .33  .33
three  .33  .33  .33
two    .33  .33  .33
From pandas 0.18.1 onwards, there's a normalize option:
In [1]: pd.crosstab(df.A,df.B, normalize='index')
Out[1]:
B             A         B         C
A
one    0.333333  0.333333  0.333333
three  0.333333  0.333333  0.333333
two    0.333333  0.333333  0.333333
You can normalise across 'all' (the whole table), 'index' (each row), or 'columns' (each column). More details are available in the pandas documentation for crosstab.
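To make the three settings concrete, here is a minimal sketch that uses only columns A and B from the question's frame (the variable names by_row, by_col and overall are just illustrative) and checks which totals come out as 1 in each case:

```python
import pandas as pd

# Deterministic frame mirroring columns A and B of the question's data.
df = pd.DataFrame({
    "A": ["one", "one", "two", "three"] * 6,
    "B": ["A", "B", "C"] * 8,
})

by_row = pd.crosstab(df.A, df.B, normalize="index")    # each row sums to 1
by_col = pd.crosstab(df.A, df.B, normalize="columns")  # each column sums to 1
overall = pd.crosstab(df.A, df.B, normalize="all")     # whole table sums to 1

print(by_row.sum(axis=1))        # row totals
print(by_col.sum(axis=0))        # column totals
print(overall.to_numpy().sum())  # grand total
```

Which one you want depends on the question being asked: 'index' answers "within each A, how is B distributed?", while 'columns' answers the reverse.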
pd.crosstab(df.A, df.B).apply(lambda r: r/r.sum(), axis=1)
Basically you just have the function that does row/row.sum(), and you use apply with axis=1 to apply it by row.
(If doing this in Python 2, you should use from __future__ import division to make sure division always returns a float.)
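As a quick sanity check on the apply approach, the sketch below (Python 3, with a cut-down version of the question's frame; counts, row_shares and col_shares are illustrative names) compares it against normalize, and shows that switching to axis=0 should give column-wise shares instead:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["one", "one", "two", "three"] * 6,
    "B": ["A", "B", "C"] * 8,
})

counts = pd.crosstab(df.A, df.B)

# axis=1 hands each row to the lambda, so r.sum() is that row's total.
row_shares = counts.apply(lambda r: r / r.sum(), axis=1)

# axis=0 hands each column to the lambda, giving column-wise shares.
col_shares = counts.apply(lambda c: c / c.sum(), axis=0)

# Compare with the built-in normalize option.
ref_rows = pd.crosstab(df.A, df.B, normalize="index")
ref_cols = pd.crosstab(df.A, df.B, normalize="columns")
print(np.allclose(row_shares.to_numpy(), ref_rows.to_numpy()))
print(np.allclose(col_shares.to_numpy(), ref_cols.to_numpy()))
```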
We can show it as percentages by multiplying by 100:
pd.crosstab(df.A,df.B, normalize='index')\
.round(4)*100
B          A      B      C
A
one    33.33  33.33  33.33
three  33.33  33.33  33.33
two    33.33  33.33  33.33
The values are rounded for convenience.
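If the goal is display rather than further arithmetic, one possible follow-up is to turn the rounded numbers into strings with a percent sign. This is only a sketch: pct and labels are illustrative names, and the two-decimal format is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["one", "one", "two", "three"] * 6,
    "B": ["A", "B", "C"] * 8,
})

# Numeric percentages, still usable for calculations.
pct = (pd.crosstab(df.A, df.B, normalize="index") * 100).round(2)

# String labels for display only; Series.map applies the formatter per cell.
labels = pct.apply(lambda col: col.map("{:.2f}%".format))

print(pct)
print(labels)
```

Keeping the numeric table around is usually worthwhile, since the string version can no longer be summed or sorted numerically.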
If you want the percentage of the grand total instead, you can divide by the len of df rather than by the row sum:
pd.crosstab(df.A, df.B).apply(lambda r: r/len(df), axis=1)
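For comparison, here is a small sketch (illustrative variable names, trimmed-down data) showing that dividing by len(df) agrees with normalize='all' on this frame. One caveat, stated as an assumption rather than a guarantee: as far as I know, rows with a missing value in either key are left out of the crosstab counts, in which case len(df) would be too large and dividing by the table's own total would be the safer choice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["one", "one", "two", "three"] * 6,
    "B": ["A", "B", "C"] * 8,
})

counts = pd.crosstab(df.A, df.B)

share_len = counts / len(df)                 # divide by number of rows in df
share_sum = counts / counts.to_numpy().sum() # divide by the table's own total
share_norm = pd.crosstab(df.A, df.B, normalize="all")

print(np.allclose(share_len.to_numpy(), share_norm.to_numpy()))
print(np.allclose(share_sum.to_numpy(), share_norm.to_numpy()))
```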
Another option is to use div rather than apply:
In [11]: res = pd.crosstab(df.A, df.B)
Divide by the sum over the index:
In [12]: res.sum(axis=1)
Out[12]:
A
one      12
three     6
two       6
dtype: int64
Similar to above, you need to do something about integer division (I use astype('float')):
In [13]: res.astype('float').div(res.sum(axis=1), axis=0)
Out[13]:
B             A         B         C
A
one    0.333333  0.333333  0.333333
three  0.333333  0.333333  0.333333
two    0.333333  0.333333  0.333333
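The same div pattern should carry over to column shares by swapping the axes. In the sketch below (Python 3, illustrative names, cut-down data), the axis passed to sum picks which totals are computed, and the axis passed to div says which labels those totals are aligned with:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["one", "one", "two", "three"] * 6,
    "B": ["A", "B", "C"] * 8,
})

counts = pd.crosstab(df.A, df.B)

# Row shares: row totals (sum over axis=1) aligned with the index (axis=0).
row_shares = counts.div(counts.sum(axis=1), axis=0)

# Column shares: column totals (sum over axis=0) aligned with the columns (axis=1).
col_shares = counts.div(counts.sum(axis=0), axis=1)

print(row_shares)
print(col_shares)
```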
Normalizing over the index does this directly: pass normalize="index" to pd.crosstab().