Counting column values based on values in other columns for Pandas dataframes
I'm trying to count the number of each category of storm for each unique x and y combination. For example, my dataframe looks like:
x y year Category
1 1 1988 3
2 1 1977 1
2 1 1999 2
3 2 1990 4
I want to create a dataframe that looks like:
x y Category 1 Category 2 Category 3 Category 4
1 1 0 0 1 0
2 1 1 1 0 0
3 2 0 0 0 1
I have tried various combinations of .groupby() and .count(), but I am still not getting the desired result. The closest thing I could get is:
df[['x','y','Category']].groupby(['Category']).count()
However, the result counts over all x and y values, not the unique pairs:
Cat x y
1 3773 3773
2 1230 1230
3 604 604
4 266 266
5 50 50
NA 27620 27620
TS 16884 16884
Does anyone know how to do a count operation on one column based on the uniqueness of two other columns in a dataframe?
pivot_table sounds like what you want. A bit of a hack is to add a column of 1's to use for counting. This allows pivot_table to add 1 for each occurrence of a particular x-y and Category combination. You will set this new column as the values parameter in pivot_table and the aggfunc parameter to np.sum. You'll probably want to set fill_value to 0 as well:
import numpy as np

# Helper column of ones; summing it per group yields the count.
df['count'] = 1
result = df.pivot_table(
    index=['x', 'y'], columns='Category', values='count',
    fill_value=0, aggfunc=np.sum
)
result:
Category 1 2 3 4
x y
1 1 0 0 1 0
2 1 1 1 0 0
3 2 0 0 0 1
If you're interested in keeping x and y as columns and having the other column names as Category X, you can rename the columns and use reset_index:
result.columns = [f'Category {x}' for x in result.columns]
result = result.reset_index()
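For reference, the steps above can be put together as one runnable sketch. The sample frame is copied from the question; the string 'sum' is used for aggfunc in place of np.sum, which is an assumption made here to avoid the NumPy import and should behave the same way.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'x': [1, 2, 2, 3],
    'y': [1, 1, 1, 2],
    'year': [1988, 1977, 1999, 1990],
    'Category': [3, 1, 2, 4],
})

# Helper column of ones; summing it per (x, y, Category) gives the count.
df['count'] = 1
result = df.pivot_table(
    index=['x', 'y'], columns='Category', values='count',
    fill_value=0, aggfunc='sum',  # assumption: 'sum' instead of np.sum
)

# Rename the category columns and turn x and y back into ordinary columns.
result.columns = [f'Category {c}' for c in result.columns]
result = result.reset_index()
print(result)
```

The final frame has x and y as regular columns followed by one count column per category.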
You can use pd.get_dummies after setting the index with set_index, then group on the index levels and sum to collapse the rows (older pandas versions allowed sum(level=...) for this; that argument was removed in pandas 2.0):
pd.get_dummies(df.set_index(['x','y'])['Category'].astype(str),
               prefix='Category ',
               prefix_sep='')\
  .groupby(level=[0,1]).sum()\
  .reset_index()
Output:
x y Category 1 Category 2 Category 3 Category 4
0 1 1 0 0 1 0
1 2 1 1 1 0 0
2 3 2 0 0 0 1
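Summing indicator columns per (x, y) pair amounts to a cross-tabulation, so a related option, not part of the answer above, is pd.crosstab. The following is only a sketch on the question's sample data; the variable name table is made up for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'x': [1, 2, 2, 3],
    'y': [1, 1, 1, 2],
    'year': [1988, 1977, 1999, 1990],
    'Category': [3, 1, 2, 4],
})

# Rows are the unique (x, y) pairs, columns are the categories,
# and each cell holds the number of matching rows.
table = pd.crosstab([df['x'], df['y']], df['Category'])

# Match the layout requested in the question.
table = table.add_prefix('Category ').reset_index()
table.columns.name = None  # drop the leftover 'Category' axis label
print(table)
```

Combinations that never occur are filled with 0 by crosstab itself, so no fill_value is needed here.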
Or use groupby twice, with a lot of additional steps, i.e. get_dummies with apply, etc. Like:
>>> dummies = (df.groupby(['x','y'], group_keys=False)['Category']
...              .apply(lambda s: s.astype(str).str.get_dummies().add_prefix('Category ')))
>>> (df.join(dummies)
...    .groupby(['x','y']).sum().fillna(0)
...    .drop(columns=['year','Category']).reset_index())
x y Category 1 Category 2 Category 3 Category 4
0 1 1 0.0 0.0 1.0 0.0
1 2 1 1.0 1.0 0.0 0.0
2 3 2 0.0 0.0 0.0 1.0
>>>
You can use groupby first:
df_new = df.groupby(['x', 'y', 'Category']).count()
df_new
year count
x y Category
1 1 3 1 1
2 1 1 1 1
2 1 1
3 2 4 1 1
Then use pivot_table:
df_new = df_new.pivot_table(index=['x', 'y'], columns='Category', values='count', fill_value=0)
df_new
Category 1 2 3 4
x y
1 1 0 0 1 0
2 1 1 1 0 0
3 2 0 0 0 1
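The same group-then-reshape idea can also be sketched without the helper count column, by taking the group sizes and unstacking the Category level. This is a variation on the answer above rather than the answerer's exact code, shown on the question's sample data; the name counts is made up for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'x': [1, 2, 2, 3],
    'y': [1, 1, 1, 2],
    'year': [1988, 1977, 1999, 1990],
    'Category': [3, 1, 2, 4],
})

# size() counts rows per (x, y, Category) group; unstack() then moves
# Category into the columns, filling absent combinations with 0.
counts = (
    df.groupby(['x', 'y', 'Category'])
      .size()
      .unstack('Category', fill_value=0)
)
print(counts)
```

Unlike count(), size() does not depend on which other columns are present or whether they contain missing values.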