How can I count the number of occurrences of a specific value for different unique IDs in Pandas?

I have a large dataset of over 10,000 entries. The dataset contains a unique ID, the year that an event occurred, and the size of that event. I want to count the number of events above and below a specific threshold value for each unique ID. However, for events below the threshold, I only want to count the event if it occurred after a certain year.

As an example, let's say I have the below data:

Unique ID, Year, Size  
111, 1980, 1  
111, 1992, 2  
111, 2000, 4  
222, 1990, 5  
222, 1994, 3  
333, 1999, 2  
333, 2011, 5  
333, 2012, 2  
333, 2016, 1 
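
To reproduce the example, the sample rows can be loaded into a DataFrame. This is only a minimal sketch: the variable name `df` is an assumption, chosen because it is the name the code further down operates on, and the real dataset would presumably be read from a file instead.

```python
import pandas as pd

# Sample rows copied from the table above; the name `df` is an assumption
df = pd.DataFrame({
    'Unique ID': [111, 111, 111, 222, 222, 333, 333, 333, 333],
    'Year': [1980, 1992, 2000, 1990, 1994, 1999, 2011, 2012, 2016],
    'Size': [1, 2, 4, 5, 3, 2, 5, 2, 1],
})
print(df)
```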

I want to count, for each unique ID, how many events are above size 3 and how many are at or below it. But I also only want to count events that are <=3 if they occurred after a specific year. For example, I only want to count events that occurred after 1980 for Unique ID 111, after 1992 for Unique ID 222, and after 2000 for Unique ID 333.

Based on the above example data, I would expect the following result:

Unique ID, <=3, >3

111, 1, 1    
222, 1, 1  
333, 2, 1 

Because each Unique ID has a different threshold year, create a dictionary for Series.map. The rows can then be filtered by boolean indexing, here using Series.lt for the less-than comparison:

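# Threshold year for each Unique ID
d = {111: 1980, 222: 1992, 333: 2000}
# Keep only rows whose Year is strictly after the mapped threshold year
df = df[df['Unique ID'].map(d).lt(df['Year'])]
print(df)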
   Unique ID  Year  Size
1        111  1992     2
2        111  2000     4
4        222  1994     3
6        333  2011     5
7        333  2012     2
8        333  2016     1
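
To see why those rows survive, it can help to look at the intermediate pieces of the filter. The sketch below rebuilds the sample data and shows the mapped threshold next to the resulting boolean mask. The names `threshold` and `mask` and the helper columns `Threshold` and `Keep` are made up for illustration only.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Unique ID': [111, 111, 111, 222, 222, 333, 333, 333, 333],
    'Year': [1980, 1992, 2000, 1990, 1994, 1999, 2011, 2012, 2016],
    'Size': [1, 2, 4, 5, 3, 2, 5, 2, 1],
})
d = {111: 1980, 222: 1992, 333: 2000}

# Look up each row's threshold year from its Unique ID
threshold = df['Unique ID'].map(d)
# True where the event happened strictly after that year
mask = threshold.lt(df['Year'])

# Illustrative helper columns, not part of the original answer
print(df.assign(Threshold=threshold, Keep=mask))
```

Note that an ID missing from the dictionary would map to NaN, the comparison would come out False, and its rows would be dropped without any warning.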

Then, for the counts, use crosstab with numpy.where:

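import numpy as np
import pandas as pd

# Label each row by size, then count the labels per Unique ID
df = pd.crosstab(df['Unique ID'], np.where(df['Size'].le(3), '<=3', '>3'))
print(df)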
col_0      <=3  >3
Unique ID         
111          1   1
222          1   0
333          2   1
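
Note that this output gives 0 in the >3 column for ID 222, while the question expects 1. The size 5 event from 1990 was removed by the year filter, even though the question applies the year condition only to events of size <=3. One possible adjustment, offered as a sketch rather than as the answerer's method, is to keep every large event and apply the year condition only to the small ones. The names `after_year`, `kept`, `labels` and `out` are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Unique ID': [111, 111, 111, 222, 222, 333, 333, 333, 333],
    'Year': [1980, 1992, 2000, 1990, 1994, 1999, 2011, 2012, 2016],
    'Size': [1, 2, 4, 5, 3, 2, 5, 2, 1],
})
d = {111: 1980, 222: 1992, 333: 2000}

# True where the event happened strictly after the per-ID threshold year
after_year = df['Unique ID'].map(d).lt(df['Year'])
# Events above size 3 always count; smaller ones only if after the year
kept = df[df['Size'].gt(3) | after_year]

# Label each remaining row by size and count the labels per Unique ID
labels = np.where(kept['Size'].le(3), '<=3', '>3')
out = pd.crosstab(kept['Unique ID'], labels)
print(out)
```

With the sample data this gives 1 and 1 for ID 111, 1 and 1 for ID 222, and 2 and 1 for ID 333, which matches the expected result in the question.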
