pandas: get count within each column based on different arithmetic condition
I have a data frame as shown below. I calculate percentiles based on the inputs provided. I'd like to get, for each column, the count of rows that match a certain condition: for example, the count where a1 > value1, similarly where a2 > value2, and so on for the other columns.
import pandas as pd
df = pd.DataFrame([[10,11,20],[580,11,20],
[500,11,20],
[110,111,420],[11,11,20],[80,91,90],
[80,91,'NA'],
[10,11,13],[0,14,1111],
[20,104,111],[220,314,1000],[200,30,2000],
[61,31,10],[516,71,20],[10,30,330]],
columns=['a1','a2','a3'])
Calculate and describe the columns of interest for the input percentiles, dropping NAs:
print( (df[["a1","a2","a3"]].dropna()).describe(percentiles =[0.90,0.91,
0.92,0.93,0.94,0.95,0.96,0.97,0.98,0.99] ))
I face certain issues:
Column a3 is removed from the output. How do I keep it from being thrown away, and instead just drop that row or ignore the NA?
I can get the count for each column like this:
print(len(df[(df['a1']>200) ]))
print(len(df[(df['a2']>100) ]))
However, this gets tricky and unreadable when the data frame has ~10 columns. How do I get the counts in a data frame manner, one per column, for a set of conditions (a1 > 100, a2 > 90, a3 > 56)?
Thank you.
If you compare with DataFrame.gt, passing a dictionary whose keys are all the column names and whose values are the thresholds, you get a boolean DataFrame. Then use sum to count the True values (each True is processed like 1):
df = df.apply(pd.to_numeric, errors='coerce')
s = df.gt({'a1': 100, 'a2': 90, 'a3': 56}).sum()
print (s)
a1 6
a2 5
a3 7
dtype: int64
Details:
print(df.gt({'a1': 100, 'a2': 90, 'a3': 56}))
a1 a2 a3
0 False False False
1 True False False
2 True False False
3 True True True
4 False False False
5 False True True
6 False True False
7 False False False
8 False False True
9 False True True
10 True True True
11 True False True
12 False False False
13 True False False
14 False False True
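As a minimal sketch of the same idea on a smaller frame: the thresholds can also be held in a Series whose index labels match the column names, and a NaN produced by the numeric conversion simply compares as False. The names toy, thresholds, mask and counts are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame; the string 'NA' stands in for a missing value stored as text.
toy = pd.DataFrame({'a1': [10, 580, 110], 'a3': [20, 'NA', 420]})

# Convert every column to numeric; anything unparseable ('NA') becomes NaN.
toy = toy.apply(pd.to_numeric, errors='coerce')

# Thresholds keyed by column name; gt aligns the Series index with the columns.
thresholds = pd.Series({'a1': 100, 'a3': 56})

mask = toy.gt(thresholds)   # boolean frame; NaN > 56 evaluates to False
counts = mask.sum()         # each True is summed as 1, giving one count per column
print(counts)
```

Keeping the thresholds in one place like this avoids writing a separate len(df[...]) line for each column.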
After the conversion to numeric, your solution works well for me if dropna is removed:
df = df.apply(pd.to_numeric, errors='coerce')
L = [0.90,0.91, 0.92,0.93,0.94,0.95,0.96,0.97,0.98,0.99]
print( df[["a1","a2","a3"]].describe(percentiles=L))
a1 a2 a3
count 15.000000 15.000000 14.000000
mean 160.533333 62.800000 370.357143
std 204.229166 79.165469 596.271054
min 0.000000 11.000000 10.000000
50% 80.000000 30.000000 55.000000
90% 509.600000 108.200000 1077.700000
91% 511.840000 109.180000 1092.130000
92% 514.080000 110.160000 1106.560000
93% 517.280000 115.060000 1191.010000
94% 526.240000 143.480000 1306.580000
95% 535.200000 171.900000 1422.150000
96% 544.160000 200.320000 1537.720000
97% 553.120000 228.740000 1653.290000
98% 562.080000 257.160000 1768.860000
99% 571.040000 285.580000 1884.430000
max 580.000000 314.000000 2000.000000
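A likely reason a3 disappeared in the question is that the string 'NA' turns the column into object dtype: dropna does not treat that string as missing, and describe on a mixed frame summarises only the numeric columns by default. A small sketch under that assumption (the frames raw and clean are hypothetical):

```python
import pandas as pd

# Hypothetical frame mirroring the question: one cell holds the text 'NA'.
raw = pd.DataFrame({'a1': [10, 80, 580], 'a3': [20, 'NA', 90]})

print(raw.dtypes)                        # a3 is object because of the string
print(raw.dropna().shape)                # no row is dropped: 'NA' is text, not NaN
print(raw.describe().columns.tolist())   # only the numeric column is summarised

# Coercing to numeric turns 'NA' into NaN, so a3 becomes a float column.
clean = raw.apply(pd.to_numeric, errors='coerce')
print(clean.describe().columns.tolist()) # both columns are summarised now
print(clean.describe().loc['count'])     # a3 is counted over non-missing values only
```

Because describe skips NaN column by column, no other column loses a row, whereas calling dropna after the conversion would discard the whole row for every column.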
EDIT1: If you need to compare against quantiles for a list of columns, use:
df = df.apply(pd.to_numeric, errors='coerce')
cols = ['a1','a2','a3']
print (df[cols].quantile(0.5))
a1 80.0
a2 30.0
a3 55.0
Name: 0.5, dtype: float64
print (df[cols].gt(df[cols].quantile(0.5)))
a1 a2 a3
0 False False False
1 True False False
2 True False False
3 True True True
4 False False False
5 False True True
6 False True False
7 False False False
8 False False True
9 False True True
10 True True True
11 True False True
12 False True False
13 True True False
14 False False True
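To turn the boolean frame above into per-column counts, sum can be chained onto the comparison just as with fixed thresholds. A minimal sketch on a hypothetical frame (toy, q and counts are illustrative names):

```python
import pandas as pd

# Hypothetical toy data with already-numeric columns.
toy = pd.DataFrame({'a1': [0, 10, 20, 30, 40], 'a2': [5, 5, 5, 5, 500]})
cols = ['a1', 'a2']

# quantile returns a Series indexed by column name: one threshold per column.
q = toy[cols].quantile(0.5)

# Compare each column with its own quantile, then count the True values.
counts = toy[cols].gt(q).sum()
print(counts)
```

Swapping 0.5 for another value such as 0.9 would give the number of rows strictly above that percentile in each column.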