pandas groupby and percentage of occurrences of each value of a column
I have a pandas DataFrame like this and want to create a column like created_column:
      iv_1  iv_2  iv_3  iv_4  iv_5 col2rplc  created_column
0        0     0     0     0     0        a               0
333      0     0     0     0     0        b               0
......
222      1     2     3     4     5       aa               1
324      1     2     3     4     5       cc               1
......
1234     1     0     0     0     1        a               1
1235     0     2     0     4     0        a               0
1236     0     0     3     0     0        a               0
1237     0     0     1     0     0        b               0
1238     0     2     0     2     0        b               0
1239     3     0     0     0     3        b               1
Explanation:
I want to create a column that has 1 in the rows whose iv_5 value occurs in less than or equal to 40% of the data. In the example above, that is the rows with the values 1, 3 and 5. How do I do this?
Second question:
How do I create other columns in the same way, using conditions such as less than x% and greater than y%?
Use GroupBy.transform to get the size of each group, divide it by the length of the DataFrame, and test the result with Series.le for less than or equal:
# share of rows per iv_5 value, flagged with 1 where the share is <= 40%
df['created_column'] = df.groupby('iv_5')['iv_5'].transform('size').div(len(df)).le(0.4).astype('int8')
print(df)
      iv_1  iv_2  iv_3  iv_4  iv_5 col2rplc  created_column
0        0     0     0     0     0        a               0
333      0     0     0     0     0        b               0
222      1     2     3     4     5       aa               1
324      1     2     3     4     5       cc               1
1234     1     0     0     0     1        a               1
1235     0     2     0     4     0        a               0
1236     0     0     3     0     0        a               0
1237     0     0     1     0     0        b               0
1238     0     2     0     2     0        b               0
1239     3     0     0     0     3        b               1
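To see why this works, it can help to look at the intermediate share for each row. The following is a minimal sketch, assuming only the iv_5 column of the ten rows printed above; the names group_size and share are introduced here for illustration and are not part of the original answer.

```python
import pandas as pd

# Assumed sample: the iv_5 column of the ten rows shown above.
df = pd.DataFrame(
    {"iv_5": [0, 0, 5, 5, 1, 0, 0, 0, 0, 3]},
    index=[0, 333, 222, 324, 1234, 1235, 1236, 1237, 1238, 1239],
)

# Number of rows that share each row's iv_5 value, aligned to the original index.
group_size = df.groupby("iv_5")["iv_5"].transform("size")

# Fraction of the whole frame that each value accounts for.
share = group_size / len(df)

# 1 where the value covers at most 40% of the rows, otherwise 0.
df["created_column"] = (share <= 0.4).astype("int8")

print(df.assign(share=share))
```

In this sample the value 0 covers 60% of the rows, 5 covers 20%, and 1 and 3 cover 10% each, so only the rows with 0 get a 0 in created_column.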
Or:
# relative frequency of each iv_5 value
s = df['iv_5'].value_counts(normalize=True)
# values whose frequency is <= 40%
idx = s.index[s <= 0.4]
df['created_column'] = df['iv_5'].isin(idx).astype('int8')
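If the per-row share is also needed with the value_counts approach, one option is to map the frequencies back onto the column instead of using isin. This is a small sketch under the same assumed ten-row sample; freq, share and flag are illustrative names.

```python
import pandas as pd

# Assumed sample: the iv_5 values from the rows shown above.
iv_5 = pd.Series([0, 0, 5, 5, 1, 0, 0, 0, 0, 3], name="iv_5")

# Relative frequency of each distinct value (index = value, data = share).
freq = iv_5.value_counts(normalize=True)

# Look up every row's value in the frequency table to get a per-row share.
share = iv_5.map(freq)

# Same test as before: 1 where the share is at most 40%.
flag = share.le(0.4).astype("int8")
print(flag.tolist())
```

The mapped share can then be reused for any further threshold without recomputing the counts.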
If you need a range, use Series.between. Both bounds are inclusive by default, which means >= and <=. For strict > and <, pass inclusive='neither' (inclusive=False in pandas versions before 1.3):
# 1 where the share of the iv_5 value is between 20% and 50%, both inclusive
df['created_column'] = df.groupby('iv_5')['iv_5'].transform('size').div(len(df)).between(0.2, 0.5).astype('int8')
print(df)
      iv_1  iv_2  iv_3  iv_4  iv_5 col2rplc  created_column
0        0     0     0     0     0        a               0
333      0     0     0     0     0        b               0
222      1     2     3     4     5       aa               1
324      1     2     3     4     5       cc               1
1234     1     0     0     0     1        a               0
1235     0     2     0     4     0        a               0
1236     0     0     3     0     0        a               0
1237     0     0     1     0     0        b               0
1238     0     2     0     2     0        b               0
1239     3     0     0     0     3        b               0
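The effect of the inclusive parameter is easiest to see on the boundary values. The sketch below assumes pandas 1.3 or newer, where inclusive accepts the strings 'both', 'neither', 'left' and 'right'; the share values are the ones from the ten-row sample, indexed by iv_5 value.

```python
import pandas as pd

# Assumed shares per iv_5 value from the sample: 0 -> 60%, 5 -> 20%, 1 -> 10%.
share = pd.Series([0.6, 0.2, 0.1], index=[0, 5, 1])

both = share.between(0.2, 0.6)                          # >= 0.2 and <= 0.6
neither = share.between(0.2, 0.6, inclusive="neither")  # >  0.2 and <  0.6
left = share.between(0.2, 0.6, inclusive="left")        # >= 0.2 and <  0.6
right = share.between(0.2, 0.6, inclusive="right")      # >  0.2 and <= 0.6

print(pd.DataFrame({"both": both, "neither": neither, "left": left, "right": right}))
```

Only the two boundary shares (0.2 and 0.6) change between the four variants; the 0.1 share falls outside the range in every case.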
If you need a mixed combination such as > and <=, between could not express it before pandas 1.3 (newer versions accept inclusive='right'). Here is an alternative that works in any version:
# share of rows per iv_5 value
s1 = df.groupby('iv_5')['iv_5'].transform('size').div(len(df))
# 1 where the share is > 20% and <= 60%
df['created_column'] = ((s1 > 0.2) & (s1 <= 0.6)).astype('int8')
print(df)
      iv_1  iv_2  iv_3  iv_4  iv_5 col2rplc  created_column
0        0     0     0     0     0        a               1
333      0     0     0     0     0        b               1
222      1     2     3     4     5       aa               0
324      1     2     3     4     5       cc               0
1234     1     0     0     0     1        a               0
1235     0     2     0     4     0        a               1
1236     0     0     3     0     0        a               1
1237     0     0     1     0     0        b               1
1238     0     2     0     2     0        b               1
1239     3     0     0     0     3        b               0
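For the second question, where several such columns are needed with different x% and y% limits, the pattern could be wrapped in a small function. The helper flag_by_share below is hypothetical (it is not a pandas function and not part of the original answer) and assumes pandas 1.3 or newer for the string values of inclusive.

```python
import pandas as pd


def flag_by_share(df, col, lower=None, upper=None, inclusive="both"):
    """Hypothetical helper: 1 where the share of df[col]'s value lies in the range."""
    # Share of the whole frame taken up by each row's value.
    share = df.groupby(col)[col].transform("size") / len(df)
    # A missing bound is treated as unbounded on that side.
    lo = float("-inf") if lower is None else lower
    hi = float("inf") if upper is None else upper
    return share.between(lo, hi, inclusive=inclusive).astype("int8")


# Assumed sample: the iv_5 values from the rows shown above.
df = pd.DataFrame({"iv_5": [0, 0, 5, 5, 1, 0, 0, 0, 0, 3]})

df["le_40"] = flag_by_share(df, "iv_5", upper=0.4)                     # share <= 40%
df["gt_20_le_60"] = flag_by_share(df, "iv_5", 0.2, 0.6, inclusive="right")  # 20% < share <= 60%
print(df)
```

Each new column then only needs its own bounds, and the grouping logic stays in one place.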