pandas groupby and the percentage of occurrences of each value in a column

Question

I have a pandas DataFrame like this and want to create a column like created_column:

       iv_1  iv_2  iv_3  iv_4  iv_5  col2rplc  created_column
0       0      0     0     0     0      a          0
333     0      0     0     0     0      b          0
      ......
222     1      2     3     4     5      aa         1
324     1      2     3     4     5      cc         1
      ......
1234    1      0     0     0     1      a          1
1235    0      2     0     4     0      a          0
1236    0      0     3     0     0      a          0
1237    0      0     1     0     0      b          0
1238    0      2     0     2     0      b          0
1239    3      0     0     0     3      b          1

Explanation:
I want to create a column that has 1 in every row whose iv_5 value occurs in less than or equal to 40% of the data. In the example above, that is the rows with iv_5 values 1, 3 and 5. How do I do this?

Second question:
How do I also apply conditions such as less than x% and greater than y% when creating other columns in the same way?

Answer 1

Use GroupBy.transform to get the size of each group, divide by the length of the DataFrame, and test with Series.le for less than or equal:

# astype(int) turns the boolean mask into 0/1 (Series.view is deprecated in recent pandas)
df['created_column'] = df.groupby('iv_5')['iv_5'].transform('size').div(len(df)).le(0.4).astype(int)
print(df)
      iv_1  iv_2  iv_3  iv_4  iv_5 col2rplc  created_column
0        0     0     0     0     0        a               0
333      0     0     0     0     0        b               0
222      1     2     3     4     5       aa               1
324      1     2     3     4     5       cc               1
1234     1     0     0     0     1        a               1
1235     0     2     0     4     0        a               0
1236     0     0     3     0     0        a               0
1237     0     0     1     0     0        b               0
1238     0     2     0     2     0        b               0
1239     3     0     0     0     3        b               1
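To try this end to end, here is a minimal sketch that rebuilds the sample rows from the question (the elided "......" rows are left out, so the shares are computed over these ten rows only). The intermediate name share is my own choice for illustration.

```python
import pandas as pd

# Sample rows copied from the question; the elided rows are not included.
df = pd.DataFrame(
    {
        "iv_1": [0, 0, 1, 1, 1, 0, 0, 0, 0, 3],
        "iv_2": [0, 0, 2, 2, 0, 2, 0, 0, 2, 0],
        "iv_3": [0, 0, 3, 3, 0, 0, 3, 1, 0, 0],
        "iv_4": [0, 0, 4, 4, 0, 4, 0, 0, 2, 0],
        "iv_5": [0, 0, 5, 5, 1, 0, 0, 0, 0, 3],
        "col2rplc": ["a", "b", "aa", "cc", "a", "a", "a", "b", "b", "b"],
    },
    index=[0, 333, 222, 324, 1234, 1235, 1236, 1237, 1238, 1239],
)

# Share of all rows that hold the same iv_5 value as this row
# (group size divided by the total number of rows).
share = df.groupby("iv_5")["iv_5"].transform("size").div(len(df))

# 1 where the value covers at most 40% of the rows, otherwise 0.
df["created_column"] = share.le(0.4).astype(int)

print(df.assign(share=share))
```

Printing the share next to the flag makes it easy to see why a row was marked: here the value 0 covers 60% of the rows, 5 covers 20%, and 1 and 3 cover 10% each.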

Or:

s = df['iv_5'].value_counts(normalize=True)
idx = s.index[s <= 0.4]

df['created_column'] = df['iv_5'].isin(idx).astype(int)
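As a quick check of the value_counts route, the sketch below runs it on just the iv_5 values from the sample rows. The names freq, rare_values and per_row_share are illustrative, and mapping the frequencies back onto the rows is an extra step I added to show the per-row share.

```python
import pandas as pd

# iv_5 values from the ten sample rows in the question.
iv_5 = pd.Series([0, 0, 5, 5, 1, 0, 0, 0, 0, 3], name="iv_5")

# Relative frequency of each distinct value (fractions that sum to 1).
freq = iv_5.value_counts(normalize=True)

# Distinct values whose frequency is at most 40%.
rare_values = freq.index[freq <= 0.4]

# 1 for rows holding one of those values, otherwise 0.
flag = iv_5.isin(rare_values).astype(int)

# Optional: look up each row's share by mapping its value to the frequency.
per_row_share = iv_5.map(freq)

print(pd.DataFrame({"iv_5": iv_5, "share": per_row_share, "flag": flag}))
```

This computes the frequencies once per distinct value instead of once per row, which can be handy when the same thresholds are reused for several columns.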

If you need a range, use Series.between. Both bounds are inclusive by default, meaning >= and <=. For strict > and <, pass inclusive='neither' (pandas 1.3 and later; older versions used inclusive=False):

# 0.2 <= share <= 0.5, converted from a boolean mask to 0/1
df['created_column'] = df.groupby('iv_5')['iv_5'].transform('size').div(len(df)).between(0.2, 0.5).astype(int)
print(df)

      iv_1  iv_2  iv_3  iv_4  iv_5 col2rplc  created_column
0        0     0     0     0     0        a               0
333      0     0     0     0     0        b               0
222      1     2     3     4     5       aa               1
324      1     2     3     4     5       cc               1
1234     1     0     0     0     1        a               0
1235     0     2     0     4     0        a               0
1236     0     0     3     0     0        a               0
1237     0     0     1     0     0        b               0
1238     0     2     0     2     0        b               0
1239     3     0     0     0     3        b               0
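To see how the bounds behave, this small sketch compares the default inclusive range with a strict one on the same sample values. It assumes pandas 1.3 or later, where inclusive accepts the strings "both", "neither", "left" and "right". The names closed and strict are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"iv_5": [0, 0, 5, 5, 1, 0, 0, 0, 0, 3]})

# Share of rows holding each row's iv_5 value.
share = df.groupby("iv_5")["iv_5"].transform("size").div(len(df))

# Default: both bounds included, so a share of exactly 0.2 matches.
closed = share.between(0.2, 0.5).astype(int)

# Strict bounds: a share of exactly 0.2 no longer matches.
strict = share.between(0.2, 0.5, inclusive="neither").astype(int)

print(pd.DataFrame({"iv_5": df["iv_5"], "share": share,
                    "closed": closed, "strict": strict}))
```

With these ten rows only the value 5 has a share of exactly 0.2, so it is flagged by the inclusive range and dropped by the strict one.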

If you need a mixed combination such as > and <=, chain the two comparisons instead (pandas 1.3 and later can also express this with between(0.2, 0.6, inclusive='right')):

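# share of rows holding each row's iv_5 value
s1 = df.groupby('iv_5')['iv_5'].transform('size').div(len(df))
# 0.2 < share <= 0.6, converted from a boolean mask to 0/1
df['created_column'] = ((s1 > 0.2) & (s1 <= 0.6)).astype(int)

print(df)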
      iv_1  iv_2  iv_3  iv_4  iv_5 col2rplc  created_column
0        0     0     0     0     0        a               1
333      0     0     0     0     0        b               1
222      1     2     3     4     5       aa               0
324      1     2     3     4     5       cc               0
1234     1     0     0     0     1        a               0
1235     0     2     0     4     0        a               1
1236     0     0     3     0     0        a               1
1237     0     0     1     0     0        b               1
1238     0     2     0     2     0        b               1
1239     3     0     0     0     3        b               0
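For the second question, where the x% and y% thresholds vary from column to column, one option is to wrap the steps in a small helper. This is only a sketch: flag_by_share and its parameters are hypothetical names, not part of pandas, and the string values for inclusive assume pandas 1.3 or later.

```python
import pandas as pd


def flag_by_share(values, lower=0.0, upper=1.0, inclusive="both"):
    """Return 1 where a value's share of all rows lies in the given range.

    Hypothetical helper for illustration. `lower` and `upper` are fractions
    (0.4 means 40%); `inclusive` is passed on to Series.between.
    """
    # Map every row to the relative frequency of the value it holds.
    share = values.map(values.value_counts(normalize=True))
    return share.between(lower, upper, inclusive=inclusive).astype(int)


iv_5 = pd.Series([0, 0, 5, 5, 1, 0, 0, 0, 0, 3], name="iv_5")

at_most_40 = flag_by_share(iv_5, upper=0.4)              # share <= 0.4
over_20_up_to_60 = flag_by_share(iv_5, 0.2, 0.6, "right")  # 0.2 < share <= 0.6

print(pd.DataFrame({"iv_5": iv_5, "at_most_40": at_most_40,
                    "over_20_up_to_60": over_20_up_to_60}))
```

Each new column then becomes a one-line call such as df['other_column'] = flag_by_share(df['iv_5'], 0.2, 0.6, 'right').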

Answer 1
1 accepted, 2020-11-20 06:17:10
