Creating multiple columns with Pandas .apply()

Question

I have two pandas DataFrames, both containing the same categories but different 'id' columns. To illustrate, the first table looks like this:

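import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': list(np.arange(1, 12)),
    'category': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c', 'c'],
    'weight': list(np.random.randint(1, 5, 11))
})

# total weight of each category, broadcast back to every row
df['weight_sum'] = df.groupby('category')['weight'].transform('sum')
# share of each id within its category (sums to 1 per category)
df['p'] = df['weight'] / df['weight_sum']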

Output:

    id category  weight  weight_sum         p
0    1        a       4          14  0.285714
1    2        a       4          14  0.285714
2    3        a       2          14  0.142857
3    4        a       4          14  0.285714
4    5        b       4           8  0.500000
5    6        b       4           8  0.500000
6    7        c       3          15  0.200000
7    8        c       4          15  0.266667
8    9        c       2          15  0.133333
9   10        c       4          15  0.266667
10  11        c       2          15  0.133333

The second contains only 'id' and 'category'.

What I'm trying to do is create a third DataFrame that inherits the id of the second DataFrame, plus three new columns holding ids from the first DataFrame. Each of those ids should be selected at random according to the p column, which represents its weight within its category.

I've tried multiple solutions and was thinking of using np.random.choice with .apply(), but couldn't figure out a way to make that work.

EDIT:

The desired output would be something like:

   user_id  id_1  id_2  id_3
0        2     3     1     2
1        3     2     2     3
2        4     1     3     1

Each id is selected according to its probability within the respective category (both DataFrames have this column), and the same id should not show up more than once for the same user_id.
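For a single category, the "weighted, but no repeats" requirement could be sketched as below. The ids and weights are made-up values for illustration, and the use of NumPy's Generator API with a fixed seed is an assumption made only so the sketch is repeatable; it is not part of the original question.

```python
import numpy as np

# Hypothetical values for one category: four ids and their weights.
ids = np.array([1, 2, 3, 4])
weights = np.array([4, 4, 2, 4])
p = weights / weights.sum()  # probabilities must sum to 1

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# replace=False means an id that was already drawn cannot be drawn again
picked = rng.choice(ids, size=3, replace=False, p=p)
print(picked)
```

Sampling without replacement only works when the category has at least as many ids (with non-zero probability) as the number of draws requested.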


Answer 1

IIUC, you want to select random IDs of the same category with weighted probabilities. For this you can construct a helper DataFrame (dfg) and use apply:

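df2 = pd.DataFrame({
    'id': np.random.randint(1, 12, size=11),
    'category': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c', 'c']})

# one row per category; each cell holds the list of values for that category
dfg = df.groupby('category').agg(list)

# for each row of df2, draw 3 ids from its category, weighted by p;
# returning a pd.Series from apply expands the draws into separate columns
# note: np.random.choice samples with replacement by default,
# so the same id can appear more than once in a row
df3 = df2.join(df2['category']
               .apply(lambda r: pd.Series(np.random.choice(dfg.loc[r, 'id'],
                                                           p=dfg.loc[r, 'p'],
                                                           size=3)))
               .add_prefix('id_')
               )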

Output:

    id category  id_0  id_1  id_2
0   11        a     2     3     3
1   10        a     2     3     1
2    4        a     1     2     3
3    7        a     2     1     4
4    5        b     6     5     5
5   10        b     6     5     6
6    8        c     9     8     8
7   11        c     7     8     7
8   11        c    10     8     8
9    4        c     9    10    10
10   1        c    11    11     9
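The output above contains repeated ids within a row (for example 2, 3, 3 in row 0), because the draws are made with replacement. If repeats are not wanted, one possible variant is sketched below. It is not part of the original answer: the helper name sample_ids, the seeded Generator, and the choice to pad short categories with NaN are all assumptions. Category 'b' has only two ids, so three distinct draws are impossible there, and the sketch leaves the third column empty for those rows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded only to make the sketch repeatable

df = pd.DataFrame({
    'id': np.arange(1, 12),
    'category': list('aaaabbccccc'),
    'weight': rng.integers(1, 5, size=11),
})
df['p'] = df['weight'] / df.groupby('category')['weight'].transform('sum')

df2 = pd.DataFrame({
    'id': rng.integers(1, 12, size=11),
    'category': list('aaaabbccccc'),
})

# helper table: one row per category, cells hold lists of ids / probabilities
dfg = df.groupby('category')[['id', 'p']].agg(list)

def sample_ids(category, k=3):
    """Hypothetical helper: draw up to k distinct ids from one category."""
    ids = dfg.loc[category, 'id']
    p = dfg.loc[category, 'p']
    n = min(k, len(ids))  # cannot draw more distinct ids than exist
    picked = rng.choice(ids, size=n, replace=False, p=p)
    # reindex pads with NaN so every row yields exactly k columns
    return pd.Series(picked).reindex(range(k))

df3 = df2.join(df2['category'].apply(sample_ids).add_prefix('id_'))
print(df3)
```

Padding with NaN turns the affected columns into floats; whether that is acceptable, or whether small categories should instead fall back to sampling with replacement, depends on how the result is used.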
