Create multiple columns with Pandas .apply()

I have two pandas DataFrames, both containing the same categories but different 'id' columns. To illustrate, the first table looks like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': list(np.arange(1, 12)),
    'category': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c', 'c'],
    'weight': list(np.random.randint(1, 5, 11))
})

df['weight_sum'] = df.groupby('category')['weight'].transform('sum')
df['p'] = df['weight'] / df['weight_sum']

Output:

    id category  weight  weight_sum         p
0    1        a       4          14  0.285714
1    2        a       4          14  0.285714
2    3        a       2          14  0.142857
3    4        a       4          14  0.285714
4    5        b       4           8  0.500000
5    6        b       4           8  0.500000
6    7        c       3          15  0.200000
7    8        c       4          15  0.266667
8    9        c       2          15  0.133333
9   10        c       4          15  0.266667
10  11        c       2          15  0.133333

The second contains only 'id' and 'category'.

What I'm trying to do is create a third DataFrame that inherits the id of the second DataFrame, plus three new columns holding ids from the first DataFrame. Each of those ids should be sampled according to the p column, which represents the id's weight within its category.

I've tried multiple solutions and was thinking of combining np.random.choice with .apply(), but couldn't figure out a way to make that work.

EDIT:

The desired output would be something like:

   user_id  id_1  id_2  id_3
0        2     3     1     2
1        3     2     2     3
2        4     1     3     1

Each id is selected based on its probability within its respective category (both DataFrames have this column), and the same id should not show up more than once for the same user_id.

IIUC, you want to select random IDs of the same category with weighted probabilities. For this you can construct a helper DataFrame (dfg) that holds, per category, the list of ids and the list of their probabilities, and then use apply:

df2 = pd.DataFrame({
    'id': np.random.randint(1, 12, size=11),
    'category': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c', 'c']})

dfg = df.groupby('category').agg(list)

df3 = df2.join(df2['category']
               .apply(lambda r: pd.Series(np.random.choice(dfg.loc[r, 'id'],
                                                           p=dfg.loc[r, 'p'],
                                                           size=3)))
               .add_prefix('id_')
               )

Output:

    id category  id_0  id_1  id_2
0   11        a     2     3     3
1   10        a     2     3     1
2    4        a     1     2     3
3    7        a     2     1     4
4    5        b     6     5     5
5   10        b     6     5     6
6    8        c     9     8     8
7   11        c     7     8     7
8   11        c    10     8     8
9    4        c     9    10    10
10   1        c    11    11     9
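
Note that np.random.choice samples with replacement by default, which is why the same id can appear twice in a row above. If the "no repeats per user" requirement matters, one possible variation is to pass replace=False. The following is only a sketch under a few assumptions that are not in the original post: the weights are fixed illustrative values, the ids in df2 are made up, the helper name sample_ids is hypothetical, and categories with fewer than three ids (such as 'b') are padded with NaN because three distinct ids cannot be drawn from them.

```python
import numpy as np
import pandas as pd

# Fixed illustrative weights standing in for the random ones in the question.
df = pd.DataFrame({
    'id': np.arange(1, 12),
    'category': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c', 'c'],
    'weight': [4, 4, 2, 4, 4, 4, 3, 4, 2, 4, 2],
})
df['p'] = df['weight'] / df.groupby('category')['weight'].transform('sum')

# Hypothetical ids for the second DataFrame.
df2 = pd.DataFrame({
    'id': np.arange(101, 112),
    'category': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c', 'c'],
})

# One row per category; each cell holds the list of ids or probabilities.
dfg = df.groupby('category')[['id', 'p']].agg(list)

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable


def sample_ids(cat, size=3):
    """Draw up to `size` distinct ids from category `cat`, weighted by p."""
    ids = dfg.loc[cat, 'id']
    k = min(size, len(ids))  # a small category cannot fill every column
    picked = rng.choice(ids, size=k, replace=False, p=dfg.loc[cat, 'p'])
    # Pad with NaN so every row has the same number of entries.
    return list(picked) + [np.nan] * (size - k)


# apply returns one list per row; expand the lists into named columns.
sampled = pd.DataFrame(
    df2['category'].apply(sample_ids).tolist(),
    index=df2.index,
    columns=['id_1', 'id_2', 'id_3'],
)
df3 = df2.join(sampled)
print(df3)
```

Returning plain lists from apply and expanding them afterwards avoids building a pd.Series per row, which tends to be slow on larger frames. Because 'b' rows contain NaN, the id_3 column comes out as float; whether to pad, drop, or fall back to sampling with replacement for small categories is a design choice the question leaves open.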
