
Randomly selecting k values from n columns of a dataframe for each row and storing them in k columns of the same dataframe

My dataframe consists of 1M records in the following format.

ID      SEGMENT group   CODE_1      CODE_2      CODE_3      CODE_4      CODE_5      CODE_6      CODE_7      CODE_8      CODE_9  CODE_10     
100006  History ML1     Offer_25    Offer_4     Offer_8     Offer_10    Offer_2     Offer_9     Offer_3     Offer_1     Offer_7 Offer_12
100007  History ML1     Offer_35    Offer_4     Offer_18    Offer_10    Offer_22    Offer_9     Offer_3     Offer_1     Offer_7 Offer_12
1000065 History ML1     Offer_5     Offer_40    Offer_8     Offer_1     Offer_21    Offer_9     Offer_3     Offer_1     Offer_7 Offer_13
10001   History ML1     Offer_5     Offer_41    Offer_18    Offer_15    Offer_2     Offer_19    Offer_3     Offer_11    Offer_7 Offer_12
900010  History ML1     Offer_15    Offer_4     Offer_18    Offer_10    Offer_20    Offer_19    Offer_3     Offer_6     Offer_7 Offer_12

I want to keep ID, SEGMENT, group, and CODE_1 to CODE_4 as they are, but replace the remaining columns with just two columns, CODE_5 and CODE_6. For each row, these two columns should hold two distinct values chosen at random from that row's values in CODE_5 to CODE_10.

The result would look like this:

ID      SEGMENT group   CODE_1      CODE_2      CODE_3      CODE_4      CODE_5      CODE_6      
100006  History ML1     Offer_25    Offer_4     Offer_8     Offer_10    Offer_1     Offer_12
100007  History ML1     Offer_35    Offer_4     Offer_18    Offer_10    Offer_7     Offer_9 
1000065 History ML1     Offer_5     Offer_40    Offer_8     Offer_1     Offer_13    Offer_3 
10001   History ML1     Offer_5     Offer_41    Offer_18    Offer_15    Offer_2     Offer_19
900010  History ML1     Offer_15    Offer_4     Offer_18    Offer_10    Offer_12    Offer_6 

I tried something like this, but it is too slow:

offers_cat = pd.DataFrame([], columns = ['Code_5','Code_6'])
recommend_df_test = recommend_df
number_of_offers = 6
variety_offers = 2
offer_range = number_of_offers - variety_offers
new_df = pd.DataFrame()
for index, row in recommend_df_test.iterrows():
    list_append = []
    lst_tmp =[]
    for i in range (offer_range+1,number_of_offers+5):
        offer_code = "CODE_"+str(i)
        list_append.append(row[offer_code])
    lst_tmp.append(np.random.choice(list_append,size=variety_offers,replace=False))
    df_tmp = pd.DataFrame(lst_tmp, columns=offers_cat.columns)
    df_tmp["ID"] = row["ID"]
    new_df = pd.concat([new_df,df_tmp])

This code gives me a new dataframe containing the ID and two offers per row, chosen at random from that row's values in CODE_5 to CODE_10.

Please help me improve the performance.

What you need is to apply a function row-wise to your data frame. Assume a data frame like this:

import pandas

df = pandas.DataFrame(
  [['a1', 'b1', 'c1'], ['a2', 'b2', 'c2'], ['a3', 'b3', 'c3']],
  columns=('A', 'B', 'C')
)

The data frame looks like this:

    A   B   C
0   a1  b1  c1
1   a2  b2  c2
2   a3  b3  c3

Now you want to replace column A (or create a new column, it doesn't matter) by randomly choosing one of the other columns' values on the same row. Here is how you do it:

import numpy as np
cols = ['B', 'C']
df.A = df.apply(
    lambda r: np.random.choice(r[cols]),
    axis=1
)

Here I have used apply to run a mapping function over the whole data frame. The axis=1 argument tells apply to call the function once per row. The lambda function takes the row r and passes the values of the columns of interest, cols=['B','C'], to numpy's random choice function. The result would be something like:

    A   B   C
0   b1  b1  c1
1   b2  b2  c2
2   c3  b3  c3
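The same apply pattern can be stretched to pick k distinct values per row, which is closer to what the question asks for. Below is a minimal sketch, assuming a toy frame with three candidate columns; the output names PICK_1 and PICK_2 are made up for illustration. It still calls Python once per row, so it may not be fast enough on 1M rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['a1', 'b1', 'c1', 'd1'], ['a2', 'b2', 'c2', 'd2'], ['a3', 'b3', 'c3', 'd3']],
    columns=['A', 'B', 'C', 'D'],
)

rng = np.random.default_rng(0)  # seeded only for repeatability
cols = ['B', 'C', 'D']          # candidate columns (assumed names)
k = 2                           # number of values to draw per row

# Draw k values per row without replacement; each row returns a list of
# length k, which result_type='expand' spreads over k columns.
picked = df[cols].apply(
    lambda r: rng.choice(r.to_numpy(), size=k, replace=False).tolist(),
    axis=1,
    result_type='expand',
)
picked.columns = ['PICK_1', 'PICK_2']  # hypothetical output names

result = pd.concat([df[['A']], picked], axis=1)
print(result)
```

Passing replace=False is what keeps the two picks in a row from coming from the same column.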

Here's what I would do:

# for repeatability
np.random.seed(1)

# sampling the column positions, 2 for each row
# (with replacement, so a row can get the same column twice)
a = np.random.choice(range(5), size=len(df)*2)

# sampling the values given the columns
# (the slice covers only the last 5 columns, CODE_6 to CODE_10)
new_values = df.iloc[:,-5:].values[np.repeat(range(len(df)),2), a].reshape(-1,2)

# creating new data:
pd.concat([df.iloc[:,:-5], 
           pd.DataFrame(new_values, columns=('Code_5', 'Code_6'))],
          axis=1)

Output:

         ID  SEGMENT  group      CODE_1    CODE_2    CODE_3    CODE_4    CODE_5    Code_5    Code_6
--  -------  -------  ---------  --------  --------  --------  --------  --------  --------  --------
 0   100006  History  ML1        Offer_25  Offer_4   Offer_8   Offer_10  Offer_2   Offer_7   Offer_12
 1   100007  History  ML1        Offer_35  Offer_4   Offer_18  Offer_10  Offer_22  Offer_9   Offer_3
 2  1000065  History  ML1        Offer_5   Offer_40  Offer_8   Offer_1   Offer_21  Offer_7   Offer_9
 3    10001  History  ML1        Offer_5   Offer_41  Offer_18  Offer_15  Offer_2   Offer_19  Offer_3
 4   900010  History  ML1        Offer_15  Offer_4   Offer_18  Offer_10  Offer_20  Offer_12  Offer_12
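Note that the last row of the output above contains Offer_12 twice, because the columns are sampled with replacement. If the two values must come from different columns, one vectorized option is to draw a matrix of random keys and take the positions of the k smallest keys in each row. The following is only a sketch under that assumption: the helper name sample_k_per_row and its parameters are made up for illustration, and the data is the first three rows from the question.

```python
import numpy as np
import pandas as pd

def sample_k_per_row(df, source_cols, k, new_cols, seed=None):
    """Hypothetical helper: pick k values per row from k different source columns."""
    rng = np.random.default_rng(seed)
    values = df[source_cols].to_numpy()
    # Sorting random keys row by row gives a random permutation of column
    # positions; the first k positions are a sample without replacement.
    idx = rng.random(values.shape).argsort(axis=1)[:, :k]
    picked = np.take_along_axis(values, idx, axis=1)
    kept = df.drop(columns=source_cols)
    new = pd.DataFrame(picked, columns=new_cols, index=df.index)
    return pd.concat([kept, new], axis=1)

columns = ['ID', 'SEGMENT', 'group'] + [f'CODE_{i}' for i in range(1, 11)]
df = pd.DataFrame(
    [
        [100006, 'History', 'ML1', 'Offer_25', 'Offer_4', 'Offer_8', 'Offer_10',
         'Offer_2', 'Offer_9', 'Offer_3', 'Offer_1', 'Offer_7', 'Offer_12'],
        [100007, 'History', 'ML1', 'Offer_35', 'Offer_4', 'Offer_18', 'Offer_10',
         'Offer_22', 'Offer_9', 'Offer_3', 'Offer_1', 'Offer_7', 'Offer_12'],
        [1000065, 'History', 'ML1', 'Offer_5', 'Offer_40', 'Offer_8', 'Offer_1',
         'Offer_21', 'Offer_9', 'Offer_3', 'Offer_1', 'Offer_7', 'Offer_13'],
    ],
    columns=columns,
)

source_cols = [f'CODE_{i}' for i in range(5, 11)]  # CODE_5 to CODE_10
out = sample_k_per_row(df, source_cols, k=2, new_cols=['CODE_5', 'CODE_6'], seed=1)
print(out)
```

This guarantees two different source columns per row. The values themselves are distinct only if the source columns hold no repeated value within a row, which is the case for the sample data.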
