For each unique value in a Pandas DataFrame column, how can I randomly select a proportion of rows?

Question

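Python newbie here. Imagine a csv file that looks like this:

(...except that in real life, there are 20 distinct names in the Person column, and each Person has 300-500 rows. Also, there are multiple data columns, not just one.)

What I want to do is randomly flag 10% of each Person's rows and mark them in a new column. I came up with a very convoluted way to do this: it involved creating a helper column of random numbers and all kinds of unnecessarily complicated jigsaw-puzzle work. It worked, but it was crazy. More recently, I came up with this: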

import pandas as pd

df = pd.read_csv('source.csv')
df['selected'] = ''

names = list(df['Person'].unique())  # gets list of unique names

for name in names:
    df_temp = df[df['Person'] == name]
    samp = int(len(df_temp) / 10)  # I want to sample 10% for each name
    df_temp = df_temp.sample(samp)
    df_temp['selected'] = 'bingo!'  # a new column to mark the rows I've randomly selected
    df = df.merge(df_temp, how='left', on=['Person', 'data'])
    df['temp'] = [f"{a} {b}" for a, b in zip(df['selected_x'], df['selected_y'])]
    # Note: initially instead of the line above, I tried the line below, but it didn't work too well:
    # df['temp'] = df['selected_x'] + df['selected_y']
    df = df[['Person', 'data', 'temp']]
    df = df.rename(columns={'temp': 'selected'})

df['selected'] = df['selected'].str.replace('nan', '').str.strip()  # cleans up the column

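As you can see, I'm essentially pulling out a temporary DataFrame for each person, using DF.sample(number) to do the randomizing, then using DF.merge to get the "marked" rows back into the original DataFrame. It also involves iterating over a list to create each temporary DataFrame ... and my understanding is that iterating is kind of lame.

There must be a more Pythonic, vectorized way to do this, right? Without iterating. Maybe something involving groupby? Any thoughts or advice are much appreciated.

EDIT: Here's another way that avoids merge ... but it's still pretty clunky: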

import math

import numpy as np
import pandas as pd

# SETUP TEST DATA:
y = ['Alex'] * 2321 + ['Doug'] * 34123 + ['Chuck'] * 2012 + ['Bob'] * 9281
z = ['xyz'] * len(y)
df = pd.DataFrame({'persons': y, 'data': z})
df = df.sample(frac=1)  # shuffle (optional--just to show order doesn't matter)
percent = 10  # CHANGE AS NEEDED

# Add a 'helper' column with random numbers
df['rand'] = np.random.random(df.shape[0])
df = df.sample(frac=1)  # this shuffles data, just to show order doesn't matter

# CREATE A HELPER LIST of [name, row count]; the loop appends each name's cutoff value
helper = df.groupby('persons')['rand'].count().reset_index().values.tolist()
for row in helper:
    df_temp = df[df['persons'] == row[0]][['persons', 'rand']]
    lim = math.ceil(len(df_temp) * percent * 0.01)
    row.append(df_temp.nlargest(lim, 'rand').iloc[-1]['rand'])

def flag(name, num):
    for row in helper:
        if row[0] == name:
            if num >= row[2]:
                return 'yes'
            else:
                return 'no'

df['flag'] = df.apply(lambda x: flag(x['persons'], x['rand']), axis=1)

Answer 1

If I understand correctly, you can achieve this with the following:

import pandas as pd

df = pd.DataFrame(data={'persons': ['A'] * 10 + ['B'] * 10, 'col_1': [2] * 20})
percentage_to_flag = 0.5
# transform keeps the original row index, so the flags line up with df's rows
a = df.groupby('persons')['col_1'].transform(
    lambda x: x.index.isin(x.sample(frac=percentage_to_flag, replace=False).index)
).astype(bool)
df['flagged'] = a

Input:

       persons  col_1
    0        A      2
    1        A      2
    2        A      2
    3        A      2
    4        A      2
    5        A      2
    6        A      2
    7        A      2
    8        A      2
    9        A      2
    10       B      2
    11       B      2
    12       B      2
    13       B      2
    14       B      2
    15       B      2
    16       B      2
    17       B      2
    18       B      2
    19       B      2

Output with 50% flagged rows in each group:

   persons  col_1  flagged
0        A      2     True
1        A      2     True
2        A      2     True
3        A      2     True
4        A      2    False
5        A      2    False
6        A      2    False
7        A      2    False
8        A      2    False
9        A      2     True
10       B      2    False
11       B      2     True
12       B      2     True
13       B      2    False
14       B      2     True
15       B      2     True
16       B      2    False
17       B      2     True
18       B      2    False
19       B      2    False

Answer 2

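You can use groupby.sample either to pick out a sample of the whole dataframe for further processing, or to identify the rows of the dataframe to mark, whichever is more convenient.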

import pandas as pd

percentage_to_flag = 0.5

# Toy data: 8 rows, persons A and B.
df = pd.DataFrame(data={'persons':['A']*4 + ['B']*4, 'data':range(8)})
#   persons  data
# 0       A     0
# 1       A     1
# 2       A     2
# 3       A     3
# 4       B     4
# 5       B     5
# 6       B     6
# 7       B     7

# Pick out random sample of dataframe.
random_state = 41  # Change to get different random values.
df_sample = df.groupby("persons").sample(frac=percentage_to_flag,
                                         random_state=random_state)
#   persons  data
# 1       A     1
# 2       A     2
# 7       B     7
# 6       B     6

# Mark the random sample in the original dataframe.
df["marked"] = False
df.loc[df_sample.index, "marked"] = True
#   persons  data  marked
# 0       A     0   False
# 1       A     1    True
# 2       A     2    True
# 3       A     3   False
# 4       B     4   False
# 5       B     5   False
# 6       B     6    True
# 7       B     7    True

If you really don't want the subsampled dataframe df_sample, you can mark the sample in the original dataframe directly:

# Mark random sample in original dataframe with minimal intermediate data.
df["marked2"] = False
df.loc[df.groupby("persons")["data"].sample(frac=percentage_to_flag,
                                            random_state=random_state).index,
       "marked2"] = True
#   persons  data  marked  marked2
# 0       A     0   False    False
# 1       A     1    True     True
# 2       A     2    True     True
# 3       A     3   False    False
# 4       B     4   False    False
# 5       B     5   False    False
# 6       B     6    True     True
# 7       B     7    True     True
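
To tie this back to the layout in the question, here is a minimal sketch that assumes a `Person` column, a 10% sample, and a text label in a `selected` column. The toy data and the `seed` value are illustrative assumptions, not taken from the original CSV, and `groupby(...).sample` needs pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for source.csv: 3 people, 40 rows each.
df = pd.DataFrame({
    "Person": np.repeat(["Ann", "Bob", "Cy"], 40),
    "data": range(120),
})

seed = 0  # illustrative; pass None to get a different sample on each run

# Sample 10% of the rows within each Person and keep only their index labels.
picked = df.groupby("Person")["data"].sample(frac=0.1, random_state=seed).index

# Label the picked rows; every other row gets an empty string.
df["selected"] = np.where(df.index.isin(picked), "bingo!", "")

# Count the labelled rows per person as a sanity check.
counts = (df["selected"] == "bingo!").groupby(df["Person"]).sum()
print(counts)
```

Because the rows are identified by index label, this relies on the dataframe having a unique index, which is the case for a frame freshly read with `pd.read_csv`.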

Answer 3

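I've posted some comments on the suggested solutions. I came up with a way that avoids "merge", but I'm afraid it's still clunky. Then again, it seems to work.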

    import math

    import numpy as np
    import pandas as pd

    # SETUP TEST DATA:
    y = ['Alex'] * 2321 + ['Doug'] * 34123 + ['Chuck'] * 2012 + ['Bob'] * 9281
    z = ['xyz'] * len(y)
    df = pd.DataFrame({'persons': y, 'data': z})
    df = df.sample(frac=1)  # shuffle (optional--just to show order doesn't matter)
    percent = 10  # CHANGE AS NEEDED

    # Add a 'helper' column with random numbers
    df['rand'] = np.random.random(df.shape[0])
    df = df.sample(frac=1)  # this shuffles data, just to show order doesn't matter

    # CREATE A HELPER LIST of [name, row count]; the loop appends each name's cutoff value
    helper = df.groupby('persons')['rand'].count().reset_index().values.tolist()
    for row in helper:
        df_temp = df[df['persons'] == row[0]][['persons', 'rand']]
        lim = math.ceil(len(df_temp) * percent * 0.01)
        row.append(df_temp.nlargest(lim, 'rand').iloc[-1]['rand'])

    def flag(name, num):
        for row in helper:
            if row[0] == name:
                if num >= row[2]:
                    return 'yes'
                else:
                    return 'no'

    df['flag'] = df.apply(lambda x: flag(x['persons'], x['rand']), axis=1)

Then, to check the results:

    piv = df.pivot_table(index="persons", columns="flag", values="data", aggfunc='count', fill_value=0)
    piv.loc['Total'] = piv.sum()  # add a totals row
    piv['Total'] = piv.sum(axis=1)  # add a totals column
    piv['% selected'] = 100 * piv.yes / piv.Total
    print(piv)

OUTPUT:
flag        no   yes  Total  % selected
persons                                
Alex      2088   233   2321   10.038776
Bob       8352   929   9281   10.009697
Chuck     1810   202   2012   10.039761
Doug     30710  3413  34123   10.002051
Total    42960  4777  47737   10.006913

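As I said, it's still clunky, so any tips on getting the other, more elegant suggested solutions to work with this kind of data are very welcome.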
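
One way to drop both the helper list and the row-wise `apply` is to rank the random numbers within each person and compare each rank with that person's quota. This is a sketch rather than a drop-in replacement: the smaller toy data, the seeded `rng`, and the `quota` name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Smaller toy data with the same layout: a 'persons' column and a data column.
y = ['Alex'] * 232 + ['Doug'] * 341 + ['Chuck'] * 201 + ['Bob'] * 93
df = pd.DataFrame({'persons': y, 'data': 'xyz'})
percent = 10

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
df['rand'] = rng.random(len(df))

by_person = df.groupby('persons')['rand']

# Rank 1 is the largest random number within each person; ties cannot
# produce duplicate ranks because method='first' breaks them by position.
rank = by_person.rank(method='first', ascending=False)

# Number of rows to flag per person, rounded up, repeated on every row.
quota = np.ceil(by_person.transform('size') * percent / 100)

df['flag'] = np.where(rank <= quota, 'yes', 'no')

flag_counts = (df['flag'] == 'yes').groupby(df['persons']).sum()
print(flag_counts)
```

The rounding mirrors the `math.ceil` in the loop above, so each person gets at least the requested percentage.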

Answer 4

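This is TMbailey's answer, tweaked so that it works with my version of Python. (I didn't want to edit someone else's answer, but if I'm doing this wrong, I'll take it down.) It works really well, and it's very fast!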

import pandas as pd

percentage_to_flag = .01

# Toy data:
y = ['Alex'] * 2321 + ['Eddie'] * 876 + ['Doug'] * 34123  + ['Chuck'] * 2012 + ['Bob'] * 9281 
z = ['xyz'] * len(y)
df = pd.DataFrame({'persons': y, 'data' : z})
df = df.sample(frac = 1) #optional shuffle, just to show order doesn't matter

# Pick out random sample of dataframe.
random_state = 41  # Change to get different random values.
df_sample = df.groupby("persons").apply(lambda x: x.sample(frac=percentage_to_flag,random_state=random_state))
#had to use lambda in line above
df_sample = df_sample.reset_index(level=0, drop=True)  #had to add this to simplify multi-index DF

# Mark the random sample in the original dataframe.
df["marked"] = False
df.loc[df_sample.index, "marked"] = True

Then, to check:

pp = df.pivot_table(index="persons", columns="marked", values="data", aggfunc='count', fill_value=0)
pp.columns = ['no', 'yes']
pp.loc['Total'] = pp.sum()  # add a totals row
pp['Total'] = pp.sum(axis=1)  # add a totals column
pp['% selected'] = 100 * pp.yes / pp.Total
print(pp)

OUTPUT:
            no  yes  Total  % selected
persons                               
Alex      2298   23   2321    0.990952
Bob       9188   93   9281    1.002047
Chuck     1992   20   2012    0.994036
Doug     33781  342  34123    1.002257
Eddie      867    9    876    1.027397
Total    48126  487  48613    1.001790

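My only reservation is that the sampling sometimes comes in under the set percentage... I've been looking through the DF.sample() docs to see whether there's a way to "round up" when sampling, but so far I don't know whether that's possible. I'd hate to add a "fudge factor" (e.g. 0.001) to percentage_to_flag, since that factor would change depending on the size of the percentage (i.e., it would need to be a bit larger for lower percentages).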
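
One possible way to round up without a fudge factor is to compute the row count yourself and pass it to `sample` as `n` instead of `frac`. The following is a sketch under that assumption; the `sample_ceil` helper and the small toy data are hypothetical and not part of the answer above.

```python
import math

import pandas as pd

percentage_to_flag = 0.01
random_state = 41

# Toy data with group sizes that are not multiples of 100.
y = ['Alex'] * 232 + ['Eddie'] * 87 + ['Bob'] * 928
df = pd.DataFrame({'persons': y, 'data': range(len(y))})


def sample_ceil(group):
    """Sample ceil(len(group) * percentage_to_flag) rows from one group."""
    n = math.ceil(len(group) * percentage_to_flag)
    return group.sample(n=n, random_state=random_state)


# group_keys=False keeps the original row labels on the sampled result.
picked = df.groupby('persons', group_keys=False)['data'].apply(sample_ceil).index

df['marked'] = df.index.isin(picked)
marked_counts = df.groupby('persons')['marked'].sum()
print(marked_counts)
```

With this rounding every group gets at least one marked row. Note that a floating-point product can land a hair above a whole number, in which case `math.ceil` would pick one row more than expected; integer arithmetic on a whole-number percentage may be safer if that matters.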
