For each unique value in a pandas DataFrame column, how can I randomly select a proportion of rows?
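Python newbie here. Imagine a csv file that looks like this:
(...except that in real life, there are 20 distinct names in the Person column, and each Person has 300-500 rows. Also, there are multiple data columns, not just one.)
What I want to do is randomly flag 10% of each Person's rows and mark them in a new column. I came up with a ridiculously convoluted way to do this: it involved creating a helper column of random numbers and all sorts of unnecessarily complicated juggling. It worked, but it was crazy. More recently, I came up with this: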
import pandas as pd
df = pd.read_csv('source.csv')
df['selected'] = ''
names= list(df['Person'].unique()) #gets list of unique names
for name in names:
    df_temp = df[df['Person']== name]
    samp = int(len(df_temp)/10) # I want to sample 10% for each name
    df_temp = df_temp.sample(samp)
    df_temp['selected'] = 'bingo!' #a new column to mark the rows I've randomly selected
    df = df.merge(df_temp, how = 'left', on = ['Person','data'])
    df['temp'] =[f"{a} {b}" for a,b in zip(df['selected_x'],df['selected_y'])]
    #Note: initially instead of the line above, I tried the line below, but it didn't work too well:
    #df['temp'] = df['selected_x'] + df['selected_y']
    df = df[['Person','data','temp']]
    df = df.rename(columns = {'temp':'selected'})
df['selected'] = df['selected'].str.replace('nan','').str.strip() #cleans up the column
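As you can see, I'm essentially pulling out a temporary DataFrame for each Person, using DF.sample(number) to do the randomizing, and then using DF.merge to get the "marked" rows back into the original DataFrame. It involves iterating through a list to create each temporary DataFrame... and my understanding is that iterating is kind of lame.
There has to be a more Pythonic, vectorized way to do this, right? Without iterating. Maybe something involving groupby? Any thoughts or advice are much appreciated.
EDIT: Here's another way that avoids merge... but it's still pretty clunky: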
import math
import numpy as np
import pandas as pd
#SETUP TEST DATA:
y = ['Alex'] * 2321 + ['Doug'] * 34123 + ['Chuck'] * 2012 + ['Bob'] * 9281
z = ['xyz'] * len(y)
df = pd.DataFrame({'persons': y, 'data' : z})
df = df.sample(frac = 1) #shuffle (optional--just to show order doesn't matter)
percent = 10 #CHANGE AS NEEDED
#Add a 'helper' column with random numbers
df['rand'] = np.random.random(df.shape[0])
df = df.sample(frac=1) #this shuffles data, just to show order doesn't matter
#CREATE A HELPER LIST of [person, row count] entries
helper = pd.DataFrame(df.groupby('persons')['rand'].count()).reset_index().values.tolist()
for row in helper:
    df_temp = df[df['persons'] == row[0]][['persons','rand']]
    lim = math.ceil(len(df_temp) * percent*0.01)
    #append the threshold: the smallest of the top `lim` random numbers for this person
    row.append(df_temp.nlargest(lim,'rand').iloc[-1]['rand'])
def flag(name,num):
    for row in helper:
        if row[0] == name:
            if num >= row[2]:
                return 'yes'
            else:
                return 'no'
df['flag'] = df.apply(lambda x: flag(x['persons'], x['rand']), axis=1)
If I understand correctly, you can achieve this with the following:
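import pandas as pd
df = pd.DataFrame(data={'persons':['A']*10 + ['B']*10, 'col_1':[2]*20})
percentage_to_flag = 0.5
# Within each group, sample a fraction of the rows and test every row's index label against that sample.
# transform returns one value per row, aligned to df's index, so the result can be assigned directly.
a = df.groupby(['persons'])['col_1'].transform(lambda x: x.index.isin(x.sample(frac=percentage_to_flag, replace=False).index))
df['flagged'] = a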
Input:
persons col_1
0 A 2
1 A 2
2 A 2
3 A 2
4 A 2
5 A 2
6 A 2
7 A 2
8 A 2
9 A 2
10 B 2
11 B 2
12 B 2
13 B 2
14 B 2
15 B 2
16 B 2
17 B 2
18 B 2
19 B 2
Output with 50% flagged rows in each group:
persons col_1 flagged
0 A 2 True
1 A 2 True
2 A 2 True
3 A 2 True
4 A 2 False
5 A 2 False
6 A 2 False
7 A 2 False
8 A 2 False
9 A 2 True
10 B 2 False
11 B 2 True
12 B 2 True
13 B 2 False
14 B 2 True
15 B 2 True
16 B 2 False
17 B 2 True
18 B 2 False
19 B 2 False
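You can use groupby.sample, either to pick out a sample of the whole dataframe for further processing, or to identify rows of the dataframe to mark, whichever is more convenient.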
import pandas as pd
percentage_to_flag = 0.5
# Toy data: 8 rows, persons A and B.
df = pd.DataFrame(data={'persons':['A']*4 + ['B']*4, 'data':range(8)})
# persons data
# 0 A 0
# 1 A 1
# 2 A 2
# 3 A 3
# 4 B 4
# 5 B 5
# 6 B 6
# 7 B 7
# Pick out random sample of dataframe.
random_state = 41 # Change to get different random values.
df_sample = df.groupby("persons").sample(frac=percentage_to_flag,
                                         random_state=random_state)
# persons data
# 1 A 1
# 2 A 2
# 7 B 7
# 6 B 6
# Mark the random sample in the original dataframe.
df["marked"] = False
df.loc[df_sample.index, "marked"] = True
# persons data marked
# 0 A 0 False
# 1 A 1 True
# 2 A 2 True
# 3 A 3 False
# 4 B 4 False
# 5 B 5 False
# 6 B 6 True
# 7 B 7 True
If you really do not want the sub-sampled dataframe df_sample, you can go straight to marking a sample of the original dataframe:
# Mark random sample in original dataframe with minimal intermediate data.
df["marked2"] = False
df.loc[df.groupby("persons")["data"].sample(frac=percentage_to_flag,
                                            random_state=random_state).index,
       "marked2"] = True
# persons data marked marked2
# 0 A 0 False False
# 1 A 1 True True
# 2 A 2 True True
# 3 A 3 False False
# 4 B 4 False False
# 5 B 5 False False
# 6 B 6 True True
# 7 B 7 True True
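To confirm that each group ended up with the intended share of marked rows, one option is to aggregate the boolean column per group. The following is only a minimal sketch on the same toy layout: the `summary` name and its column labels are illustrative, `groupby(...).sample` needs pandas 1.1 or later, and marking by index label assumes the dataframe's index is unique.

```python
import pandas as pd

percentage_to_flag = 0.5

# Toy data matching the example above: 4 rows each for persons A and B.
df = pd.DataFrame({"persons": ["A"] * 4 + ["B"] * 4, "data": range(8)})

# Sample within each group, then mark the sampled rows by index label.
sampled_index = df.groupby("persons").sample(frac=percentage_to_flag,
                                             random_state=41).index
df["marked"] = df.index.isin(sampled_index)

# Per group: number of marked rows, group size, and marked fraction.
summary = df.groupby("persons")["marked"].agg(
    n_marked="sum", n_rows="size", fraction="mean"
)
print(summary)
```

With 4 rows per group and a fraction of 0.5, each group should show 2 marked rows out of 4.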
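I've left some comments on the proposed solutions. I came up with a way that avoids merge, but I'm afraid it's still clunky. Then again, it seems to work.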
import math
import numpy as np
import pandas as pd
#SETUP TEST DATA:
y = ['Alex'] * 2321 + ['Doug'] * 34123 + ['Chuck'] * 2012 + ['Bob'] * 9281
z = ['xyz'] * len(y)
df = pd.DataFrame({'persons': y, 'data' : z})
df = df.sample(frac = 1) #shuffle (optional--just to show order doesn't matter)
percent = 10 #CHANGE AS NEEDED
#Add a 'helper' column with random numbers
df['rand'] = np.random.random(df.shape[0])
df = df.sample(frac=1) #this shuffles data, just to show order doesn't matter
#CREATE A HELPER LIST of [person, row count] entries
helper = pd.DataFrame(df.groupby('persons')['rand'].count()).reset_index().values.tolist()
for row in helper:
    df_temp = df[df['persons'] == row[0]][['persons','rand']]
    lim = math.ceil(len(df_temp) * percent*0.01)
    #append the threshold: the smallest of the top `lim` random numbers for this person
    row.append(df_temp.nlargest(lim,'rand').iloc[-1]['rand'])
def flag(name,num):
    for row in helper:
        if row[0] == name:
            if num >= row[2]:
                return 'yes'
            else:
                return 'no'
df['flag'] = df.apply(lambda x: flag(x['persons'], x['rand']), axis=1)
Then, to check the results:
piv = df.pivot_table(index="persons", columns="flag", values="data", aggfunc='count', fill_value=0)
piv.loc['Total'] = piv.sum() #add a row of column totals (DataFrame.append was removed in pandas 2.0)
piv = piv.assign(Total=lambda d: d.sum(axis=1)) #add a column of row totals
piv['% selected'] = 100 * piv.yes/piv.Total
print(piv)
OUTPUT:
flag no yes Total % selected
persons
Alex 2088 233 2321 10.038776
Bob 8352 929 9281 10.009697
Chuck 1810 202 2012 10.039761
Doug 30710 3413 34123 10.002051
Total 42960 4777 47737 10.006913
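As I said, it's still clunky, so any tips on getting the other, more elegant proposed solutions to work with this kind of data are very welcome.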
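The random-number approach above can also be written without the helper list or the row-wise apply, by ranking the random numbers within each person. This is only a sketch under assumptions, not one of the posted answers: the `rank`, `group_size`, and `limit` names and the seed are illustrative choices, and the rounding-up rule mirrors the `math.ceil` call used above.

```python
import numpy as np
import pandas as pd

percent = 10  # percentage of each person's rows to flag

# Test data with the same group sizes as above.
y = ['Alex'] * 2321 + ['Doug'] * 34123 + ['Chuck'] * 2012 + ['Bob'] * 9281
df = pd.DataFrame({'persons': y, 'data': 'xyz'})

rng = np.random.default_rng(0)  # seed chosen arbitrarily, for repeatability
df['rand'] = rng.random(len(df))

# Rank the random numbers within each person (1 = largest, no ties).
rank = df.groupby('persons')['rand'].rank(method='first', ascending=False)

# Rows to flag per person, rounded up, broadcast to every row of the group.
group_size = df.groupby('persons')['persons'].transform('size')
limit = np.ceil(group_size * percent / 100)

# A row is flagged when its random number is among the top `limit` of its group.
df['flag'] = np.where(rank <= limit, 'yes', 'no')

print(df.groupby('persons')['flag'].value_counts().unstack())
```

Because `method='first'` assigns each row in a group a distinct rank, exactly `ceil(group size * percent / 100)` rows per person are flagged, which matches the counts in the output above (233, 929, 202, and 3413).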
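This is TMbailey's answer, tweaked so that it works with my version of Python. (I didn't want to edit someone else's answer, but if I'm doing this wrong, I'll take it down.) It works really well and is really fast!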
import pandas as pd
import sys
percentage_to_flag = .01
# Toy data:
y = ['Alex'] * 2321 + ['Eddie'] * 876 + ['Doug'] * 34123 + ['Chuck'] * 2012 + ['Bob'] * 9281
z = ['xyz'] * len(y)
df = pd.DataFrame({'persons': y, 'data' : z})
df = df.sample(frac = 1) #optional shuffle, just to show order doesn't matter
# Pick out random sample of dataframe.
random_state = 41 # Change to get different random values.
df_sample = df.groupby("persons").apply(lambda x: x.sample(frac=percentage_to_flag,random_state=random_state))
#had to use lambda in line above
df_sample = df_sample.reset_index(level=0, drop=True) #had to add this to simplify multi-index DF
# Mark the random sample in the original dataframe.
df["marked"] = False
df.loc[df_sample.index, "marked"] = True
Then, to check:
pp = df.pivot_table(index="persons", columns="marked", values="data", aggfunc='count', fill_value=0)
pp.columns = ['no','yes']
pp.loc['Total'] = pp.sum() #add a row of column totals (DataFrame.append was removed in pandas 2.0)
pp = pp.assign(Total=lambda d: d.sum(axis=1)) #add a column of row totals
pp['% selected'] = 100 * pp.yes/pp.Total
print(pp)
OUTPUT:
no yes Total % selected
persons
Alex 2298 23 2321 0.990952
Bob 9188 93 9281 1.002047
Chuck 1992 20 2012 0.994036
Doug 33781 342 34123 1.002257
Eddie 867 9 876 1.027397
Total 48126 487 48613 1.001790
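My only reservation is that the sampling sometimes comes in under the set percentage... I've been looking through the DF.sample() documentation to see whether there is a way to "round up" when sampling, but so far no luck, and I don't know if it's possible. I'd hate to add a "fudge factor" (e.g., 0.001) to percentage_to_flag, because that factor would vary with the size of the percentage (i.e., it would need to be a bit larger for lower percentages).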
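One way to get rounding up without a fudge factor is to compute the row count per group yourself and pass it to sample as n instead of frac. The sketch below is only an illustration under assumptions: `sample_rounded_up` is a hypothetical helper, not a pandas function; the `round(..., 9)` guard against floating-point noise is an arbitrary choice; and marking by index label assumes a unique index.

```python
import math
import pandas as pd

percentage_to_flag = 0.01

# Toy data with the same group sizes as above.
y = ['Alex'] * 2321 + ['Eddie'] * 876 + ['Doug'] * 34123 + ['Chuck'] * 2012 + ['Bob'] * 9281
df = pd.DataFrame({'persons': y, 'data': 'xyz'})

def sample_rounded_up(group, frac, random_state=None):
    """Hypothetical helper: sample ceil(len(group) * frac) rows from one group."""
    # Round first so that noise such as 3.0000000000000004 does not add a row.
    n = math.ceil(round(len(group) * frac, 9))
    return group.sample(n=n, random_state=random_state)

# Apply the helper to one column per group; group_keys=False keeps the
# original index labels on the result instead of adding the group name.
sampled = df.groupby('persons', group_keys=False)['data'].apply(
    sample_rounded_up, frac=percentage_to_flag, random_state=41
)

# Mark the sampled rows in the original dataframe.
df['marked'] = df.index.isin(sampled.index)

print(df.groupby('persons')['marked'].sum())
```

With these group sizes, Alex gets 24 marked rows (23.21 rounded up) instead of 23, and Chuck gets 21 instead of 20, so no group falls below the requested percentage.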