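Extract rows from dataframe containing each label
In the sample dataframe shown below, I have 5 labels (class_name). There are 31 files (31 rows) in total to work with...
I am trying to extract 80% (this can vary) of the rows (= 24 rows, rounded down to an integer). However, I want to make sure that at least 1 row is extracted from each class_name.
So far I have only been able to do this manually. That approach becomes tedious once there are far more than 10 class_name values. Could you help me extract the right percentage of rows so that at least 1 entry from each label (class_name) is included?
Here is my attempt: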
import math
import pandas as pd
base_path = 'G:/PandasFileSeperation'
original_df = pd.read_csv(f'{base_path}/Book2.csv')
original_df = original_df.astype(str)
length = original_df.class_name.count()
length
# Get number of groups
dfg = original_df.groupby('class_name')
numgroups = dfg.ngroups
numgroups
# Get the sizes of each group
group_size = original_df.groupby('class_name').size()
# Get length of original dataframe
Total_dataset_size = len(original_df)
# Get number of Training samples
TrainPercent = 0.80
Train_size = int(Total_dataset_size * TrainPercent)
Train_size
# How can I change this to automatically change the label size length to give at least 1 row from each class?
Label_0_size = 5
Label_1_size = 3
Label_2_size = 7
Label_3_size = 7
Label_4_size = 2
# Training Dataset
label_percent = { 'pigs' : Label_0_size, 'goats' : Label_1_size, 'chickens' : Label_2_size, 'hens' : Label_3_size, 'sheep' : Label_4_size}
flag = True
for label, num_rows in label_percent.items():
    row_num = num_rows
    if label == 'pigs':
        row_num0 = Label_0_size
        df = original_df[original_df['class_name'] == label]
        df = df.head(row_num0)
    elif label == 'goats':
        # was assigned to row_num2, which left row_num1 undefined (NameError)
        row_num1 = Label_1_size
        df = original_df[original_df['class_name'] == label]
        df = df.head(row_num1)
    elif label == 'chickens':
        row_num2 = Label_2_size
        df = original_df[original_df['class_name'] == label]
        df = df.head(row_num2)
    elif label == 'hens':
        row_num3 = Label_3_size
        df = original_df[original_df['class_name'] == label]
        df = df.head(row_num3)
    else:
        row_num4 = Label_4_size
        df = original_df[original_df['class_name'] == label]
        df = df.head(row_num4)
    if flag == True:
        Train_df = df
        flag = False
    else:
        Train_df = pd.concat([Train_df, df])
Train_df.to_csv(f'{base_path}/Train_df.csv', encoding='utf-8')
slice_file_name | fsID | start | end | salience | fold | classID | class_name | original_class |
---|---|---|---|---|---|---|---|---|
1-1000020520400.wav | | | | | 1 | 1 | pigs | |
1-100004024000001.wav | | | | | 1 | 1 | pigs | |
1-10000406050001.wav | | | | | 1 | 1 | pigs | |
1-1000050120400.wav | | | | | 1 | 1 | pigs | |
1-1000050320400.wav | | | | | 1 | 1 | pigs | |
1-1000050520400.wav | | | | | 1 | 2 | goats | |
1-10000601400001000.wav | | | | | 1 | 2 | goats | |
1-1000060340000.wav | | | | | 1 | 2 | goats | |
1-100006070500.wav | | | | | 1 | 3 | chickens | |
1-100007020800.wav | | | | | 1 | 3 | chickens | |
1-100007024000001.wav | | | | | 1 | 3 | chickens | |
1-1000070320400.wav | | | | | 1 | 3 | chickens | |
1-100007050800.wav | | | | | 1 | 3 | chickens | |
1-100007064000001.wav | | | | | 1 | 3 | chickens | |
1-100010620400.wav | | | | | 1 | 3 | chickens | |
1-100040620400.wav | | | | | 1 | 3 | chickens | |
1-10006020500.wav | | | | | 1 | 3 | chickens | |
1-10006030500.wav | | | | | 1 | 3 | chickens | |
1-100060520400.wav | | | | | 1 | 4 | hens | |
1-10007020500.wav | | | | | 1 | 4 | hens | |
2-100070420400.wav | | | | | 1 | 4 | hens | |
2-100070540000.wav | | | | | 1 | 4 | hens | |
2-1313131313004.wav | | | | | 1 | 4 | hens | |
2-1313131313043.wav | | | | | 1 | 4 | hens | |
2-1313131313044.wav | | | | | 1 | 5 | sheep | |
2-150002020500.wav | | | | | 1 | 5 | sheep | |
2-150002060800.wav | | | | | 1 | 5 | sheep | |
2-150004022040001.wav | | | | | 1 | 5 | sheep | |
2-15000406050001.wav | | | | | 1 | 5 | sheep | |
2-150006014000001.wav | | | | | 1 | 5 | sheep | |
2-150006024000001.wav | | | | | 1 | 5 | sheep | |
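As a start, could we use a for loop that begins with 1 row per class, then increments each class in turn, checking the running total after each step, until the total across all classes equals Train_size?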
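The incremental idea in this comment could be sketched roughly as below. This is only one possible reading of it, not code from the original poster: allocate_per_class, sizes and toy_df are hypothetical names, and toy_df is a stand-in for original_df built from the class sizes in the sample table. Each class starts at 1 row, and rows are then handed out one class at a time (never exceeding what a class actually has) until the target is reached.

```python
from typing import Dict, Mapping

import pandas as pd


def allocate_per_class(group_sizes: Mapping[str, int], train_size: int) -> Dict[str, int]:
    """Start every class at 1 row, then add rows round-robin until train_size is reached."""
    total = sum(group_sizes.values())
    if not len(group_sizes) <= train_size <= total:
        raise ValueError("train_size must lie between the number of classes and the number of rows")
    alloc = {label: 1 for label in group_sizes}
    assigned = len(alloc)
    while assigned < train_size:
        for label, size in group_sizes.items():
            if assigned == train_size:
                break
            # Only grow a class that still has unused rows
            if alloc[label] < size:
                alloc[label] += 1
                assigned += 1
    return alloc


# Hypothetical stand-in for original_df, using the class sizes from the sample table
sizes = {"pigs": 5, "goats": 3, "chickens": 10, "hens": 6, "sheep": 7}
toy_df = pd.DataFrame({"class_name": [name for name, n in sizes.items() for _ in range(n)]})

group_size = toy_df.groupby("class_name").size().to_dict()
alloc = allocate_per_class(group_size, int(len(toy_df) * 0.80))

# Take the first alloc[name] rows of every class, as the manual attempt did with head()
train_df = pd.concat(
    [group.head(alloc[name]) for name, group in toy_df.groupby("class_name")]
)
```

Because the rows are taken with head(), this selection is deterministic rather than random; swapping head(alloc[name]) for sample(n=alloc[name]) would randomize it within each class.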
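I think this should help. Sample the dataframe twice: first use groupby to sample one row for each class_name, then randomly sample from the rest of the dataframe to fill out the 80% training set.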
TrainPercent = 0.8
# sample one row for each class_name (5 rows)
one_each = df.groupby('class_name').sample(n=1)
# from the rest of the rows, sample int(0.8*len(df))-len(one_each) number of rows (19 rows)
rest = df.loc[~df.index.isin(one_each.index)].sample(n=int(TrainPercent*len(df))-len(one_each))
# concatenate the two
res = pd.concat([one_each, rest])
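For reuse, the two-step sampling above could be wrapped in a small helper. The following is a sketch under a few assumptions rather than a definitive version: sample_with_each_class, toy_df and the random_state argument are additions not present in the answer, the dataframe index is assumed to be unique, and groupby(...).sample requires pandas 1.1 or newer.

```python
import pandas as pd


def sample_with_each_class(df, label_col, frac, random_state=None):
    """Sample int(frac * len(df)) rows, guaranteeing at least one row per label."""
    n_total = int(frac * len(df))
    # Step 1: one random row from every class
    one_each = df.groupby(label_col).sample(n=1, random_state=random_state)
    n_rest = n_total - len(one_each)
    if n_rest < 0:
        raise ValueError("frac is too small to include one row from every class")
    # Step 2: fill the remainder from rows not picked in step 1 (assumes a unique index)
    remaining = df.loc[~df.index.isin(one_each.index)]
    rest = remaining.sample(n=n_rest, random_state=random_state)
    return pd.concat([one_each, rest])


# Hypothetical stand-in for the question's dataframe, using the sample table's class sizes
sizes = {"pigs": 5, "goats": 3, "chickens": 10, "hens": 6, "sheep": 7}
toy_df = pd.DataFrame({"class_name": [name for name, n in sizes.items() for _ in range(n)]})

train = sample_with_each_class(toy_df, "class_name", frac=0.8, random_state=0)
# Everything that was not drawn can serve as the held-out part
test = toy_df.drop(train.index)
```

Passing a fixed random_state makes the split reproducible between runs; leaving it as None gives a different draw each time, as in the original snippet.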