Extract rows from dataframe containing each label

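In the example dataframe shown below, I have 5 labels (class_name). There are 31 files (31 rows) in total available...

I am trying to extract 80% (this can vary) of the rows, which is 24 rows as an integer. However, I want to make sure that at least 1 row is extracted from each class_name.

In my attempt, I can only do this manually. That becomes tedious when there are far more than 10 class_name values. Could you help me extract the correct percentage of rows so that at least 1 entry from each label (class_name) is included?

Here is my attempt: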

import math
import pandas as pd

base_path = 'G:/PandasFileSeperation'
original_df = pd.read_csv(f'{base_path}/Book2.csv')

original_df = original_df.astype(str)
length = original_df.class_name.count()
length

# Get number of groups
dfg = original_df.groupby('class_name')
numgroups = dfg.ngroups
numgroups

# Get the sizes of each group
group_size =  original_df.groupby('class_name').size()

# Get length of original dataframe
Total_dataset_size = len(original_df)

# Get number of Training samples
TrainPercent = 0.80

Train_size = int(Total_dataset_size * TrainPercent)
Train_size

# How can I change this to automatically change the label size length to give at least 1 row from each class?
Label_0_size = 5
Label_1_size = 3
Label_2_size = 7
Label_3_size = 7
Label_4_size = 2

# Training Dataset
label_percent = { 'pigs' : Label_0_size, 'goats' : Label_1_size, 'chickens' : Label_2_size, 'hens' : Label_3_size, 'sheep' : Label_4_size}

flag = True
for label, num_rows in label_percent.items():
  
  row_num = num_rows
  
  if label == 'pigs':
    row_num0 = Label_0_size
    df = original_df[original_df['class_name'] == label]
    df = df.head(row_num0)

  elif label == 'goats':
    row_num1 = Label_1_size
    df = original_df[original_df['class_name'] == label]
    df = df.head(row_num1)
    
    
  elif label == 'chickens':
    row_num2 = Label_2_size
    df = original_df[original_df['class_name'] == label]
    df = df.head(row_num2)
    
  elif label == 'hens':
    row_num3 = Label_3_size
    df = original_df[original_df['class_name'] == label]
    df = df.head(row_num3)
    
  else:
    row_num4 = Label_4_size
    df = original_df[original_df['class_name'] == label]
    df = df.head(row_num4)

  if flag == True:
    Train_df = df
    flag = False
  else:
    Train_df = pd.concat([Train_df, df])

Train_df.to_csv(f'{base_path}/Train_df.csv', encoding='utf-8')
Dataframe:
slice_file_name fsID start end salience fold classID class_name original_class
1-1000020520400.wav 1 1
1-100004024000001.wav 1 1
1-10000406050001.wav 1 1
1-1000050120400.wav 1 1
1-1000050320400.wav 1 1
1-1000050520400.wav 1 2 goats
1-10000601400001000.wav 1 2 goats
1-1000060340000.wav 1 2 goats
1-100006070500.wav 1 3
1-100007020800.wav 1 3
1-100007024000001.wav 1 3
1-1000070320400.wav 1 3
1-100007050800.wav 1 3
1-100007064000001.wav 1 3
1-100010620400.wav 1 3
1-100040620400.wav 1 3
1-10006020500.wav 1 3
1-10006030500.wav 1 3
1-100060520400.wav 1 4 hens
1-10007020500.wav 1 4 hens
2-100070420400.wav 1 4 hens
2-100070540000.wav 1 4 hens
2-1313131313004.wav 1 4 hens
2-1313131313043.wav 1 4 hens
2-1313131313044.wav 1 5
2-150002020500.wav 1 5
2-150002060800.wav 1 5
2-150004022040001.wav 1 5
2-15000406050001.wav 1 5
2-150006014000001.wav 1 5
2-150006024000001.wav 1 5

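As a start, we could write a for loop that begins with 1 row for each class, then increments each class in turn, checking the sum after every increment, until the total across all classes equals Train_size.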
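
The loop described above could be sketched roughly as follows. This is only one possible reading of the idea: the helper name allocate_per_class is hypothetical, and the demo group sizes are made up to match the row counts in the question (5, 3, 10, 6 and 7 rows), with the label-to-count pairing assumed for illustration.

```python
import pandas as pd


def allocate_per_class(group_sizes: pd.Series, train_size: int) -> dict:
    """Give every class 1 row, then add 1 row per class in turn until
    the counts sum to train_size. A class never gets more rows than it has.

    Hypothetical helper, not part of pandas.
    """
    if train_size < len(group_sizes):
        raise ValueError("train_size is smaller than the number of classes")
    if train_size > int(group_sizes.sum()):
        raise ValueError("train_size is larger than the dataframe")

    counts = {label: 1 for label in group_sizes.index}
    total = len(counts)
    while total < train_size:
        for label, available in group_sizes.items():
            if total == train_size:
                break
            if counts[label] < available:
                counts[label] += 1
                total += 1
    return counts


# Assumed demo sizes; in the question these would come from
# original_df.groupby('class_name').size()
demo_sizes = pd.Series({"pigs": 5, "goats": 3, "chickens": 10, "hens": 6, "sheep": 7})
counts = allocate_per_class(demo_sizes, int(demo_sizes.sum() * 0.80))
```

The resulting dictionary can replace the hand-written label_percent in the question, for example by concatenating original_df[original_df['class_name'] == label].head(n) for each label and n in counts.items(). Note that this takes the first rows of each class rather than a random sample.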

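I think this should help. Sample the dataframe twice: first use groupby to sample one row for each class_name, then randomly sample from the rest of the dataframe to complete the 80% training set.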

TrainPercent = 0.8
# sample one row for each class_name (5 rows)
one_each = df.groupby('class_name').sample(n=1)
# from the rest of the rows, sample int(0.8*len(df))-len(one_each) number of rows (19 rows)
rest = df.loc[~df.index.isin(one_each.index)].sample(n=int(TrainPercent*len(df))-len(one_each))
# concatenate the two
res = pd.concat([one_each, rest])
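
For anyone who wants to try this end to end, here is a self-contained sketch of the same two-step sampling wrapped in a function. The function name, the seed parameter and the demo dataframe are assumptions added for illustration; groupby(...).sample needs pandas 1.1 or newer, and the index is assumed to be unique.

```python
import pandas as pd


def sample_with_every_label(df, label_col="class_name", frac=0.8, seed=None):
    """Return (train, test) where train has int(frac * len(df)) rows
    and contains at least one row of every label.

    Hypothetical helper; assumes df has a unique index.
    """
    target = int(frac * len(df))
    # Step 1: one random row per label
    one_each = df.groupby(label_col).sample(n=1, random_state=seed)
    if target < len(one_each):
        raise ValueError("frac is too small to cover every label")
    # Step 2: fill up to the target size from the remaining rows
    remaining = df.loc[~df.index.isin(one_each.index)]
    rest = remaining.sample(n=target - len(one_each), random_state=seed)
    train = pd.concat([one_each, rest])
    # Everything not selected can serve as the held-out set
    test = df.loc[~df.index.isin(train.index)]
    return train, test


# Made-up demo data with the same class sizes as the question (31 rows)
demo = pd.DataFrame(
    {"class_name": ["pigs"] * 5 + ["goats"] * 3 + ["chickens"] * 10
                   + ["hens"] * 6 + ["sheep"] * 7}
)
train, test = sample_with_every_label(demo, frac=0.8, seed=0)
```

Because only one row per label is guaranteed, small classes can end up with a single training row. If proportional class sizes matter, a stratified split may be a better fit than this approach.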
