Is it feasible for the training set to be smaller than the test set after undersampling the majority class?
How to undersample the majority class using PySpark
I tried to balance the data with code like the snippet below, but I could not get it working with groupBy and a UDF, and I found that a UDF cannot return a DataFrame.
Is there any way to handle imbalanced data with Spark, or by some other means?
    ratio = 3

    def balance_classes(grp):
        picked = grp.loc[grp.editorsSelection == True]
        n = round(picked.shape[0] * ratio)
        if n:
            try:
                not_picked = grp.loc[grp.editorsSelection == False].sample(n)
            except:
                # In case fewer than n comments have `editorsSelection == False`
                not_picked = grp.loc[grp.editorsSelection == False]
            balanced_grp = pd.concat([picked, not_picked])
            return balanced_grp
        else:
            # If there is no editor's pick for an article, discard all comments from that article
            return None

    comments = comments.groupby('articleID').apply(balance_classes).reset_index(drop=True)
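For reference, here is a minimal, self-contained run of that grouping logic on made-up toy data (the column names `articleID` and `editorsSelection` come from the snippet above; the data values are invented for illustration):

```python
import pandas as pd

ratio = 3

def balance_classes(grp):
    picked = grp.loc[grp.editorsSelection == True]
    n = round(picked.shape[0] * ratio)
    if n:
        try:
            not_picked = grp.loc[grp.editorsSelection == False].sample(n)
        except ValueError:  # fewer than n negatives in this group
            not_picked = grp.loc[grp.editorsSelection == False]
        return pd.concat([picked, not_picked])
    return None  # groups returning None are dropped by groupby.apply

# Toy data: article 'a' has 2 picks and 8 non-picks; article 'b' has no picks.
comments = pd.DataFrame({
    'articleID': ['a'] * 10 + ['b'] * 5,
    'editorsSelection': [True, True] + [False] * 8 + [False] * 5,
})
balanced = comments.groupby('articleID').apply(balance_classes).reset_index(drop=True)
# Article 'a' keeps its 2 picks plus 2*3=6 sampled non-picks; article 'b' is dropped.
```

This shows why the approach is hard to port to Spark directly: the per-group function returns a whole DataFrame, which a PySpark UDF cannot do.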
I usually use this logic for undersampling:
    from pyspark.sql.functions import col

    def resample(base_features, ratio, class_field, base_class):
        # Split the DataFrame into the positive (minority) and negative (majority) classes
        pos = base_features.filter(col(class_field) == base_class)
        neg = base_features.filter(col(class_field) != base_class)
        total_pos = pos.count()
        total_neg = neg.count()
        # Fraction of negatives to keep so that negatives ≈ ratio * positives
        fraction = float(total_pos * ratio) / float(total_neg)
        sampled = neg.sample(False, fraction)  # sample without replacement
        return sampled.union(pos)
base_features is the Spark DataFrame holding the features. ratio is the desired ratio of negatives to positives, class_field is the name of the column holding the class, and base_class is the ID of the positive class.
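The fraction passed to sample() is simply the desired negative count divided by the available negatives, so the expected output size can be checked with quick arithmetic (the counts here are made up for illustration; note that Spark's sample() is a Bernoulli sample, so actual counts vary slightly around the expectation):

```python
# Hypothetical class counts, chosen only to illustrate the arithmetic
total_pos = 1_000
total_neg = 50_000
ratio = 3

# Same formula as in resample() above
fraction = float(total_pos * ratio) / float(total_neg)  # 3000 / 50000 = 0.06

# Expected number of negatives kept, and expected final size
expected_neg = total_neg * fraction          # ≈ 3000 = ratio * total_pos
expected_total = total_pos + expected_neg    # ≈ 4000 rows overall
```

With ratio = 1 this yields a balanced set; with ratio = 3, as in the question, you keep three negatives per positive.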