
How to undersample the majority class using pyspark

I am trying to balance the data with logic like the code below, but I couldn't figure out how to do it with groupBy and a UDF, and I found that a UDF cannot return a DataFrame.

Is there any way to handle imbalanced data with Spark, or some other approach?

    ratio = 3

    def balance_classes(grp):
        picked = grp.loc[grp.editorsSelection == True]
        n = round(picked.shape[0] * ratio)
        if n:
            try:
                not_picked = grp.loc[grp.editorsSelection == False].sample(n)
            except:
                # In case fewer than n comments have `editorsSelection == False`
                not_picked = grp.loc[grp.editorsSelection == False]
            balanced_grp = pd.concat([picked, not_picked])
            return balanced_grp
        else:
            # If there is no editor's pick for an article, discard all comments from that article
            return None

    comments = comments.groupby('articleID').apply(balance_classes).reset_index(drop=True)

I usually use this logic to undersample:

from pyspark.sql.functions import col

def resample(base_features, ratio, class_field, base_class):
    # Split into the minority class (base_class) and the majority rows
    pos = base_features.filter(col(class_field)==base_class)
    neg = base_features.filter(col(class_field)!=base_class)
    total_pos = pos.count()
    total_neg = neg.count()
    # Fraction of the majority class to keep so that neg ≈ ratio * pos
    fraction=float(total_pos*ratio)/float(total_neg)
    sampled = neg.sample(False,fraction)
    return sampled.union(pos)

base_features is the Spark DataFrame holding the features, ratio is the desired ratio between negatives and positives, class_field is the name of the column holding the class, and base_class is the ID of the minority class.
