
How to undersample the majority class using PySpark

I tried to solve this with the code below, but I haven't figured out how to do it with groupBy and a UDF, and I also found that a UDF cannot return a DataFrame.

Is there any way to implement this in Spark, or some other method that can handle unbalanced data?

    ratio = 3

    def balance_classes(grp):
        picked = grp.loc[grp.editorsSelection == True]
        n = round(picked.shape[0] * ratio)
        if n:
            try:
                not_picked = grp.loc[grp.editorsSelection == False].sample(n)
            except ValueError:
                # In case there are fewer than n comments with `editorsSelection == False`
                not_picked = grp.loc[grp.editorsSelection == False]
            balanced_grp = pd.concat([picked, not_picked])
            return balanced_grp
        else:
            # If there is no editor's pick for an article, discard all comments from that article
            return None

    comments = comments.groupby('articleID').apply(balance_classes).reset_index(drop=True)

I usually use this logic to undersample:

    from pyspark.sql.functions import col

    def resample(base_features, ratio, class_field, base_class):
        pos = base_features.filter(col(class_field) == base_class)
        neg = base_features.filter(col(class_field) != base_class)
        total_pos = pos.count()
        total_neg = neg.count()
        # Fraction of negatives to keep so that negatives ≈ ratio * positives
        fraction = float(total_pos * ratio) / float(total_neg)
        sampled = neg.sample(False, fraction)
        return sampled.union(pos)

Here `base_features` is a Spark DataFrame holding the features, `ratio` is the desired ratio of negatives to positives, `class_field` is the name of the column that holds the class labels, and `base_class` is the label of the minority (positive) class.
