
How to undersample the majority class using PySpark

I tried to solve this with the code below, but I haven't figured out how to do it with groupBy and a UDF, and I also found that a UDF cannot return a DataFrame.

Is there any way to implement this in Spark, or some other method that can handle unbalanced data?

    ratio = 3

    def balance_classes(grp):
        picked = grp.loc[grp.editorsSelection == True]
        n = round(picked.shape[0] * ratio)
        if n:
            try:
                not_picked = grp.loc[grp.editorsSelection == False].sample(n)
            except ValueError:
                # In case there are fewer than n comments with `editorsSelection == False`
                not_picked = grp.loc[grp.editorsSelection == False]
            balanced_grp = pd.concat([picked, not_picked])
            return balanced_grp
        else:
            # If there is no editor's pick for an article, discard all comments from that article
            return None

    comments = comments.groupby('articleID').apply(balance_classes).reset_index(drop=True)

I usually use this logic to undersample:

    from pyspark.sql.functions import col

    def resample(base_features, ratio, class_field, base_class):
        pos = base_features.filter(col(class_field) == base_class)
        neg = base_features.filter(col(class_field) != base_class)
        total_pos = pos.count()
        total_neg = neg.count()
        # Fraction of negatives to keep so that negatives ≈ ratio * positives
        fraction = float(total_pos * ratio) / float(total_neg)
        sampled = neg.sample(False, fraction)
        return sampled.union(pos)

Here `base_features` is a Spark DataFrame holding the features, `ratio` is the desired ratio of negatives to positives, `class_field` is the name of the column that holds the class labels, and `base_class` is the label of the minority (positive) class.
