
Custom resampler for one-vs-one (OvO) in a pipeline

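I am working on a custom undersampler based on an SVM. The class undersamples the majority class down to the size of the minority class by keeping the majority examples that lie closest to the majority class's support vectors.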

Here is the code:


import numpy as np
from collections import Counter

from sklearn.svm import SVC
from sklearn.utils import check_random_state

class NearSVUmdersampler():
  def __init__(self, random_state=None):
    self.random_state = random_state
  
  def fit_resample(self, X, y):
    random_state = check_random_state(self.random_state)
    # class distribution
    counter = Counter(y)
    maj_class = counter.most_common()[0][0]
    min_class = counter.most_common()[-1][0]
    # number of minority examples
    num_minority = len(X[ y == min_class])
    svc = SVC(kernel='rbf', random_state=32)
    svc.fit(X,y)
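    # majority class support vectors (svc.support_ holds their row indices in X)
    maj_sup_vectors = svc.support_vectors_[y[svc.support_] == maj_class]
    # distance from each majority example to its nearest majority support vector
    distances = []
    for i, x in enumerate(X[y == maj_class]):
      d = np.linalg.norm(maj_sup_vectors - x, axis=1).min()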
      distances.append((i, d))
    # sort distances (ascending)
    distances.sort(key=lambda tup: tup[1])
    index = [i for i, d in distances][:num_minority]
    X_ds = np.concatenate((X[y == maj_class][index], X[y == min_class]))
    y_ds = np.concatenate((y[y == maj_class][index], y[y == min_class]))

    return X_ds, y_ds 

The class returns resampled data in which the majority class has been reduced to the size of the minority class.
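The selection step can also be written without the Python loop. The following is only a minimal sketch, assuming the quantity of interest is the Euclidean distance from each point to its nearest reference vector. `k_nearest_to_refs` is a hypothetical helper and the arrays are made-up toy data, not taken from the post:

```python
import numpy as np

def k_nearest_to_refs(points, refs, k):
    """Return indices of the k rows of `points` closest to any row of `refs`."""
    # pairwise differences via broadcasting: shape (n_points, n_refs, n_features)
    diffs = points[:, None, :] - refs[None, :, :]
    # distance from each point to its nearest reference vector
    nearest = np.linalg.norm(diffs, axis=2).min(axis=1)
    # stable sort keeps the original order among equal distances
    return np.argsort(nearest, kind="stable")[:k]

# toy data (assumption): four candidate points, two reference vectors
points = np.array([[0.0, 0.0], [5.0, 5.0], [1.0, 0.0], [9.0, 9.0]])
refs = np.array([[0.0, 0.5], [5.0, 4.0]])
idx = k_nearest_to_refs(points, refs, k=2)
```

In the undersampler, `points` would correspond to `X[y == maj_class]`, `refs` to the majority support vectors, and `k` to `num_minority`.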

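I want to use this class in a pipeline for multiclass classification. My intention is to do this in a one-vs-one (OvO) setting, so that for each OvO pair the undersampler is called to resample only the data of the two classes involved in that pair.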

For example, with this dummy data:

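from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# sample data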
X, y = make_classification(n_samples=2000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, n_classes=4, weights=[0.08, 0.12, 0.2], flip_y=0, random_state=162)

xtrain, xtest, ytrain, ytest = train_test_split(X, y, 
                test_size=.2, random_state=12)

Counter(ytrain)
Counter({0: 126, 1: 192, 2: 330, 3: 952})

In the OvO case I will have 4(4-1)/2 = 6 models. In each OvO model, the majority class should therefore be undersampled like this:

Model 1 = Class 0 Vs Class 1 # maj:1=192; undersampled to 126, -> 0:126, 1:126 
Model 2 = Class 0 Vs Class 2 # maj:2=330; undersampled to 126  -> 0:126, 2:126
Model 3 = Class 0 Vs Class 3 # maj:3=952; undersampled to 126, -> 0:126, 3:126
Model 4 = Class 1 Vs Class 2 # maj:2=330; undersampled to 192  -> 1:192, 2:192
Model 5 = Class 1 Vs Class 3 # maj:3=952; undersampled to 192  -> 1:192, 3:192
Model 6 = Class 2 Vs Class 3 # maj:3=952; undersampled to 330  -> 2:330, 3:330
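The table above can be reproduced from the training class counts. This is a small sketch using only the standard library; `expected_ovo_sizes` is a hypothetical helper name, and the counts are the ones shown by `Counter(ytrain)` earlier:

```python
from collections import Counter
from itertools import combinations

# class counts reported for ytrain in the post
train_counts = Counter({0: 126, 1: 192, 2: 330, 3: 952})

def expected_ovo_sizes(counts):
    """Map each class pair to the per-class size after undersampling to the smaller class."""
    sizes = {}
    for a, b in combinations(sorted(counts), 2):
        sizes[(a, b)] = min(counts[a], counts[b])
    return sizes

ovo_sizes = expected_ovo_sizes(train_counts)
for (a, b), n in ovo_sizes.items():
    print(f"Class {a} vs Class {b}: {a}:{n}, {b}:{n}")
```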

With this in mind, I would like to use SVC as the estimator of OneVsOneClassifier, as follows:

from imblearn.pipeline import Pipeline
from sklearn.multiclass import OneVsOneClassifier

model = OneVsOneClassifier(
    estimator=SVC(kernel='rbf'), n_jobs=-1)

resampler = NearSVUmdersampler(random_state=123)

and fit it like this:

classifier = Pipeline([('sampler', resampler), ('clf', model) ])
classifier.fit(xtrain, ytrain)
Pipeline(steps=[('sampler',
                 <__main__.NearSVUmdersampler object at 0x7f4386fa30d0>),
                ('clf', OneVsOneClassifier(estimator=SVC(), n_jobs=-1))])

The problem:

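It seems the resampler is called only once, with all the training data for all classes passed to it. It therefore returns only the overall majority and minority classes of the original data, with the majority undersampled to the size of the minority, so the classifier is trained on just those two classes.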

For example, in the MWE above, it returns:

{0: 126, 3: 126} # the majority: 3=952; undersampled to minority: 0=126

That is only the Model 3 case; nothing is done for any of the other pairs.

Given the pipeline I have, how can I make this work in OvO?

Try this:

model = SVC(kernel='rbf')
resampler = NearSVUmdersampler(random_state=123)
base_estimator = Pipeline([('sampler', resampler), ('clf', model)])
classifier = OneVsOneClassifier(estimator=base_estimator)

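Now, when you call classifier.fit, OneVsOneClassifier fits the base_estimator pipeline on each pairwise slice of the data, so the resampling is done separately for each pair of classes.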
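One way to see why the nesting order matters is to check what data each pairwise fit receives. The sketch below is an illustration only: `FitRecorder` is a hypothetical stand-in estimator and the toy data are assumptions, neither comes from the post. It records how many rows OneVsOneClassifier passes to each fitted copy of its estimator:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.multiclass import OneVsOneClassifier

class FitRecorder(ClassifierMixin, BaseEstimator):
    """Hypothetical stand-in estimator: only records the size of the data it is fitted on."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.n_seen_ = len(y)
        return self

# toy data (assumption): 3 classes with 5, 10 and 20 samples
y = np.repeat([0, 1, 2], [5, 10, 20])
X = np.arange(len(y), dtype=float).reshape(-1, 1)

ovo = OneVsOneClassifier(FitRecorder()).fit(X, y)
# one fitted copy per class pair, in the order (0,1), (0,2), (1,2)
seen = [est.n_seen_ for est in ovo.estimators_]
```

Each fitted copy sees only the rows of its two classes, so a sampler placed inside the estimator is called on the same per-pair slices. A sampler placed before OneVsOneClassifier instead sees the full multiclass data once.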

