Custom resampler for a one-v-one in a pipeline
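I am working on implementing a custom undersampler based on SVM. The class undersamples the majority class down to the size of the minority class by selecting the majority examples that lie nearest to the class's support vectors.
Here is the code: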
import numpy as np
from collections import Counter
from sklearn.svm import SVC
from sklearn.utils import check_random_state

class NearSVUmdersampler():
    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit_resample(self, X, y):
        random_state = check_random_state(self.random_state)
        # class distribution
        counter = Counter(y)
        maj_class = counter.most_common()[0][0]
        min_class = counter.most_common()[-1][0]
        # number of minority examples
        num_minority = len(X[y == min_class])
        svc = SVC(kernel='rbf', random_state=32)
        svc.fit(X, y)
        # majority class support vectors
        maj_sup_vector = svc.support_vectors_[y[svc.support_] == maj_class]
        # compute distance from each majority example to its nearest support vector
        distances = []
        for i, x in enumerate(X[y == maj_class]):
            d = np.linalg.norm(maj_sup_vector - x, axis=1).min()
            distances.append((i, d))
        # sort distances (ascending)
        distances.sort(key=lambda tup: tup[1])
        index = [i for i, d in distances][:num_minority]
        X_ds = np.concatenate((X[y == maj_class][index], X[y == min_class]))
        y_ds = np.concatenate((y[y == maj_class][index], y[y == min_class]))
        return X_ds, y_ds
The class returns resampled data in which the majority class is balanced down to the size of the minority class.
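The selection step can also be written without a Python loop. This is only a minimal sketch, assuming that "near the support vectors" means the smallest Euclidean distance to any reference vector. The helper name `nearest_to_reference` and the toy arrays are hypothetical and are not part of the class in the question.

```python
import numpy as np


def nearest_to_reference(points, reference, k):
    """Return indices of the k rows of `points` closest to any row of `reference`.

    Hypothetical helper: closeness is the minimum Euclidean distance
    from a point to the set of reference vectors.
    """
    # pairwise differences, shape (n_points, n_reference, n_features)
    diff = points[:, None, :] - reference[None, :, :]
    # distance of each point to its nearest reference vector
    dist = np.linalg.norm(diff, axis=2).min(axis=1)
    # stable sort keeps the original order among ties
    return np.argsort(dist, kind="stable")[:k]


# toy majority-class points and two stand-in "support vectors" (made-up values)
majority = np.array([[0.0, 0.0], [5.0, 5.0], [1.0, 0.0], [9.0, 9.0]])
support = np.array([[0.0, 1.0], [5.0, 4.5]])
picked = nearest_to_reference(majority, support, k=2)
print(picked)
```

The returned indices refer to rows of the majority-class subset, so they would be applied as `X[y == maj_class][picked]`, the same way `index` is used in `fit_resample`.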
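I want to use this class in a pipeline for multiclass classification. My intention is to do this in a one-vs-one (OvO) setting, so that for each OvO pair the undersampler is called to resample only the data of the two classes currently involved.
For example, with this dummy data: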
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# sample data
X, y = make_classification(n_samples=2000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, n_classes=4, weights=[0.08, 0.12, 0.2], flip_y=0, random_state=162)
xtrain, xtest, ytrain, ytest = train_test_split(X, y,
                                                test_size=.2, random_state=12)
Counter(ytrain)
Counter({0: 126, 1: 192, 2: 330, 3: 952})
In the OvO case I will have 4(4-1)/2 = 6 models. So in each OvO model, the majority class should be undersampled like this:
Model 1 = Class 0 vs Class 1 # maj: 1=192; undersampled to 126 -> 0:126, 1:126
Model 2 = Class 0 vs Class 2 # maj: 2=330; undersampled to 126 -> 0:126, 2:126
Model 3 = Class 0 vs Class 3 # maj: 3=952; undersampled to 126 -> 0:126, 3:126
Model 4 = Class 1 vs Class 2 # maj: 2=330; undersampled to 192 -> 1:192, 2:192
Model 5 = Class 1 vs Class 3 # maj: 3=952; undersampled to 192 -> 1:192, 3:192
Model 6 = Class 2 vs Class 3 # maj: 3=952; undersampled to 330 -> 2:330, 3:330
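The expected per-pair sizes can be derived mechanically from the class counts. The snippet below is a small sketch using only the standard library. It assumes the training counts reported above and that each pair is balanced down to the smaller of its two classes. The variable names are illustrative.

```python
from collections import Counter
from itertools import combinations

# training-set class counts reported in the question
counts = Counter({0: 126, 1: 192, 2: 330, 3: 952})

expected = {}
for a, b in combinations(sorted(counts), 2):
    # each pair is balanced down to the smaller of its two classes
    size = min(counts[a], counts[b])
    expected[(a, b)] = {a: size, b: size}

for pair, balanced in expected.items():
    print(pair, balanced)
```

With four classes, `combinations` yields the six pairs in the same order as the models listed above.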
With this in mind, I am interested in using SVC as the estimator of OneVsOneClassifier, as follows:
from imblearn.pipeline import Pipeline
from sklearn.multiclass import OneVsOneClassifier
model = OneVsOneClassifier(
    estimator=SVC(kernel='rbf'), n_jobs=-1)
resampler = NearSVUmdersampler(random_state=123)
and fitting it as:
classifier = Pipeline([('sampler', resampler), ('clf', model) ])
classifier.fit(xtrain, ytrain)
Pipeline(steps=[('sampler',
                 <__main__.NearSVUmdersampler object at 0x7f4386fa30d0>),
                ('clf', OneVsOneClassifier(estimator=SVC(), n_jobs=-1))])
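Problem:
It seems the resampler is called only once, with all the training data from every class passed to it. So it returns only the overall majority and minority classes of the original data, with the majority undersampled to the size of the minority, which means the model is trained on just two classes.
For example, in the MWE above, it returns: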
{0: 126, 3: 126} # the majority: 3=952; undersampled to minority: 0=126
This is the Model 3 case; nothing is done for any of the other cases.
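The reason only two classes survive can be seen from the way the resampler picks its classes. The following sketch repeats the `most_common()` lookups from `fit_resample` on the four-class counts reported above. The variable `dropped` is added here for illustration only.

```python
from collections import Counter

# class counts of the full training set, as reported in the question
counts = Counter({0: 126, 1: 192, 2: 330, 3: 952})

ranked = counts.most_common()
maj_class = ranked[0][0]    # most frequent class overall
min_class = ranked[-1][0]   # least frequent class overall

# classes that are neither the majority nor the minority never reach the output
dropped = sorted(set(counts) - {maj_class, min_class})
print(maj_class, min_class, dropped)
```

Because the resampler only concatenates rows of `maj_class` and `min_class`, every class in `dropped` is discarded when it receives the full multiclass data in a single call.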
Given the pipeline I have, how can I make this work in OvO?
Try this:
model = SVC(kernel='rbf')
resampler = NearSVUmdersampler(random_state=123)
base_estimator = Pipeline([('sampler', resampler), ('clf', model)])
classifier = OneVsOneClassifier(estimator=base_estimator)
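Now, when you call classifier.fit, OneVsOneClassifier fits the base_estimator pipeline on each pairwise slice of the data, so the resampling happens separately for each pair of classes.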
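One way to check this behaviour is to wrap the base estimator in a probe that records what it is fitted on. The sketch below is not the pipeline from the answer. `CountingSVC` is a hypothetical stand-in estimator and the dataset parameters are made up. It only illustrates that OneVsOneClassifier fits a separate clone of its estimator on the rows of each class pair.

```python
from collections import Counter
from itertools import combinations

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC


class CountingSVC(ClassifierMixin, BaseEstimator):
    """Hypothetical probe: records the class sizes it is fitted on, then defers to SVC."""

    def __init__(self, kernel="rbf"):
        self.kernel = kernel

    def fit(self, X, y):
        # sizes of the classes present in this particular call to fit
        self.seen_counts_ = sorted(Counter(np.asarray(y).tolist()).values())
        self.svc_ = SVC(kernel=self.kernel).fit(X, y)
        self.classes_ = self.svc_.classes_
        return self

    def predict(self, X):
        return self.svc_.predict(X)

    def decision_function(self, X):
        return self.svc_.decision_function(X)


# made-up imbalanced four-class data
X, y = make_classification(n_samples=400, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, n_classes=4,
                           weights=[0.1, 0.2, 0.3], flip_y=0, random_state=0)

ovo = OneVsOneClassifier(CountingSVC()).fit(X, y)

# one fitted clone per class pair, each of which saw only that pair's rows
pairs = list(combinations(ovo.classes_.tolist(), 2))
seen = [est.seen_counts_ for est in ovo.estimators_]
for pair, sizes in zip(pairs, seen):
    print(pair, sizes)
```

Each fitted clone reports exactly two class sizes, so a resampler placed inside the base estimator receives one pair at a time, and its majority/minority logic applies to that pair only.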