Why is my sklearn SVC much slower when manually scaling instead of using the pipeline?
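I want to build an SVC on the imdb dataset using a CountVectorizer. The documentation says to scale the training/test data to get better results. There is example code that uses a pipeline, but it should also work with manual scaling.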
import numpy as np
import datasets
import pandas as pd
import timeit
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# prepare dataset
training_data = pd.DataFrame(datasets.load_dataset("imdb", split="train"))
training_data = training_data.sample(frac=1).reset_index(drop=True)
test_data = pd.DataFrame(datasets.load_dataset("imdb", split="test"))
test_data = test_data.sample(frac=1).reset_index(drop=True)
vect = CountVectorizer()
# Series.append was removed in pandas 2.0, so concatenate with pd.concat instead
data = vect.fit_transform(pd.concat([training_data["text"], test_data["text"]])).toarray()
train = data[0:25000,]
test = data[25000:50000,]
# find most frequent words in the comments
count = np.sum(train, axis=0)
ind = np.argsort(count)[::-1]
ks = [10]
#ks = [10, 50, 100, 500, 1000, 2000] # I want to compare the results for different k
# reduce features to the k most frequent tokens
# columns are already sorted by frequency desc
k_ind = ind[:max(ks)]
X = np.ascontiguousarray(train[:,k_ind])
y = training_data["label"]
test_set = np.ascontiguousarray(test[:,k_ind])
print(f"Check if X is C-contiguous: {X[:,:min(ks)].flags}")
# Test the execution time with pipeline first
for k in ks:
    clf = make_pipeline(StandardScaler(), SVC(C=1, kernel='linear', cache_size=4000))
    # only use k features
    t = timeit.timeit(stmt='clf.fit(X[:,:k],y)', number=1, globals=globals())
    print(f"Time with pipeline: {t}s")
# Test the execution with manual scaling
scaler = StandardScaler()
scaler.fit(X)
scaler.transform(X, copy=False)
scaler.fit(test_set)
scaler.transform(test_set, copy=False)
for k in ks:
    clf = SVC(C=1, kernel='linear', cache_size=4000)
    t = timeit.timeit(stmt='clf.fit(X[:,:k],y)', number=1, globals=globals())
    print(f"Time with manual scaling: {t}s")
This produces the output:
Check if X is C-contiguous:
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
Time with pipeline: 58.852547400000276s
Time with manual scaling: 181.0459952000001s
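As you can see, the pipeline is much faster. Why is that? I want to test the classifier for different values of k, but then the pipeline and its scaler would be called several times on the same training data. Does scaling already-scaled data return the same result, or does it change on every iteration? (That is why I scaled manually and then sliced the scaled data.)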
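On the second question, here is a minimal sketch on made-up data (the names `X_demo`, `y_demo` and `pipe` are illustrative and not from the post). It assumes the default `StandardScaler(copy=True)`, in which case refitting the pipeline for several k should leave the input array untouched. It also checks that standardizing already-standardized data gives numerically the same values.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical toy data standing in for the token-count matrix
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 10, size=(40, 5)).astype(np.float64)
y_demo = np.array([0, 1] * 20)
X_before = X_demo.copy()

# Refit the same pipeline for different k, as in the loop over ks
pipe = make_pipeline(StandardScaler(), SVC(C=1, kernel="linear"))
for k in (2, 5):
    pipe.fit(X_demo[:, :k], y_demo)

# The scaler inside the pipeline works on a copy, so X_demo keeps its raw values
input_untouched = np.array_equal(X_demo, X_before)
print("input untouched:", input_untouched)

# Standardizing data that already has zero mean and unit variance is a no-op
once = StandardScaler().fit_transform(X_demo)
twice = StandardScaler().fit_transform(once)
idempotent = np.allclose(once, twice)
print("rescaling changes nothing:", idempotent)
```

So each `clf.fit` call in the pipeline starts again from the raw columns, and the scaling does not build up across iterations.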
Well, I simply forgot to save the scaled array...
<...>
# Test the execution with manual scaling
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X, copy=False)
scaler.fit(test_set)
test_set = scaler.transform(test_set, copy=False)
<...>
Now both approaches take the same time.
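A small check on a toy array (hypothetical data, not the imdb matrix) illustrates why the return value matters. The count matrix has an integer dtype, so `transform` has to convert it to float and cannot scale it in place even with `copy=False`. The scikit-learn documentation likewise notes that in-place scaling is not guaranteed.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy integer count matrix, similar in dtype to CountVectorizer(...).toarray()
X_counts = np.array([[0, 3], [2, 1], [4, 5]], dtype=np.int64)
before = X_counts.copy()

scaler = StandardScaler().fit(X_counts)
X_scaled = scaler.transform(X_counts, copy=False)

# The integer input is left as it was; the scaled values live only in the return value
unchanged = np.array_equal(X_counts, before)
print("original unchanged:", unchanged)
print("input dtype:", X_counts.dtype, "| output dtype:", X_scaled.dtype)
print("column means after scaling:", X_scaled.mean(axis=0))
```

If the returned array is discarded, as in the original script, the SVC is trained on the raw counts, which would explain the much longer fit time.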