Why is my sklearn SVC much slower when I scale manually instead of using a pipeline?

Question

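I want to build an SVC on the imdb dataset using CountVectorizer. The documentation says to scale the train/test data for better results. There is sample code that uses a pipeline, but it should work with manual scaling as well.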

import numpy as np
import datasets
import pandas as pd
import timeit
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# prepare dataset
training_data = pd.DataFrame(datasets.load_dataset("imdb", split="train"))
training_data = training_data.sample(frac=1).reset_index(drop=True)
test_data = pd.DataFrame(datasets.load_dataset("imdb", split="test"))
test_data = test_data.sample(frac=1).reset_index(drop=True)

vect = CountVectorizer()
# Series.append was removed in pandas 2.0; pd.concat is the equivalent call
data = vect.fit_transform(pd.concat([training_data["text"], test_data["text"]])).toarray()
train = data[0:25000,]
test = data[25000:50000,]

# find most frequent words in the comments
count = np.sum(train, axis=0)
ind = np.argsort(count)[::-1]

ks = [10]
#ks = [10, 50, 100, 500, 1000, 2000] # I want to compare the results for different k

# reduce features to the k most frequent tokens
# columns are already sorted by frequency desc
k_ind = ind[:max(ks)]
X = np.ascontiguousarray(train[:,k_ind])
y = training_data["label"]
test_set = np.ascontiguousarray(test[:,k_ind])

print(f"Check if X is C-contiguous: {X[:,:min(ks)].flags}")

# Test the execution time with pipeline first
for k in ks:
  clf = make_pipeline(StandardScaler(), SVC(C=1, kernel='linear', cache_size=4000))
  # only use k features
  t = timeit.timeit(stmt='clf.fit(X[:,:k],y)', number=1, globals=globals())
  print(f"Time with pipeline: {t}s")

# Test the execution with manual scaling
scaler = StandardScaler()
scaler.fit(X)
scaler.transform(X, copy=False)
scaler.fit(test_set)
scaler.transform(test_set, copy=False)

for k in ks:
  clf = SVC(C=1, kernel='linear', cache_size=4000)
  t = timeit.timeit(stmt='clf.fit(X[:,:k],y)', number=1, globals=globals())
  print(f"Time with manual scaling: {t}s")

This produces the output:

Check if X is C-contiguous:
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
Time with pipeline: 58.852547400000276s
Time with manual scaling: 181.0459952000001s

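As you can see, the pipeline is much faster. Why is that? I want to test the classifier for different values of k, but then the pipeline and its scaler would be called multiple times on the same training data. Does scaling already-scaled data return the same result, or does it change on every iteration? (That is why I scale manually and then slice the scaled data.)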

Answer 1

Well, I simply forgot to save the scaled arrays...

<...>
# Test the execution with manual scaling
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X, copy=False)
scaler.fit(test_set)
test_set = scaler.transform(test_set, copy=False)
<...>

Now both approaches take the same amount of time.
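The likely reason the return value matters here: CountVectorizer produces integer counts, and StandardScaler converts integer input to float before scaling, so copy=False cannot modify the original array in place. The manual run was therefore probably training on unscaled counts. The sketch below checks this on small synthetic data (the array X_int and the value of k are made up for illustration, not taken from the imdb run). It also checks the two points raised at the end of the question: scaling is done per column, so slicing the scaled array gives the same result as scaling the slice, and scaling already-standardized data again leaves it essentially unchanged.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic integer counts standing in for the CountVectorizer output (assumption).
rng = np.random.default_rng(0)
X_int = rng.integers(0, 20, size=(200, 5))
X_before = X_int.copy()

scaler = StandardScaler().fit(X_int)

# Integer input is converted to float inside transform, so a new array is
# returned and X_int itself is left untouched, even with copy=False.
X_scaled = scaler.transform(X_int, copy=False)
unchanged = np.array_equal(X_int, X_before)
print(f"X_int unchanged after transform: {unchanged}")
print(f"dtype of returned array: {X_scaled.dtype}")

# Scaling is per column, so slicing the scaled array matches scaling the slice.
k = 3
sliced_after = X_scaled[:, :k]
scaled_slice = StandardScaler().fit_transform(X_int[:, :k])
same_slice = np.allclose(sliced_after, scaled_slice)
print(f"slice-then-scale equals scale-then-slice: {same_slice}")

# Refitting on already standardized data finds mean 0 and std 1,
# so a second pass leaves the values essentially unchanged.
rescaled = StandardScaler().fit_transform(X_scaled)
idempotent = np.allclose(rescaled, X_scaled)
print(f"second scaling changes nothing: {idempotent}")
```

A pipeline refits its scaler on whatever is passed to fit each time, so calling it repeatedly on the same raw X for different k does not accumulate any scaling either.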

Solution 1
0 2022-05-18 20:45:46
