SKlearn SGD Partial Fit

Question

What am I doing wrong here? I have a large data set that I want to fit incrementally using scikit-learn's SGDClassifier and its partial_fit method.

I do the following:

from sklearn.linear_model import SGDClassifier
import pandas as pd

chunksize = 5
clf2 = SGDClassifier(loss='log', penalty="l2")

for train_df in pd.read_csv("train.csv", chunksize=chunksize, iterator=True):
    X = train_df[features_columns]
    Y = train_df["clicked"]
    clf2.partial_fit(X, Y)

I'm getting the error:

Traceback (most recent call last):
  File "/predict.py", line 48, in <module>
    sys.exit(0 if main() else 1)
  File "/predict.py", line 44, in main
    predict()
  File "/predict.py", line 38, in predict
    clf2.partial_fit(X, Y)
  File "/Users/anaconda/lib/python3.5/site-packages/sklearn/linear_model/stochastic_gradient.py", line 512, in partial_fit
    coef_init=None, intercept_init=None)
  File "/Users/anaconda/lib/python3.5/site-packages/sklearn/linear_model/stochastic_gradient.py", line 349, in _partial_fit
    _check_partial_fit_first_call(self, classes)
  File "/Users/anaconda/lib/python3.5/site-packages/sklearn/utils/multiclass.py", line 297, in _check_partial_fit_first_call
    raise ValueError("classes must be passed on the first call "
ValueError: classes must be passed on the first call to partial_fit.

Answer 1

Note that the classifier does not know the set of classes in advance, so on the first call to partial_fit you must pass it through the classes argument, for example with np.unique(target), where target is the class column. Because you are reading the data in chunks, np.unique(Y) computed on a chunk only works if that chunk contains every possible value of the class label. If classes is passed again on later calls, it must match the set given on the first call, otherwise partial_fit raises a ValueError. Provided every chunk contains all the labels, your code would be:

import numpy as np

for train_df in pd.read_csv("train.csv", chunksize=chunksize, iterator=True):
    X = train_df[features_columns]
    Y = train_df["clicked"]
    # np.unique(Y) is only correct if this chunk contains every class label
    clf2.partial_fit(X, Y, classes=np.unique(Y))
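With a chunk size as small as 5, a chunk can easily contain only one label, so it may be safer to fix the label set once and pass the same array on every call. The following is a minimal sketch of that idea, not a definitive implementation. It assumes that "clicked" is binary (labels 0 and 1); the in-memory CSV and the feature names f1 and f2 are hypothetical stand-ins for train.csv and features_columns, and the loss is left at its default only to keep the example independent of the scikit-learn version.

```python
import io

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Hypothetical stand-in for train.csv: two features and a binary "clicked" label.
rng = np.random.default_rng(0)
n_rows = 40
frame = pd.DataFrame({
    "f1": rng.normal(size=n_rows),
    "f2": rng.normal(size=n_rows),
})
frame["clicked"] = (frame["f1"] + frame["f2"] > 0).astype(int)
csv_buffer = io.StringIO(frame.to_csv(index=False))

features_columns = ["f1", "f2"]   # assumed feature names
all_classes = np.array([0, 1])    # assumed: every label the target can take

clf2 = SGDClassifier(penalty="l2", random_state=0)  # loss left at its default

for train_df in pd.read_csv(csv_buffer, chunksize=5):
    X = train_df[features_columns]
    Y = train_df["clicked"]
    # Pass the same full label set on every call, even if this
    # particular chunk happens to contain only one of the classes.
    clf2.partial_fit(X, Y, classes=all_classes)

print(clf2.classes_)
```

Because the label set no longer depends on the contents of a chunk, the order of the rows in the file stops mattering for this particular error.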

Answer 2

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier.partial_fit

clf2.partial_fit(X, Y, classes=np.unique(Y))

A given chunk may not contain records for every class, so the classifier needs to be told the complete set of classes it will have to distinguish.
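If the full set of labels is not known ahead of time, one option is a cheap first pass over the label column alone before training. This is only a sketch under assumptions: the inline CSV text is a hypothetical stand-in for train.csv, and the label column is assumed to be named "clicked" as in the question.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for train.csv; a real run would pass the file path instead.
csv_text = "f1,clicked\n0.1,0\n0.4,0\n0.9,1\n0.3,0\n0.8,2\n0.7,1\n"

labels = set()
# First pass: read only the label column, a few rows at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["clicked"], chunksize=2):
    labels.update(chunk["clicked"].unique())

# Sorted array of every label seen anywhere in the file.
all_classes = np.array(sorted(labels))
print(all_classes)
```

The resulting all_classes array can then be passed as classes= to partial_fit during the second, training pass over the file.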
