Issue with usages of `transform` vs. `fit_transform` in CountVectorizer
I have successfully trained and tested a Logistic Regression model with CountVectorizer() as follows:
from sklearn import linear_model, metrics, model_selection
from sklearn.feature_extraction.text import CountVectorizer

def train_model(classifier, feature_vector_train, label):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)
    return classifier

def getPredictions(classifier, feature_vector_valid):
    # predict the labels on the validation dataset and score them against valid_y
    predict = classifier.predict(feature_vector_valid)
    return metrics.accuracy_score(valid_y, predict)

def createTrainingAndValidation(column):
    global train_x, valid_x, train_y, valid_y
    train_x, valid_x, train_y, valid_y = model_selection.train_test_split(finalDF[column], finalDF['DeedType1'])

def createCountVectorizer(column):
    global xtrain_count, xvalid_count
    # create a count vectorizer object
    count_vect = CountVectorizer()
    count_vect.fit(finalDF[column])
    # transform the training and validation data using count vectorizer object
    xtrain_count = count_vect.transform(train_x)
    xvalid_count = count_vect.transform(valid_x)

createTrainingAndValidation('Test')
createCountVectorizer('Test')
# train_model takes three arguments: the classifier, the training features and the labels
classifier = train_model(linear_model.LogisticRegression(), xtrain_count, train_y)
# getPredictions returns the accuracy on the validation split
predictions = getPredictions(classifier, xvalid_count)
I was using a DataFrame called finalDF with all labelled text. Since this model gives me 0.68 accuracy, I was going to test it on a subset of the DataFrame with unknown labels. This subset was not included in the training and testing phase. I saved the trained model as bestClassifier.
Now I have the subset of unknown text and tried the following:
count_vect = CountVectorizer()
count_vect.fit(unknownDf['Text'])
text = unknownDf['Text']
xvalid_count = count_vect.transform(text)
bestClassifier.predict(xvalid_count)
finalDF has 800 rows, while unknownDf has only 32 rows after the steps above. How do I rectify this?
I think I see what's going on. In this piece of code:
def createCountVectorizer(column):
    global xtrain_count, xvalid_count
    # create a count vectorizer object
    count_vect = CountVectorizer()
    count_vect.fit(finalDF[column])
    # transform the training and validation data using count vectorizer object
    xtrain_count = count_vect.transform(train_x)
    xvalid_count = count_vect.transform(valid_x)
You are declaring a CountVectorizer(), calling fit and then transform. What you need to do is use the same CountVectorizer() to transform unknownDf['Text']. The classifier was trained on columns defined by that vectorizer's vocabulary, so new text has to be mapped onto the same columns.
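To make the fit/transform distinction concrete, here is a minimal sketch. The two small lists are made-up stand-ins for the training text and the unseen text, not the asker's data. It shows that a vectorizer fitted once produces the same number of columns for any text it later transforms.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-ins for the training text and the unseen text.
train_text = ["warranty deed for lot one", "quit claim deed", "deed of trust"]
new_text = ["warranty deed recorded today"]

vectorizer = CountVectorizer()
vectorizer.fit(train_text)  # learns the vocabulary from the training text only

train_matrix = vectorizer.transform(train_text)
new_matrix = vectorizer.transform(new_text)  # reuses the learned vocabulary

# One column per vocabulary word, so both matrices share the same width.
# Words never seen during fit ("recorded", "today") are simply ignored.
print(train_matrix.shape, new_matrix.shape)
```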
When you do this:
count_vect = CountVectorizer()
count_vect.fit(unknownDf['Text'])
text = unknownDf['Text']
xvalid_count = count_vect.transform(text)
You are creating a brand-new CountVectorizer(), which builds a new bag of words for unknownDf['Text']. What you should do instead is remove these two lines:
count_vect = CountVectorizer()
count_vect.fit(unknownDf['Text'])
and let the existing CountVectorizer(), which you fit on finalDF[column], transform unknownDf['Text'].
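The following sketch illustrates what goes wrong with a second vectorizer. The text, the labels and the variable names (fitted_vect, wrong_vect, clf) are invented for illustration. A vectorizer fitted on the unknown text learns a different vocabulary, so its matrix has a different number of columns than the classifier was trained on, and scikit-learn rejects it.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up labelled text and unlabelled text, for illustration only.
train_text = ["warranty deed for lot one", "quit claim deed",
              "deed of trust", "warranty deed lot two"]
train_labels = ["warranty", "quitclaim", "trust", "warranty"]
unknown_text = ["quit claim recorded", "trust deed"]

# Vectorizer and classifier fitted together on the labelled text.
fitted_vect = CountVectorizer().fit(train_text)
clf = LogisticRegression().fit(fitted_vect.transform(train_text), train_labels)

# Mistake: a new vectorizer learns its own, smaller vocabulary.
wrong_vect = CountVectorizer().fit(unknown_text)
wrong_features = wrong_vect.transform(unknown_text)

try:
    clf.predict(wrong_features)
    mismatch_error = None
except ValueError as exc:  # column count differs from what clf was trained on
    mismatch_error = exc

# Fix: transform the unknown text with the vectorizer fitted earlier.
right_features = fitted_vect.transform(unknown_text)
predictions = clf.predict(right_features)
```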
Find a way to use the CountVectorizer() that you declared as count_vect inside createCountVectorizer(column) to transform unknownDf['Text']. Since count_vect is local to that function, it has to be returned or otherwise kept around so that it is still available when the unknown text arrives.
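One possible way to do this, sketched under assumptions: the function below is a hypothetical variant of createCountVectorizer that takes the text directly and returns the fitted vectorizer instead of keeping it local. The lists are made-up stand-ins for finalDF[column], its labels and unknownDf['Text']; a pandas Series of strings can be passed in the same way.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


def create_count_vectorizer(train_text):
    """Fit a CountVectorizer on the training text and return it for reuse."""
    count_vect = CountVectorizer()
    count_vect.fit(train_text)
    return count_vect


# Made-up stand-ins for finalDF[column], finalDF['DeedType1'] and unknownDf['Text'].
labelled_text = ["warranty deed for lot one", "quit claim deed",
                 "deed of trust", "warranty deed lot two"]
labels = ["warranty", "quitclaim", "trust", "warranty"]
unknown_text = ["quit claim recorded", "trust deed", "warranty deed"]

# Fit once, keep the object, and train the classifier on its output.
count_vect = create_count_vectorizer(labelled_text)
best_classifier = LogisticRegression().fit(count_vect.transform(labelled_text), labels)

# Later: only transform the unknown text, never fit again.
unknown_count = count_vect.transform(unknown_text)
unknown_predictions = best_classifier.predict(unknown_count)
```

If the classifier is saved for later use, the fitted vectorizer needs to be saved alongside it. scikit-learn's Pipeline is another option, since it bundles both steps into a single object that is fitted and stored together.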