Problem with using `transform` vs. `fit_transform` in CountVectorizer

Question

I have successfully trained and tested a Logistic Regression model with CountVectorizer() as follows:

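from sklearn import linear_model, metrics, model_selection
from sklearn.feature_extraction.text import CountVectorizer

def train_model(classifier, feature_vector_train, label):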
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)

    return classifier

def getPredictions(classifier, feature_vector_valid):
    # predict the labels on validation dataset
    predict = classifier.predict(feature_vector_valid)

    # accuracy_score expects (y_true, y_pred)
    return metrics.accuracy_score(valid_y, predict)

def createTrainingAndValidation(column):
    global train_x, valid_x, train_y, valid_y
    train_x, valid_x, train_y, valid_y = model_selection.train_test_split(finalDF[column], finalDF['DeedType1'])

def createCountVectorizer(column):
    global xtrain_count, xvalid_count
    # create a count vectorizer object 
    count_vect = CountVectorizer()
    count_vect.fit(finalDF[column])

    # transform the training and validation data using count vectorizer object
    xtrain_count =  count_vect.transform(train_x)
    xvalid_count =  count_vect.transform(valid_x)

createTrainingAndValidation('Test')
createCountVectorizer('Test')
classifier = train_model(linear_model.LogisticRegression(), xtrain_count, train_y)
predictions = getPredictions(classifier, xvalid_count)

I was using a DataFrame called finalDF that contains all of the labelled text. Since this model gives me 0.68 accuracy, I was going to test it on a subset of the DataFrame with unknown labels, which was not included in the training and testing phase. I saved the trained model as bestClassifier.

Now I have the subset of unknown text and tried to do the following:

count_vect = CountVectorizer()
count_vect.fit(unknownDf['Text'])
text = unknownDf['Text']
xvalid_count =  count_vect.transform(text)

bestClassifier.predict(xvalid_count)

finalDF has 800 rows, while unknownDf has only 32 rows after what I do above. How do I rectify this?

Answer 1

I think I see what's going on. In this piece of code:

def createCountVectorizer(column):
    global xtrain_count, xvalid_count
    # create a count vectorizer object 
    count_vect = CountVectorizer()
    count_vect.fit(finalDF[column])

    # transform the training and validation data using count vectorizer object
    xtrain_count =  count_vect.transform(train_x)
    xvalid_count =  count_vect.transform(valid_x)

You are declaring a CountVectorizer(), calling fit on it, and then calling transform. What you need to do is use that same CountVectorizer() to transform unknownDf['Text'].
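To make the point concrete, here is a minimal sketch of why the same fitted vectorizer matters. The two small corpora are made up for illustration and are not the asker's data. The vocabulary, and therefore the number and order of feature columns, is learned when the vectorizer is fit, so a second vectorizer fit on different text generally produces a different column layout.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpora, invented for this illustration (assumption, not the asker's data).
train_texts = ["warranty deed of trust", "quit claim deed", "deed of trust"]
unknown_texts = ["special warranty deed", "trust agreement"]

# Fit once: the vocabulary (column layout) is learned from the training text.
count_vect = CountVectorizer()
x_train = count_vect.fit_transform(train_texts)

# Reusing the fitted vectorizer keeps the training column layout.
x_unknown_same = count_vect.transform(unknown_texts)

# Fitting a new vectorizer on the unknown text learns a different vocabulary.
x_unknown_refit = CountVectorizer().fit_transform(unknown_texts)

print("train columns:          ", x_train.shape[1])
print("unknown, same vectorizer:", x_unknown_same.shape[1])
print("unknown, new vectorizer: ", x_unknown_refit.shape[1])
```

A classifier trained on x_train expects the first layout, so only the matrix produced by the reused vectorizer is a valid input to it.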

When you do this:

count_vect = CountVectorizer()
count_vect.fit(unknownDf['Text'])
text = unknownDf['Text']
xvalid_count =  count_vect.transform(text)

You are creating a brand new CountVectorizer(), which builds a new bag of words from unknownDf['Text']. What you should do instead is remove these two lines

count_vect = CountVectorizer()
count_vect.fit(unknownDf['Text'])

and let the existing CountVectorizer(), which you fit on finalDF[column], transform unknownDf['Text'].

Find a way to use the CountVectorizer() that you declared as count_vect inside createCountVectorizer(column) to transform unknownDf['Text'].
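One possible way to do that, sketched under assumptions: have the helper return the fitted vectorizer instead of keeping it as a local variable, then reuse it at prediction time. The function name create_count_vectorizer, the toy texts, and the labels below are hypothetical stand-ins for the asker's finalDF and unknownDf.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


def create_count_vectorizer(texts):
    """Fit a CountVectorizer on `texts` and return it so it can be reused."""
    count_vect = CountVectorizer()
    count_vect.fit(texts)
    return count_vect


# Hypothetical stand-ins for the labelled and unlabelled data.
train_texts = [
    "warranty deed of trust",
    "quit claim deed",
    "special warranty deed",
    "quit claim release",
]
train_labels = ["warranty", "quitclaim", "warranty", "quitclaim"]
unknown_texts = ["general warranty deed", "quit claim of interest"]

# Fit the vectorizer once and keep a reference to it.
count_vect = create_count_vectorizer(train_texts)

# Train the classifier on features produced by that vectorizer.
classifier = LogisticRegression()
classifier.fit(count_vect.transform(train_texts), train_labels)

# Transform the unknown text with the SAME vectorizer: no second fit.
predictions = classifier.predict(count_vect.transform(unknown_texts))
print(predictions)
```

If the model is saved for later use, the fitted vectorizer has to be saved along with it, since the classifier alone cannot rebuild the vocabulary it was trained on.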

Answer 1 (score 2, accepted), 2018-08-24 13:17:26
