How fit_transform, transform, and TfidfVectorizer work

Question

I'm working on a fuzzy matching project and I have found a very interesting method: awesome_cossim_top.

I broadly understand the definition, but I do not understand what happens when we call fit_transform.

import pandas as pd
import sqlite3 as sql
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from scipy.sparse import csr_matrix
import sparse_dot_topn.sparse_dot_topn as ct
import re

def ngrams(string, n=3):
    string = re.sub(r'[,-./]|\sBD',r'', re.sub(' +', ' ',str(string)))
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]

def awesome_cossim_top(A, B, ntop, lower_bound=0):
    # force A and B as a CSR matrix.
    # If they have already been CSR, there is no overhead
    A = A.tocsr()
    B = B.tocsr()
    M, _ = A.shape
    _, N = B.shape

    idx_dtype = np.int32

    nnz_max = M*ntop

    indptr = np.zeros(M+1, dtype=idx_dtype)
    indices = np.zeros(nnz_max, dtype=idx_dtype)
    data = np.zeros(nnz_max, dtype=A.dtype)

    ct.sparse_dot_topn(
            M, N, np.asarray(A.indptr, dtype=idx_dtype),
            np.asarray(A.indices, dtype=idx_dtype),
            A.data,
            np.asarray(B.indptr, dtype=idx_dtype),
            np.asarray(B.indices, dtype=idx_dtype),
            B.data,
            ntop,
            lower_bound,
            indptr, indices, data)

    print('ct.sparse_dot_topn: ', ct.sparse_dot_topn)
    return csr_matrix((data,indices,indptr),shape=(M,N))

# Defined at module level: nested after the return above it would be unreachable.
def get_matches_df(sparse_matrix, A, B, top=100):
    non_zeros = sparse_matrix.nonzero()

    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]

    if top:
        nr_matches = top
    else:
        nr_matches = sparsecols.size

    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similairity = np.zeros(nr_matches)

    for index in range(0, nr_matches):
        left_side[index] = A[sparserows[index]]
        right_side[index] = B[sparsecols[index]]
        similairity[index] = sparse_matrix.data[index]

    return pd.DataFrame({'left_side': left_side,
                         'right_side': right_side,
                         'similairity': similairity})

Here is the script where I get confused: why should we first call fit_transform and then only transform, using the SAME vectorizer? I tried printing some output from the vectorizer and the matrices, such as print(vectorizer.get_feature_names()), but I do not understand the logic.

Can anyone help me clarify this?

Thanks a lot!

Col_clean = 'fruits_normalized'
Col_dirty = 'fruits'

#read table
data_dirty={f'{Col_dirty}':['I am an apple', 'You are an apple', 'Aple', 'Appls', 'Apples']}
data_clean= {f'{Col_clean}':['apple', 'pear', 'banana', 'apricot', 'pineapple']}

df_clean = pd.DataFrame(data_clean)
df_dirty = pd.DataFrame(data_dirty)

Name_clean = df_clean[f'{Col_clean}'].unique()
Name_dirty= df_dirty[f'{Col_dirty}'].unique()

vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
clean_idf_matrix = vectorizer.fit_transform(Name_clean)
dirty_idf_matrix = vectorizer.transform(Name_dirty)

matches = awesome_cossim_top(dirty_idf_matrix, clean_idf_matrix.transpose(),1,0)
matches_df = get_matches_df(matches, Name_dirty, Name_clean, top = 0)

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    matches_df.to_excel("output_apple.xlsx")

print('done')

Answer 1

TfidfVectorizer.fit_transform learns the vocabulary (and the IDF weights) from the training dataset and returns the transformed training matrix. TfidfVectorizer.transform maps that already-learned vocabulary onto the test dataset, so that the test data has the same number of features as the training data. The example below might help:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

Create some dummy training data:

train = pd.DataFrame({'Text' :['I am a data scientist','Cricket is my favorite sport', 'I work on Python regularly', 'Python is very fast for data mining', 'I love playing cricket'],
                      'Category' :['Data_Science','Cricket','Data_Science','Data_Science','Cricket']})

And a small test dataset:

test = pd.DataFrame({'Text' :['I am new to data science field', 'I play cricket on weekends', 'I like writing Python codes'],
                         'Category' :['Data_Science','Cricket','Data_Science']})

Create a TfidfVectorizer() object called vectorizer:

vectorizer = TfidfVectorizer()

Fit it on the training data:

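X_train = vectorizer.fit_transform(train['Text'])
# Note: get_feature_names() was removed in scikit-learn 1.2;
# on newer versions use get_feature_names_out() instead.
print(vectorizer.get_feature_names())

#['am', 'cricket', 'data', 'fast', 'favorite', 'for', 'is', 'love', 'mining', 'my', 'on', 'playing', 'python', 'regularly', 'scientist', 'sport', 'very', 'work']

feature_names = vectorizer.get_feature_names()
# Use X_train here (the original referenced an undefined name X)
df= pd.DataFrame(X_train.toarray(),columns=feature_names)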

Now see what happens if you do the same on the test dataset with a new vectorizer:

vectorizer_test = TfidfVectorizer()
X_test = vectorizer_test.fit_transform(test['Text'])
print(vectorizer_test.get_feature_names())

#['am', 'codes', 'cricket', 'data', 'field', 'like', 'new', 'on', 'play', 'python', 'science', 'to', 'weekends', 'writing']
feature_names_test = vectorizer_test.get_feature_names()
df_test= pd.DataFrame(X_test.toarray(),columns = feature_names_test)

This creates a different vocabulary from the test dataset, with 14 unique words (columns) compared to the 18 words (columns) from the training data.
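To make the mismatch concrete, here is a minimal sketch that fits two separate vectorizers on the same example sentences and compares their vocabularies. The variable names (vec_train, vec_test, shared, only_in_test) are made up for illustration. It reads the fitted vocabulary_ attribute rather than get_feature_names(), so it should not depend on the scikit-learn version.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ['I am a data scientist', 'Cricket is my favorite sport',
               'I work on Python regularly', 'Python is very fast for data mining',
               'I love playing cricket']
test_texts = ['I am new to data science field', 'I play cricket on weekends',
              'I like writing Python codes']

# Two independent vectorizers, each learning its own vocabulary
vec_train = TfidfVectorizer()
X_train = vec_train.fit_transform(train_texts)
vec_test = TfidfVectorizer()
X_test = vec_test.fit_transform(test_texts)

print(X_train.shape)  # (5, 18)
print(X_test.shape)   # (3, 14)

# vocabulary_ maps each word to its column index
train_vocab = set(vec_train.vocabulary_)
test_vocab = set(vec_test.vocabulary_)
shared = sorted(train_vocab & test_vocab)
only_in_test = sorted(test_vocab - train_vocab)
print(shared)        # ['am', 'cricket', 'data', 'on', 'python']
print(only_in_test)  # words the training vocabulary has never seen

# Even a shared word ends up in a different column in each matrix
print(vec_train.vocabulary_['python'], vec_test.vocabulary_['python'])  # 12 9
```

So the two matrices differ in width, and even the words they share do not line up column by column.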

Now, if you train a machine learning algorithm on the training data for text classification and then try to make predictions on the matrix built from the test data, it will fail with an error saying that the features differ between the training and test data.
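The failure can be reproduced with a small sketch. LogisticRegression is only an assumed stand-in classifier here (the answer does not name one), and the exact error message varies between scikit-learn versions, so the code just catches the ValueError.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ['I am a data scientist', 'Cricket is my favorite sport',
               'I work on Python regularly', 'Python is very fast for data mining',
               'I love playing cricket']
train_labels = ['Data_Science', 'Cricket', 'Data_Science', 'Data_Science', 'Cricket']
test_texts = ['I am new to data science field', 'I play cricket on weekends',
              'I like writing Python codes']

# Train on the 18-column matrix built from the training vocabulary
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
clf = LogisticRegression()
clf.fit(X_train, train_labels)

# Refit a *new* vectorizer on the test data: 14 columns, different meaning
X_test_refit = TfidfVectorizer().fit_transform(test_texts)

try:
    clf.predict(X_test_refit)
    mismatch_error = None
except ValueError as exc:
    # The classifier expects 18 features but receives 14
    mismatch_error = exc
    print('Prediction failed:', exc)
```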

To avoid this error, we do something like this in text classification:

X_test_from_train = vectorizer.transform(test['Text'])
feature_names_test_from_train = vectorizer.get_feature_names()
df_test_from_train = pd.DataFrame(X_test_from_train.toarray(),columns = feature_names_test_from_train)

Notice that we did not call fit_transform on the test data; we called transform instead. The reason is the same as above: when making predictions on test data, we want it expressed in exactly the features learned from the training data, so that there is no feature mismatch error. Words in the test data that are not in the training vocabulary are simply ignored.

Hope this helps!
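As a rough sketch of why this also matters for the fuzzy matching in the question: once both matrices come from the same fitted vectorizer, they share the same columns and their dot product is well defined. The names unseen, similarity and best_match are illustrative only. The sketch relies on TfidfVectorizer L2-normalising each row by default, which makes the dot product a cosine similarity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ['I am a data scientist', 'Cricket is my favorite sport',
               'I work on Python regularly', 'Python is very fast for data mining',
               'I love playing cricket']
test_texts = ['I am new to data science field', 'I play cricket on weekends',
              'I like writing Python codes']

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learn vocabulary + IDF, then transform
X_test = vectorizer.transform(test_texts)        # reuse the learned vocabulary only

print(X_train.shape, X_test.shape)  # (5, 18) (3, 18)

# Words never seen during fit get no column; transform silently drops them
unseen = [w for w in ['weekends', 'codes'] if w not in vectorizer.vocabulary_]
print(unseen)  # ['weekends', 'codes']

# Same columns on both sides, so (3 x 18) @ (18 x 5) is a valid product.
# Rows are L2-normalised by default, so each entry is a cosine similarity.
similarity = (X_test @ X_train.T).toarray()
print(similarity.shape)  # (3, 5)

# Index of the most similar training sentence for each test sentence
best_match = similarity.argmax(axis=1)
print(best_match)
```

The script in the question follows the same pattern: it fits on the clean names, transforms the dirty names with the same vectorizer, and multiplies the two matrices, except that it keeps only the top matches per row through awesome_cossim_top.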

Answer 1: accepted, 2020-03-12 10:53:02
