
How to find best match with sklearn pipeline in Python

I've got a Pipeline set up using a TfidfVectorizer and TruncatedSVD. I train the models with sklearn and calculate the distance between two vectors using cosine similarity. Here's my code:

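import logging

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances
from sklearn.pipeline import Pipeline


def create_scikit_corpus(leaf_names=None):

    vectorizer = TfidfVectorizer(
        tokenizer=Tokenizer(),  # the asker's own tokenizer class (not shown)
        stop_words='english',
        use_idf=True,
        smooth_idf=True
    )

    # TruncatedSVD takes n_iter (there is no n_iterations parameter)
    svd_model = TruncatedSVD(n_components=300,
                             algorithm='randomized',
                             n_iter=10,
                             random_state=42)
    svd_transformer = Pipeline([('tfidf', vectorizer),
                                ('svd', svd_model)])

    svd_matrix = svd_transformer.fit_transform(leaf_names)

    logging.info("Models created")

    test = "This is a test search query."
    # transform() expects an iterable of documents, so wrap the string in a list
    query_vector = svd_transformer.transform([test])
    distance_matrix = pairwise_distances(query_vector, svd_matrix, metric='cosine')

    return svd_transformer, svd_matrix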

The thing is that I'm not sure what to do once I have the distance_matrix variable. I guess I'm kinda confused about exactly what it is.

I'm trying to find which document matches best with my query. Thanks for a push in the right direction!

Once you have the distance_matrix computed, you can find the closest singular vector according to the cosine similarity. And that might be the reason you are confused: what does this singular vector represent?

The problem is that the answer is not straightforward: the singular vector is usually not a document in your corpus.

If what you want is the best match as in "the document from your corpus that is the most similar to this query", there is something simpler to do: pick the closest document according to cosine similarity. You do not need SVD for this approach.
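A minimal sketch of that simpler approach, assuming a plain TfidfVectorizer with the default tokenizer rather than the custom Tokenizer from the question. The three-document corpus and the variable names are made up for illustration; in the question's setting, leaf_names would take the place of corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for leaf_names.
corpus = [
    "red apple pie recipe",
    "python machine learning tutorial",
    "test search query example",
]

# Fit TF-IDF on the corpus: one row per document.
vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(corpus)

# transform() expects an iterable of documents, hence the list.
query = "This is a test search query."
query_vector = vectorizer.transform([query])

# cosine_similarity returns shape (1, n_documents); take the single row.
similarities = cosine_similarity(query_vector, doc_matrix)[0]

# The best match is the document with the highest similarity.
best_index = int(similarities.argmax())
best_document = corpus[best_index]
print(best_index, best_document)
```

If you work with a cosine distance matrix instead, as returned by pairwise_distances(..., metric='cosine'), the same idea applies with argmin in place of argmax, since cosine distance is 1 minus cosine similarity. Each column of that matrix lines up with one input document, because fit_transform returns one row per document.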
