
scikit-learn Normalizer process on sparse matrix

I'm trying to normalize the data in a sparse matrix (a TF, i.e. term-frequency, matrix), and I have two questions:

Is it correct to use sklearn.preprocessing.Normalizer just to normalize my matrix?

Does it make sense to normalize the TF matrix and then use it for clustering?

My matrix looks like this:

 (0, 0) 1
 (7, 0) 1
 (13, 0)    1
 (31, 0)    4
 (97, 0)    3
 (99, 0)    1

I use this code, taken from the sklearn API documentation:

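 from sklearn.preprocessing import Normalizer

 # fit() does nothing here: Normalizer is stateless and rescales each row on its own
 transformer = Normalizer(norm='l2', copy=True).fit(sparse_matrix)
 normalized = transformer.transform(sparse_matrix)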

Here sparse_matrix is my TF matrix.

The output is:

 (0, 0) 0.04822428221704121
 (0, 1) 0.04822428221704121
 (0, 2) 0.04822428221704121
 (0, 3) 0.14467284665112365
 (0, 4) 0.04822428221704121
 (0, 5) 0.04822428221704121
 (0, 6) 0.09644856443408242
 (0, 7) 0.19289712886816485
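For reference, here is a minimal sketch of what Normalizer does to a sparse count matrix. The toy matrix `tf` is made up for illustration and is not the matrix from the question; the point is only that each row (document) is rescaled to unit Euclidean length and the result stays sparse.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from sklearn.preprocessing import Normalizer

# Hypothetical toy TF matrix: 3 documents (rows) x 4 terms (columns).
tf = csr_matrix(np.array([
    [1, 0, 4, 3],
    [0, 2, 0, 0],
    [1, 1, 1, 1],
], dtype=np.float64))

# Normalizer works row by row, so fit() learns nothing from the data.
normalized = Normalizer(norm="l2").fit_transform(tf)

# Euclidean length of every row after normalization.
row_norms = np.sqrt(np.asarray(normalized.multiply(normalized).sum(axis=1)).ravel())
print(row_norms)
```

Note that the scaling is per row, not per column, so the relative weights of the terms inside one document are preserved.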

Since this is the first time I've done this, I want to make sure I'm not getting it wrong. I want to run clustering on the normalized data to see how it differs from TF-IDF. Sorry if this question sounds silly, but I'm learning this from scratch.

The tfidf matrix produced by sklearn is already normalized in an appropriate way (TfidfVectorizer and TfidfTransformer use norm='l2' by default).
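One way to see this, as a rough sketch on made-up counts: with the default settings of TfidfTransformer, running Normalizer on its output should change nothing, because the rows already have unit L2 length.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import Normalizer

# Hypothetical toy count matrix: 3 documents x 4 terms.
counts = csr_matrix(np.array([
    [1, 0, 4, 3],
    [0, 2, 0, 0],
    [1, 1, 1, 1],
]))

# Default TfidfTransformer: IDF weighting followed by L2 row normalization.
tfidf = TfidfTransformer().fit_transform(counts)

# Normalizing again should leave the matrix unchanged.
renormalized = Normalizer(norm="l2").transform(tfidf)
unchanged = np.allclose(renormalized.toarray(), tfidf.toarray())
print(unchanged)
```

If norm=None is passed to TfidfTransformer instead, the rows are not rescaled, and an explicit Normalizer step would then make a difference.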

The usual normalization is to unit L2 length, so that the dot product, Euclidean distance, and cosine similarity all produce the same ranking. From a theoretical point of view (and you should always consider the why), this corresponds to normalizing for document length: a document made by concatenating another document with itself produces the same vector as the original.
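Both claims can be illustrated with a small NumPy sketch. The vectors and the helper `l2_normalize` are invented for this example. For unit-length vectors a and c, the squared Euclidean distance equals 2 - 2 * (a · c), so a larger dot product (or cosine) always means a smaller distance, which is why the rankings agree.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit Euclidean length (hypothetical helper)."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

doc = np.array([1, 0, 4, 3])      # term counts of some document
doubled = 2 * doc                 # the same document concatenated with itself
other = np.array([0, 2, 1, 1])    # an unrelated document

a, b, c = l2_normalize(doc), l2_normalize(doubled), l2_normalize(other)

# Doubling every count does not change the normalized vector.
same_vector = np.allclose(a, b)

# On unit vectors the dot product is the cosine, and it determines the distance.
dot_ac = a @ c
sq_dist_ac = np.sum((a - c) ** 2)
identity_holds = np.isclose(sq_dist_ac, 2 - 2 * dot_ac)
print(same_vector, identity_holds)
```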

Think for a few minutes about how to check that the matrix is indeed normalized this way. It is a one-line expression involving dot.
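One possible form of that check, offered as a sketch rather than the only answer: the diagonal of X · Xᵀ holds each row's dot product with itself, which is 1 for a unit-length row. The example documents are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus.
docs = ["the cat sat on the mat", "the dog barked", "cats and dogs"]
X = TfidfVectorizer().fit_transform(docs)  # sparse, L2-normalized by default

# The one-liner: every row's dot product with itself should be 1.
is_unit_length = np.allclose(X.dot(X.T).diagonal(), 1.0)
print(is_unit_length)
```

X.dot(X.T) builds a full documents-by-documents matrix, so on a large corpus it may be cheaper to compute the row sums of the element-wise squares instead.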
