
How to use sparse matrix in python hcluster?

I'm trying to use the hcluster library in Python, but I don't know enough Python to use a sparse matrix with it. Any help would be appreciated. Here is what I'm doing:

import os.path
import numpy
import scipy
import scipy.io 
from hcluster import squareform, pdist, linkage, complete 
from hcluster.hierarchy import linkage, from_mlab_linkage 
from numpy import savetxt 
from StringIO import StringIO 

data.dmp contains a matrix that looks like this:

  A B C D
A 0 1 0 1 
B 1 0 0 1 
C 0 0 0 0 
D 1 1 0 0 

The file contains only the upper-right part of the matrix, that is, the numbers above the main diagonal. So data.dmp contains: 1 0 1, 0 1, 0
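For illustration, the layout described above can be expanded back into the full symmetric matrix. This is a minimal pure-Python sketch; the helper name `upper_triangle_to_square` is hypothetical, and it assumes the values are listed row by row with a zero diagonal, which is what the example file suggests.

```python
def upper_triangle_to_square(values, n):
    """Expand the strict upper triangle (listed row by row) into a full
    n x n symmetric matrix with zeros on the diagonal."""
    expected = n * (n - 1) // 2
    if len(values) != expected:
        raise ValueError("expected %d values, got %d" % (expected, len(values)))
    square = [[0] * n for _ in range(n)]
    it = iter(values)
    for i in range(n):
        for j in range(i + 1, n):
            # Mirror each value across the main diagonal.
            square[i][j] = square[j][i] = next(it)
    return square


# The six values from the example file, for the 4 x 4 matrix A..D.
full = upper_triangle_to_square([1, 0, 1, 0, 1, 0], 4)
for row in full:
    print(row)
```

This is the same correspondence between a flat vector and a square matrix that squareform works with.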

f = file('data.dmp','r')  
s = StringIO(f.readline()).getvalue()
f.close()

matrix = numpy.asarray(eval("["+s+"]"))

For a reason I don't understand, hcluster seems to use inverted values. For example, I use 0 if A != C and 1 if A == D.
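One possible reading of the inversion, offered as an assumption rather than a diagnosis: hierarchical clustering works on distances, where a smaller number means "more alike", while the matrix above stores 1 for a match. Under that assumption, 0/1 similarities could be flipped into distances before clustering. The helper name `similarity_to_distance` is hypothetical.

```python
def similarity_to_distance(similarities, max_similarity=1.0):
    """Turn similarity scores (larger = more alike) into distances
    (smaller = more alike) by subtracting from the maximum similarity."""
    return [max_similarity - s for s in similarities]


# The upper-triangle values from data.dmp, read as similarities.
distances = similarity_to_distance([1, 0, 1, 0, 1, 0])
print(distances)
```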

sqfrm = squareform(matrix)
Y = pdist(sqfrm, metric="cosine")

Then I run linkage on Y:

Z = linkage(Y, method="complete")

So matrix Z is what I need (assuming I used hcluster correctly).

But I have the following problems:

  1. I want to use a sparse matrix for the large amount of input data, because generating the input in its current dense form is time-consuming. I need to import the data into Python from another language, which is why I read it from a text file. Could the Python gurus please suggest how to do this?

  2. To people who have used Python hcluster: I need to process a large amount of data, hundreds of rows. Is that possible in hcluster? Does this algorithm really produce a correct HAC (hierarchical agglomerative clustering)?

Thank you for reading, I appreciate any help!

Represent each input as a dictionary mapping feature name to value. Zeros are not stored in the dictionary.
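As a rough sketch of how such dictionaries might be loaded from a text file written by another program: the file format here (one "row feature value" triple per line, listing only non-zero entries) is an assumption made for illustration, not something the question specifies, and `read_sparse_rows` is a hypothetical helper.

```python
import io


def read_sparse_rows(lines):
    """Parse 'row feature value' triples into {row: {feature: value}}.
    Zero values are skipped so the dictionaries stay sparse."""
    rows = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        row, feature, value = line.split()
        value = float(value)
        if value != 0.0:
            rows.setdefault(row, {})[feature] = value
    return rows


# Stand-in for an open file; with a real file, pass open(path) instead.
sample = io.StringIO("A B 1\nA D 1\nB A 1\nB D 1\nD A 1\nD B 1\n")
rows = read_sparse_rows(sample)
print(rows)
```

A row with no non-zero entries (C in the example matrix) simply never appears in the file, so it has to be added as an empty dictionary if it should take part in the clustering.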

Compute the Y matrix yourself instead of using hcluster.pdist. The following code computes a sparse squared error. Squared error is equivalent to cosine distance if you L2-normalize all feature vectors (for unit vectors it equals twice the cosine distance, so pairs are ranked the same way).

def sqrerr(repr1, repr2):
    """
    Compute the squared error between two reprs.
    Each repr is a dict from feature to feature value;
    a missing feature counts as 0.
    """
    # Union of the features in either repr (works on Python 2 and 3).
    keys = set(repr1) | set(repr2)
    total = 0.
    for k in keys:
        diff = repr1.get(k, 0.) - repr2.get(k, 0.)
        total += diff * diff
    return total
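The relationship between squared error and cosine distance can be checked numerically. The sketch below is self-contained, and its helper names (`l2_normalize`, `sparse_sqrerr`, `cosine_distance`) are chosen for illustration only. It relies on the identity that for unit-length vectors the squared error equals 2 * (1 - cosine similarity).

```python
import math


def l2_normalize(rep):
    """Scale a sparse vector (dict) to unit Euclidean length.
    An all-zero vector is returned unchanged, since it has no direction."""
    norm = math.sqrt(sum(v * v for v in rep.values()))
    if norm == 0.0:
        return dict(rep)
    return {k: v / norm for k, v in rep.items()}


def sparse_sqrerr(a, b):
    """Squared error between two sparse vectors; missing keys count as 0."""
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in set(a) | set(b))


def cosine_distance(a, b):
    """1 - cosine similarity; both vectors must be non-zero."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


a = {"B": 1.0, "D": 1.0}
b = {"A": 1.0, "D": 1.0}
lhs = sparse_sqrerr(l2_normalize(a), l2_normalize(b))
rhs = 2.0 * cosine_distance(a, b)
print(lhs, rhs)
```

Note that cosine distance is undefined for an all-zero row, so such rows need separate handling whichever measure is used.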

Call sqrerr for every Y[i,j] element you want to compute.

Make Y a square matrix, and make sure that Y[i,j] == Y[j,i]. Use hcluster.squareform to convert Y to the condensed form that hcluster.linkage accepts.
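A minimal sketch of these steps in plain Python follows. It is a variation rather than the exact recipe above: instead of filling a square matrix and converting it, it writes the pairs with i < j directly in row-major order, which is the order squareform produces for a symmetric matrix with a zero diagonal. The helper name `condensed_distances` and the sample data are made up for illustration.

```python
def sqrerr(repr1, repr2):
    """Squared error between two sparse vectors stored as dicts."""
    total = 0.0
    for k in set(repr1) | set(repr2):
        diff = repr1.get(k, 0.0) - repr2.get(k, 0.0)
        total += diff * diff
    return total


def condensed_distances(reprs):
    """Distances for all pairs i < j, row by row, as a flat list."""
    n = len(reprs)
    return [sqrerr(reprs[i], reprs[j])
            for i in range(n)
            for j in range(i + 1, n)]


# Three made-up sparse inputs.
reprs = [{"x": 1.0}, {"x": 1.0, "y": 2.0}, {"y": 2.0}]
Y = condensed_distances(reprs)
print(Y)
```

The resulting flat list has n * (n - 1) / 2 entries and could then be passed to linkage(Y, method="complete") as in the question. For a few hundred rows that is tens of thousands of pairs, which is small enough to compute this way.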
