How to use a sparse matrix in Python hcluster?

Question

I'm trying to use the hcluster library in Python, but I don't know enough Python to use a sparse matrix with hcluster. Can anybody help? Here is what I'm doing:

import os.path
import numpy
import scipy
import scipy.io 
from hcluster import squareform, pdist, linkage, complete 
from hcluster.hierarchy import linkage, from_mlab_linkage 
from numpy import savetxt 
from StringIO import StringIO

data.dmp contains a matrix that looks like:

and contains only the upper-right part of the matrix, that is, all the numbers above the main diagonal (I don't know the correct English term for it). So data.dmp contains: 1 0 1, 0 1 , 0

f = file('data.dmp','r')  
s = StringIO(f.readline()).getvalue()
f.close()

matrix = numpy.asarray(eval("["+s+"]"))

For a reason I don't understand, hcluster uses inverted values; for example, I use 0 if A != C and 1 if A == D.

sqfrm = squareform(matrix)
Y = pdist(sqfrm, metric="cosine")

Compute the linkage of Y:

Z = linkage(Y, method="complete")

So matrix Z is what I need (assuming I used hcluster correctly?).

But I have the following problems:

I want to use a sparse matrix for the huge amount of input data, because generating the input data the way I do now is time-consuming. I need to import the data into Python from another language, which is why I need to read a text file. Python gurus, could you suggest how to do this?
To people who have used Python hcluster: I need to process a huge amount of data, hundreds of rows. Is that possible in hcluster? Does this algorithm really produce a correct HAC?

Thank you for reading, I appreciate any help!

Answer 1

Represent each input as a dictionary from feature name to value. Zeros are not stored in the dictionary.
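As a rough illustration of that representation, here is a minimal sketch that turns a dense row into such a dictionary. The helper name to_sparse and its names parameter are made up for this example; they are not part of hcluster.

```python
def to_sparse(row, names=None):
    """Convert a dense sequence into a {feature: value} dict, dropping zeros.

    `names` is an optional list of feature names (hypothetical parameter);
    when omitted, the column index is used as the feature name.
    """
    if names is None:
        names = range(len(row))
    return {name: value for name, value in zip(names, row) if value != 0}


dense = [0.0, 3.0, 0.0, 0.0, 1.5]
sparse = to_sparse(dense)
print(sparse)  # {1: 3.0, 4: 1.5}
```

Only the two nonzero entries are stored, so memory use grows with the number of nonzero features rather than with the full row length.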

Compute the Y matrix yourself instead of using hcluster.pdist. The following code computes a sparse squared error. Squared error is equivalent to cosine distance (up to a constant factor of 2) if you l2-normalize all feature vectors.

def sqrerr(repr1, repr2):
    """
    Compute the squared error between two reprs.
    The reprs are each a dict from feature to feature value.
    """
    # Union of the keys; works on Python 2 and 3, unlike keys() + keys().
    keys = frozenset(repr1) | frozenset(repr2)
    total = 0.
    for k in keys:
        # A feature missing from a dict counts as zero.
        diff = repr1.get(k, 0.) - repr2.get(k, 0.)
        total += diff * diff
    return total
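To see how squared error relates to cosine distance, here is a small numerical check. For unit-length vectors, ||a - b||^2 = 2 - 2 a·b = 2 (1 - cos(a, b)), so the squared error of l2-normalized vectors is twice the cosine distance. The helpers l2_normalize, sparse_sqrerr and cosine_distance are written for this sketch only and assume nonzero input vectors.

```python
import math


def l2_normalize(rep):
    """Return a copy of the sparse dict scaled to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in rep.values()))
    if norm == 0.0:
        return dict(rep)  # nothing to scale in an all-zero vector
    return {k: v / norm for k, v in rep.items()}


def sparse_sqrerr(a, b):
    """Squared Euclidean distance between two sparse dicts."""
    keys = set(a) | set(b)
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys)


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are nonzero."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


a = {"x": 3.0, "y": 4.0}
b = {"y": 1.0, "z": 2.0}

lhs = sparse_sqrerr(l2_normalize(a), l2_normalize(b))
rhs = 2.0 * cosine_distance(a, b)
assert abs(lhs - rhs) < 1e-9
```

Because the two quantities differ only by a constant factor, they rank every pair of inputs in the same order.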

Call sqrerr for every element Y[i,j] you want to compute.

Make Y a square matrix, and make sure that Y[i,j] == Y[j,i]. Use hcluster.squareform to convert Y to a form that hcluster.linkage accepts.
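Put together, the steps above might look like the sketch below. It is written against SciPy's scipy.spatial.distance.squareform and scipy.cluster.hierarchy.linkage, on the assumption that they behave like the hcluster functions of the same name (hcluster's code was folded into SciPy). The helper distance_matrix and the sample data are invented for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform


def sparse_sqrerr(a, b):
    """Squared Euclidean distance between two sparse dicts."""
    keys = set(a) | set(b)
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys)


def distance_matrix(reprs):
    """Build a symmetric n x n matrix of pairwise squared errors."""
    n = len(reprs)
    Y = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # Fill both halves so that Y[i, j] == Y[j, i].
            Y[i, j] = Y[j, i] = sparse_sqrerr(reprs[i], reprs[j])
    return Y


# Made-up sample inputs: two pairs of near-identical sparse vectors.
reprs = [
    {"a": 1.0},
    {"a": 1.0, "b": 0.1},
    {"c": 5.0},
    {"c": 5.0, "d": 0.2},
]

Y = distance_matrix(reprs)
condensed = squareform(Y)  # square matrix -> vector of n*(n-1)/2 distances
Z = linkage(condensed, method="complete")
```

The double loop only visits each pair once, but the matrix itself is still dense: n inputs need n*(n-1)/2 distances no matter how sparse the feature vectors are.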

Answer 1
2 2011-01-17 19:27:05
