python: can I have a sparse matrix representation without (explicitly) using integer indices?

I have a dataset that is essentially a sparse binary matrix representing relationships between elements of two sets. For example, let the 1st set be people (represented by their names), e.g. something like this:

people = set(['john','jane','mike','joe'])

and the 2nd set be a bunch of binary attributes, e.g.

attrs = set(['likes_coffee','has_curly_hair','has_dark_hair','drives_car','man_u_fan'])

The dataset is represented by a tab-separated data file that assigns some of the attributes to each person, e.g.

john    likes_coffee
john    drives_car
john    has_curly_hair
jane    has_curly_hair
jane    man_u_fan
...

attrs has about 30,000 elements and people can be as large as 6,000,000, but the data is sparse, i.e. each person has at most 30-40 attributes.

I am looking for a data structure/class in python that would allow me:

  • To quickly create a matrix object representing the dataset from the corresponding data file
  • To quickly extract individual elements of the matrix as well as blocks of its rows and columns. For example, I want to answer questions like
    • "Give me a list of all people with {'has_curly_hair','likes_coffee','man_u_fan'}"
    • "Give me a union of attributes of {'mike','joe'}"

My current implementation uses a pair of arrays for the two sets and a scipy sparse matrix. So if

people = ['john','jane','mike','joe']
attrs = ['likes_coffee','has_curly_hair','has_dark_hair','drives_car','man_u_fan']

then I would create a sparse matrix data of size 4 x 5, and the sample data shown above would correspond to the elements

data[0,0]
data[0,3]
data[0,1]
data[1,1]
data[1,4]
...

I also maintain two inverse indices (label to integer position) so that I don't have to invoke people.index('mike') or attrs.index('has_curly_hair') too often.
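A minimal sketch of the setup described above, assuming the data file has already been parsed into (person, attribute) pairs. The name build_labeled_matrix and the variable names are hypothetical, chosen only for illustration; the scipy calls are the standard coo_matrix constructor and the tocsr/tocsc conversions.

```python
import numpy as np
from scipy.sparse import coo_matrix


def build_labeled_matrix(pairs):
    """Return a binary CSR matrix plus label -> index maps (hypothetical helper)."""
    person_index, attr_index = {}, {}
    rows, cols = [], []
    for person, attr in pairs:
        # setdefault hands out the next free integer the first time a label is seen
        rows.append(person_index.setdefault(person, len(person_index)))
        cols.append(attr_index.setdefault(attr, len(attr_index)))
    shape = (len(person_index), len(attr_index))
    values = np.ones(len(rows), dtype=np.int8)
    matrix = coo_matrix((values, (rows, cols)), shape=shape).tocsr()
    matrix.data[:] = 1  # repeated pairs are summed on conversion; reset to binary
    return matrix, person_index, attr_index


pairs = [
    ("john", "likes_coffee"),
    ("john", "drives_car"),
    ("john", "has_curly_hair"),
    ("jane", "has_curly_hair"),
    ("jane", "man_u_fan"),
]
matrix, person_index, attr_index = build_labeled_matrix(pairs)

# Dicts preserve insertion order, so position i holds the label of row/column i.
person_labels = list(person_index)
attr_labels = list(attr_index)

# All attributes of 'john': read the column indices stored for his row.
i = person_index["john"]
john_attrs = sorted(attr_labels[j] for j in matrix.indices[matrix.indptr[i]:matrix.indptr[i + 1]])

# All people with 'has_curly_hair': same idea on the column-oriented (CSC) copy.
csc = matrix.tocsc()
j = attr_index["has_curly_hair"]
curly = sorted(person_labels[k] for k in csc.indices[csc.indptr[j]:csc.indptr[j + 1]])
```

The two dicts play the role of the inverse indices; every lookup still goes through an explicit label-to-integer step.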

This works OK, but I have to maintain the indices explicitly. This is cumbersome, for instance, when I have two datasets with different sets of people and/or attributes and I need to match the rows/columns corresponding to the same person/attribute in the two sparse matrices.

So is there an alternative that would allow me to avoid integer indices and instead use the actual elements of the two sets to extract rows/columns, i.e. something like

data['john',:]  # give me all attributes of 'john'
data[:,['has_curly_hair','drives_car']] # give me all people who 'has_curly_hair' or 'drives_car'

?

Assuming that no library does exactly what you want, you can create your own class SparseMatrix and overload the [] operator. Here is one way to do it (the constructor might differ from what you want):

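class SparseMatrix:
    def __init__(self, x_label, y_label):
        # Dict of dicts: row label -> {attribute label: 1}
        self.data = {}
        for x, y in zip(x_label, y_label):
            self.data[x] = {}
            for attr in y:
                self.data[x][attr] = 1

    def __getitem__(self, index):
        x, y = index
        if isinstance(x, str):
            if isinstance(y, str):
                # data['john', 'likes_coffee'] -> 1 or 0
                return 1 if y in self.data[x] else 0
            if isinstance(y, slice):
                # data['john', :] -> attributes of one row, sorted for a stable order
                return sorted(self.data[x])
        if isinstance(x, slice):
            if isinstance(y, str):
                # data[:, 'drives_car'] -> rows that have this attribute
                return [key for key in self.data if y in self.data[key]]
            if isinstance(y, list):
                # data[:, [a, b]] -> rows that have any listed attribute, no duplicates
                res, seen = [], set()
                for attr in y:
                    for key in self[x, attr]:
                        if key not in seen:
                            seen.add(key)
                            res.append(key)
                return res
        raise TypeError("unsupported index: %r" % (index,))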

And in the REPL, I get:

> data = SparseMatrix(['john','jane','mike','joe'],[['likes_coffee','has_curly_hair'],['has_dark_hair'],['drives_car'],['man_u_fan']])

> data['john',:]
['has_curly_hair', 'likes_coffee']

> data[:,['has_curly_hair','drives_car']]
['john', 'mike']

One of the sparse formats is actually a dictionary. A dok_matrix is a dictionary subclass whose keys are of the form (1,100), (30,334), that is, tuples of the i,j indices.

But I found out in other SO questions that element access in this format is actually slower than regular dictionary access. That is, d[1,100] on a dok_matrix d is slower than the equivalent dd[(1,100)] on a plain dictionary dd. I found that it was fastest to build a regular dictionary and then use update to add the values to the sparse dok.

But dok is useful if you want to transform the matrix to one of the computationally friendly formats like csr. And of course you can index a sparse matrix with d[100,:], something which is impossible with a regular dictionary.
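As a small illustration of that workflow, here is a sketch that fills a dok_matrix with the 4 x 5 sample from the question, converts it to csr, and slices a row and a column. The variable names are arbitrary; the integer positions are the ones listed in the question.

```python
from scipy.sparse import dok_matrix

# Integer positions of the sample entries: data[0,0], data[0,3], data[0,1], ...
entries = [(0, 0), (0, 3), (0, 1), (1, 1), (1, 4)]

d = dok_matrix((4, 5), dtype="int8")
for i, j in entries:
    d[i, j] = 1  # element-wise assignment is what the dok format is meant for

csr = d.tocsr()  # convert once, then slice and compute on the csr copy

# Row 0 as a dense list of 0/1 flags.
row0 = csr[0, :].toarray().ravel().tolist()

# Row numbers that have a nonzero entry in column 1.
col1_rows = sorted(csr[:, 1].nonzero()[0].tolist())
```

Slicing still uses integers here; mapping labels to those integers remains the caller's job.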

For some uses a default dictionary can be quick and useful. In other words, a dictionary where the keys are 'people' and the values are lists or other dictionaries with 'attribute' keys.
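A minimal sketch of that idea, using collections.defaultdict with sets as values and keeping a second, inverted dictionary so both kinds of query from the question stay cheap. The function names are hypothetical, and the parser assumes that labels contain no whitespace.

```python
from collections import defaultdict


def load_pairs(lines):
    """Build person -> attrs and attr -> people maps from 'person<TAB>attr' lines."""
    attrs_of = defaultdict(set)
    people_with = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        person, attr = line.split()  # assumes no whitespace inside labels
        attrs_of[person].add(attr)
        people_with[attr].add(person)
    return attrs_of, people_with


def people_with_all(people_with, attrs):
    """People who have every attribute in attrs (empty set for an empty query)."""
    sets = [people_with.get(a, set()) for a in attrs]
    return set.intersection(*sets) if sets else set()


def union_of_attrs(attrs_of, people):
    """Union of the attributes of the given people."""
    return set().union(*(attrs_of.get(p, set()) for p in people))


sample = [
    "john\tlikes_coffee",
    "john\tdrives_car",
    "john\thas_curly_hair",
    "jane\thas_curly_hair",
    "jane\tman_u_fan",
]
attrs_of, people_with = load_pairs(sample)
both = people_with_all(people_with, {"has_curly_hair", "likes_coffee"})
union = union_of_attrs(attrs_of, {"john", "jane"})
```

Using .get instead of indexing in the query helpers avoids creating empty entries for unknown labels as a side effect of a lookup.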

Anyway, sparse matrices make no provision for word indices. Remember, their roots are in linear algebra: calculating matrix products and inverses of large sparse numeric matrices. Their use for text databases is relatively recent.
