
Representing a sparse matrix in Python and storing to disk

I have a large number of time series (millions) of varying length that I plan to do a clustering analysis on (probably using the sklearn implementation of k-means).

For my purposes I need to align the time series (such that the maximum value is centered), pad them with zeros (so they are all the same length), and normalize them before I can do the clustering analysis. So as a trivial example, something like:

[5, 0, 7, 10, 6]

would become something like:

[0, 0.5, 0, 0.7, 1, 0.6, 0, 0, 0]

In the real data, the raw time series are of length 90, and the padded/aligned/normed time series are of length 181. Of course, we have lots of zeros here, so a sparse matrix seems the ideal way of storing the data.
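The align/pad/normalize step described above could be sketched roughly as follows. This is only an illustration: the function name `align_pad_normalize` is hypothetical, and it assumes that "normalize" means dividing by the maximum (which is what the trivial example suggests) and that the maximum is positive.

```python
import numpy as np

def align_pad_normalize(series, padded_length):
    """Center the maximum, zero-pad to padded_length, scale so the max is 1.

    Assumes the maximum value is positive (otherwise the division is undefined).
    """
    values = np.asarray(series, dtype=float)
    peak = int(np.argmax(values))       # index of the first maximum
    center = padded_length // 2         # target index for the maximum
    offset = center - peak              # where the raw series starts in the output
    if offset < 0 or offset + len(values) > padded_length:
        raise ValueError("padded_length is too small to center the maximum")
    out = np.zeros(padded_length)
    out[offset:offset + len(values)] = values / values[peak]
    return out

print(align_pad_normalize([5, 0, 7, 10, 6], 9))
```

With a padded length of 181 and raw series of length 90, the maximum always lands at index 90 and the raw series always fits inside the padded row.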

Based on this, I have two related questions:

1 - How best to store these in memory? My current, inefficient method is to calculate the dense normed/aligned/padded row for each time series and write it to a simple text file for storage purposes, then separately read that data into a scipy.sparse lil_matrix:

import scipy.sparse

rows, columns = N, 181
matrix = scipy.sparse.lil_matrix((rows, columns))

with open(file_containing_dense_matrix_data) as f:
    for i, line in enumerate(f):
        # The first two values in each line are metadata
        values = [float(x) for x in line.strip().split(',')[2:]]
        # Assign inside the loop so every row is stored, not just the last one
        matrix[i] = values

This is both slow and more memory intensive than I had hoped. Is there a preferred method?

2 - Is there a better way to store the time series on disk? I have yet to find an efficient means to write the data to disk directly as a sparse matrix that I can read (relatively) quickly into memory at a later time.

My ideal response here is a method that addresses both questions, i.e. a method to store the dense matrix rows directly into a sparse data structure, and to efficiently read/write the data to/from disk.
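One possible way to store dense rows directly into a sparse structure is to accumulate the three CSR arrays (`data`, `indices`, `indptr`) row by row and build a `scipy.sparse.csr_matrix` once at the end. This is a minimal sketch, not a method taken from the thread; the helper name `rows_to_csr` is hypothetical, and it assumes the rows arrive one at a time from any iterable (for example, a generator over the aligned series).

```python
import numpy as np
import scipy.sparse

def rows_to_csr(rows, n_columns):
    """Build a CSR matrix from an iterable of dense rows, keeping only nonzeros."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        row = np.asarray(row, dtype=float)
        nonzero = np.flatnonzero(row)          # column indices of nonzero entries
        indices.extend(nonzero.tolist())
        data.extend(row[nonzero].tolist())
        indptr.append(len(indices))            # running count marks the row boundary
    return scipy.sparse.csr_matrix(
        (np.array(data, dtype=float),
         np.array(indices, dtype=np.int64),
         np.array(indptr, dtype=np.int64)),
        shape=(len(indptr) - 1, n_columns),
    )

dense_rows = [
    [0, 0.5, 0, 0.7, 1, 0.6, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0],
]
matrix = rows_to_csr(dense_rows, 9)
print(matrix.shape, matrix.nnz)
```

Only the current dense row is held in memory at any time, and CSR is a format that scikit-learn estimators such as `KMeans` accept as input.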

I would suggest using the pandas support for sparse matrices, and then its IO tools for writing, using e.g. HDF5.
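As an alternative to the pandas route suggested above (and not something the answer itself proposes), scipy can write a sparse matrix to disk directly with `scipy.sparse.save_npz` and read it back with `scipy.sparse.load_npz`. The sketch below is a small round trip; the file name `aligned_series.npz` and the temporary directory are placeholders chosen for illustration.

```python
import os
import tempfile

import numpy as np
import scipy.sparse

# A tiny stand-in for the aligned/padded/normalized data
dense = np.array([
    [0, 0.5, 0, 0.7, 1, 0.6, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0],
])
matrix = scipy.sparse.csr_matrix(dense)

# Write the sparse matrix to a compressed .npz file (placeholder path)
path = os.path.join(tempfile.mkdtemp(), "aligned_series.npz")
scipy.sparse.save_npz(path, matrix)

# Read it back later; the sparse format (CSR here) is preserved
restored = scipy.sparse.load_npz(path)
print(restored.shape, restored.nnz)
```

Only the nonzero values and their index arrays are written, so the file size scales with the number of nonzeros rather than with the full 181-column dense shape.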
