Python - The best way to read a sparse file into a sparse matrix
I would like to know if there is a more efficient way to load file content into a sparse matrix. The following code reads from a big file (8GB) that contains mostly zero values (very sparse), and then does some processing on each line read. I would like to perform arithmetic operations on it efficiently, so I try to store the lines as a sparse matrix. Since the number of lines in the file is not known in advance, and arrays/matrices are not dynamic, I have to first store the lines in a list and then transform it to a csr_matrix. This phase ("X = csr_matrix(X)") takes a lot of time and memory.
Any suggestions?
import numpy as np
from scipy.sparse import csr_matrix
from datetime import datetime as time

header_names = []

def readOppFromFile(filepath):
    print("Read Opportunities From File..." + str(time.now()))
    # read file header - feature names separated with commas
    global header_names
    with open(filepath, "r") as f:
        i = 0
        header_names = f.readline().rstrip().split(',')
        for line in f:
            # replace empty string with 0 in comma-separated string. In addition, clean null values (replace with 0)
            yield [(x.replace('null', '0') if x else 0) for x in line.rstrip().split(',')]
            i += 1
        print("Number of opportunities read from file: %s" % str(i))

def processOpportunities(opp_data):
    print("Process Opportunities ..." + str(time.now()))
    # Initialization
    X = []
    targets_array = []
    global header_names
    for opportunity in opp_data:
        # Extract each opportunity's target variable, save it in a separate list and then remove it
        target = opportunity[-1]  # Only last column
        targets_array.append(target)
        del opportunity[-1]  # Remove last column
        X.append(opportunity)
    print(" Starting to transform to a sparse matrix" + str(time.now()))
    X = csr_matrix(X)
    print("Finished transform to a sparse matrix " + str(time.now()))
    # The target variable of each impression
    targets_array = np.array(targets_array, dtype=int)
    print("targets_array" + str(time.now()))
    return X, targets_array

def main():
    print("START -----> " + str(time.now()))
    running_time = time.now()
    opps_data = readOppFromFile(inputfilename)
    features, target = processOpportunities(opps_data)

if __name__ == '__main__':
    """ ################### GLOBAL VARIABLES ############################ """
    inputfilename = 'C:/somefolder/trainingset.working.csv'
    """ ################### START PROGRAM ############################ """
    main()
Update: The dimensions of the matrix are not constant; they depend on the input file and may change on each run of the program. For a small sample of my input, see here.
You can construct a sparse matrix directly if you keep track of the nonzero entries manually:
from scipy.sparse import coo_matrix

X_data = []
X_row, X_col = [], []
targets_array = []
for row_idx, opportunity in enumerate(opp_data):
    targets_array.append(int(opportunity[-1]))
    # list(...) keeps this working on Python 3, where map returns an iterator
    row = np.array(list(map(int, opportunity[:-1])))
    col_inds, = np.nonzero(row)
    X_col.extend(col_inds)
    X_row.extend([row_idx]*len(col_inds))
    X_data.extend(row[col_inds])
print(" Starting to transform to a sparse matrix" + str(time.now()))
X = coo_matrix((X_data, (X_row, X_col)), dtype=int)
print("Finished transform to a sparse matrix " + str(time.now()))
This constructs the matrix in COO format, which is easy to transform into whatever format you like:
X = X.tocsr()
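For illustration, here is a self-contained sketch of the same idea that parses the text lines directly and stores only the nonzero entries. The helper name load_sparse_csv is hypothetical, and the sketch assumes integer values, an integer target in the last column, and empty or "null" fields standing for zero. It also passes shape explicitly, because coo_matrix otherwise infers the dimensions from the largest row and column index it sees, which would silently drop trailing all-zero rows or columns.

```python
import io

import numpy as np
from scipy.sparse import coo_matrix


def load_sparse_csv(lines):
    """Hypothetical helper: parse CSV text lines into a CSR matrix plus targets.

    Assumes the first line is a header, the last column is an integer target,
    and empty or 'null' fields mean 0.
    """
    lines = iter(lines)
    header = next(lines).rstrip().split(',')
    n_cols = len(header) - 1  # every column except the target

    data, rows, cols, targets = [], [], [], []
    n_rows = 0
    for line in lines:
        fields = line.rstrip().split(',')
        targets.append(int(fields[-1]))
        # Record only the nonzero entries of this row.
        for col_idx, field in enumerate(fields[:-1]):
            if field and field != 'null':
                value = int(field)
                if value != 0:
                    data.append(value)
                    rows.append(n_rows)
                    cols.append(col_idx)
        n_rows += 1

    # Passing shape explicitly keeps trailing all-zero rows and columns.
    X = coo_matrix((data, (rows, cols)), shape=(n_rows, n_cols), dtype=int)
    return X.tocsr(), np.array(targets, dtype=int), header[:-1]


# Tiny made-up sample standing in for the real file.
sample = io.StringIO(u"f1,f2,f3,target\n0,5,,1\nnull,0,0,0\n3,0,0,1\n")
X, y, names = load_sparse_csv(sample)
print(X.shape, X.nnz)
```

For the real file, the same function could be called with an open file object instead of the in-memory sample, since it only needs an iterable of lines. Memory use then grows with the number of nonzero entries rather than with the full row width.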