I have a list of ints and a sparse matrix (shown here as a list of its rows):
list_index = [1,1,2,3,3,4,4,5]
and
matrix_user = [sparse1, sparse2, sparse3, sparse4, sparse5, sparse6]
I want a list of sublists, each made of a list of ints and the matching rows of the sparse matrix:
[ [[1,1,2,3,3], [sparse1, sparse2, sparse3, sparse4]] ,
[[4,4,5], [sparse5, sparse6]] ,
... ,
]
of length ~90 (to be processed in parallel later on), such that no index value
appears in more than one sublist[0].
To cut the two inputs into 90 sections I do the following:
# cut the data into chunks to run in parallel
list_index = dfuser['idx'].tolist()
# convert to CSR once so that row slicing is cheap
matrix_user = encoder.fit_transform(dfuser[['col1','col2']].values).tocsr()
sizechunk = 90  # number of chunks
sizelist = len(list_index) // sizechunk  # rows per chunk
if len(list_index) % sizechunk != 0:
    sizelist += 1
list_all = []
for i in range(sizechunk):
    start, stop = i*sizelist, (i+1)*sizelist
    if start >= len(list_index):
        break
    # slicing past the end is safe, so the last chunk needs no special case
    list_all.append([list_index[start:stop], matrix_user[start:stop]])
This gives me a list of up to 90 chunks, where the same index value (3 here) can end up in two consecutive chunks:
[ [[1,1,2,3],[sparse1, sparse2, sparse3]] ,
[[3,4,4,5],[sparse4, sparse5, sparse6]] ,
... ,
]
Then I filter so that no index value is shared between two sublists:
import scipy.sparse as sp

i = 0
size_list = len(list_all)
while i < size_list - 1:
    last_elem = list_all[i][0][-1]
    first_elem = list_all[i+1][0][0]
    # move leading rows of the next chunk back while they share the last index value
    while first_elem == last_elem:
        first_sparse = list_all[i+1][1][0]  # current first row of the next chunk
        list_all[i][0].append(first_elem)
        list_all[i][1] = sp.vstack((list_all[i][1], first_sparse))
        list_all[i+1][0] = list_all[i+1][0][1:]
        list_all[i+1][1] = list_all[i+1][1][1:]
        if len(list_all[i+1][0]) == 0:
            del list_all[i+1]  # drop the emptied chunk
            size_list -= 1
            if i+1 == size_list:
                break
        first_elem = list_all[i+1][0][0]
    i += 1
It works, but since I have a lot of input (~18 million entries), it takes 6 hours!
I need my program to run in less than 2 hours, as it is called several times a day. Is there a Python function that can cut my two inputs based on the runs of equal values in the first list?
Thank you for your help!
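import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
list_index = [0,0,1,2,2,3,4,5,5,6,6,7,7,7,8]
arr = np.random.random(size=(len(list_index), 5))
arr[arr < .7] = 0
matrix_user = csr_matrix(arr)
chunksize = 4
To view the matrix you can use:
# pd.SparseDataFrame was removed in pandas 1.0; the sparse accessor is the replacement
print(pd.DataFrame.sparse.from_spmatrix(matrix_user))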
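A possible alternative to the filter loop, as a sketch rather than a definitive answer: a lot of the six hours probably goes into calling sp.vstack and re-slicing for every moved row, so it may help to compute all cut positions first and slice each chunk exactly once. The helper name split_on_group_boundaries is made up for this illustration, and the sketch assumes that equal values in list_index are always adjacent, as in the example data.

```python
import numpy as np
from scipy.sparse import csr_matrix


def split_on_group_boundaries(list_index, matrix, n_chunks):
    """Split into at most n_chunks pieces without separating equal adjacent values.

    Hypothetical helper; assumes equal values in list_index are contiguous.
    """
    idx = np.asarray(list_index)
    n = len(idx)
    if n == 0:
        return []
    # Row positions where a new index value begins (position 0 is always included).
    starts = np.flatnonzero(np.r_[True, idx[1:] != idx[:-1]])
    # Evenly spaced "naive" cut points, excluding both ends.
    naive = np.linspace(0, n, n_chunks + 1)[1:-1]
    # Snap every naive cut forward to the next position where the value changes.
    pos = np.searchsorted(starts, naive, side="left")
    cuts = starts[pos[pos < len(starts)]]
    # Deduplicate cuts that snapped to the same place and add the two ends.
    bounds = np.unique(np.r_[0, cuts, n]).tolist()
    matrix = matrix.tocsr()  # CSR supports efficient row slicing
    return [[idx[a:b].tolist(), matrix[a:b]]
            for a, b in zip(bounds[:-1], bounds[1:])]


# Example data from the question, with a seeded generator for repeatability
list_index = [0, 0, 1, 2, 2, 3, 4, 5, 5, 6, 6, 7, 7, 7, 8]
rng = np.random.default_rng(0)
arr = rng.random(size=(len(list_index), 5))
arr[arr < .7] = 0
matrix_user = csr_matrix(arr)

chunks = split_on_group_boundaries(list_index, matrix_user, 4)
print([c[0] for c in chunks])
# [[0, 0, 1, 2, 2], [3, 4, 5, 5], [6, 6, 7, 7, 7], [8]]
```

Each chunk is sliced from the original matrix once, so the cost grows roughly linearly with the number of rows instead of with the number of rows moved between chunks.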
After many improvements I found a solution: instead of cutting the two inputs into 90 sections and then filtering so that the sublists share no index value, I first find the row where each distinct value of list_index starts, then cut at those positions into about 90 chunks.
# convert to CSR once so that row slicing is cheap
matrix_user = encoder.fit_transform(dfuser[['col1','col2']].values).tocsr()
# first row position of each distinct idx value (assumes dfuser has the default 0..n-1 index)
list_part_index = []
list_unique = list(dfuser.idx.unique())
for elem in list_unique:
    list_part_index.append(dfuser[dfuser['idx']==elem].index[0])
# number of distinct values per chunk (at least 1, so the range step is never 0)
nb_jump = max(1, len(list_unique) // 90)
list_index = dfuser['idx'].tolist()
list_all = []
last_elem = list_part_index[0]
for elem in range(0, len(list_part_index), nb_jump):
    if list_part_index[elem] > 0:
        list_all.append([list_index[last_elem:list_part_index[elem]], matrix_user[last_elem:list_part_index[elem]]])
        # the next chunk starts at the row position of this cut, not at the loop counter
        last_elem = list_part_index[elem]
list_all.append([list_index[last_elem:], matrix_user[last_elem:]])
My program now runs in 22 minutes!
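The loop over list_unique filters the whole DataFrame once per distinct value, which is likely where a good part of the remaining time goes. As a hedged sketch, np.unique with return_index=True returns the first position of every distinct value in a single pass. chunk_by_unique_values is a hypothetical helper name, and the sketch assumes that equal values in list_index are adjacent. Like the code above, it can return somewhat more than n_chunks chunks because of the integer division.

```python
import numpy as np
from scipy.sparse import csr_matrix


def chunk_by_unique_values(index_values, matrix, n_chunks=90):
    """Cut at the first row of every nb_jump-th distinct value.

    Hypothetical helper; assumes equal values in index_values are contiguous.
    """
    idx = np.asarray(index_values)
    # First row position of each distinct value, computed in one pass.
    _, first_pos = np.unique(idx, return_index=True)
    first_pos.sort()  # np.unique orders by value; restore row order
    # Number of distinct values per chunk (at least 1).
    nb_jump = max(1, len(first_pos) // n_chunks)
    bounds = first_pos[::nb_jump].tolist() + [len(idx)]
    matrix = matrix.tocsr()  # CSR supports efficient row slicing
    return [[idx[a:b].tolist(), matrix[a:b]]
            for a, b in zip(bounds[:-1], bounds[1:])]


# Small made-up input in the shape of the question's example
demo_index = [0, 0, 1, 2, 2, 3, 4, 5, 5, 6, 6, 7, 7, 7, 8]
demo_matrix = csr_matrix(np.arange(len(demo_index) * 2).reshape(-1, 2))

demo_chunks = chunk_by_unique_values(demo_index, demo_matrix, n_chunks=4)
print([c[0] for c in demo_chunks])
# [[0, 0, 1], [2, 2, 3], [4, 5, 5], [6, 6, 7, 7, 7], [8]]
```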