Efficiently constructing sparse biadjacency matrix in Numpy

I'm trying to load this CSV file into a sparse numpy matrix, which would represent the biadjacency matrix of this user-to-subreddit bipartite graph: http://figshare.com/articles/reddit_user_posting_behavior/874101

Here's a sample:

603,politics,trees,pics
604,Metal,AskReddit,tattoos,redditguild,WTF,cocktails,pics,funny,gaming,Fitness,mcservers,TeraOnline,GetMotivated,itookapicture,Paleo,trackers,Minecraft,gainit
605,politics,IAmA,AdviceAnimals,movies,smallbusiness,Republican,todayilearned,AskReddit,WTF,IWantOut,pics,funny,DIY,Frugal,relationships,atheism,Jeep,Music,grandrapids,reddit.com,videos,yoga,GetMotivated,bestof,ShitRedditSays,gifs,technology,aww

There are 876,961 lines (one per user) and 15,122 subreddits and a total of 8,495,597 user-to-subreddit associations.

Here's the code which I have right now, and which takes 20 minutes to run on my MacBook Pro:

import numpy as np
from scipy.sparse import csr_matrix 

row_list = []
entry_count = 0
all_reddits = set()
with open("reddit_user_posting_behavior.csv", 'r') as f:
    for x in f:
        pieces = x.rstrip().split(",")
        user = pieces[0]
        reddits = pieces[1:]
        entry_count += len(reddits)
        for r in reddits: all_reddits.add(r)
        row_list.append(np.array(reddits))

reddits_list = np.array(list(all_reddits))

# 5s to get this far

rows = np.zeros((entry_count,))
cols = np.zeros((entry_count,))
data =  np.ones((entry_count,))
i=0
user_idx = 0
for row in row_list:
    for reddit_idx in np.nonzero(np.in1d(reddits_list,row))[0]:
        cols[i] = user_idx
        rows[i] = reddit_idx
        i+=1
    user_idx+=1
adj = csr_matrix( (data,(rows,cols)), shape=(len(reddits_list), len(row_list)) )

It seems hard to believe that this is as fast as this can go... Loading the 82MB file into a list of lists takes 5s, but building out the sparse matrix takes 200 times that. What can I do to speed this up? Is there some file format that I can convert this CSV into in less than 20min that would import more quickly? Is there some obviously-expensive operation I'm doing here that's not good? I've tried building a dense matrix, and I've tried creating a lil_matrix and a dok_matrix and assigning the 1's one at a time, and that's no faster.
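For reference, the dok_matrix attempt was along these lines (a rough sketch, reusing reddits_list and row_list from the code above; the exact indexing may have differed):

from scipy.sparse import dok_matrix

reddit_index = {r: idx for idx, r in enumerate(reddits_list)}
adj = dok_matrix((len(reddits_list), len(row_list)), dtype=int)
for user_idx, row in enumerate(row_list):
    for r in row:
        # one Python-level dict insertion per nonzero entry
        adj[reddit_index[r], user_idx] = 1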

Couldn't sleep, tried one last thing... I was able to get it down to 10 seconds this way, in the end:

import numpy as np
from scipy.sparse import csr_matrix 

user_ids = []
subreddit_ids = []
subreddits = {}
i=0
with open("reddit_user_posting_behavior.csv", 'r') as f:
    for line in f:
        for sr in line.rstrip().split(",")[1:]: 
            if sr not in subreddits: 
                subreddits[sr] = len(subreddits)
            user_ids.append(i)
            subreddit_ids.append(subreddits[sr])
        i+=1

adj = csr_matrix( 
    ( np.ones((len(user_ids),)), (np.array(subreddit_ids), np.array(user_ids)) ), 
    shape=(len(subreddits), i) )
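A quick sanity check on the result (a sketch, assuming the variables above are still in scope; the expected counts are the ones quoted earlier, and the nnz figure assumes no duplicate user/subreddit pairs in the file):

print(adj.shape)     # expected (15122, 876961): subreddits x users
print(adj.nnz)       # expected 8495597 stored entries

# subreddits maps a subreddit name to its row index
politics_row = adj[subreddits["politics"], :]
print(politics_row.nnz)   # number of users who posted in politics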

For a start you could replace the inner for loop with something like:

reddit_idx = np.nonzero(np.in1d(reddits_list,row))[0]
sl = slice(i,i+len(reddit_idx))
cols[sl] = user_idx
rows[sl] = reddit_idx
i = sl.stop

The use of nonzero(in1d()) to find the matches looks good, but I haven't explored alternatives. An alternative to assignment via slices is to extend lists, but that is probably slower, especially with many rows.
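The list-extend version mentioned above would look roughly like this (a sketch, reusing np, csr_matrix, reddits_list and row_list from the question; it swaps the preallocated arrays for plain Python lists that grow as you go):

rows_l = []
cols_l = []
for user_idx, row in enumerate(row_list):
    reddit_idx = np.nonzero(np.in1d(reddits_list, row))[0]
    rows_l.extend(reddit_idx)
    cols_l.extend([user_idx] * len(reddit_idx))

data = np.ones(len(rows_l))
adj = csr_matrix((data, (np.array(rows_l), np.array(cols_l))),
                 shape=(len(reddits_list), len(row_list)))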

Constructing the rows and cols arrays is by far the slowest part. The call to csr_matrix is minor.

Since there are a lot more rows (users) than subreddits, it might be worth collecting, for each subreddit, a list of user ids. You are already collecting subreddits in a set. You could, instead, collect them in a default dictionary, and build the matrix from that. When tested on your 3 lines replicated 100,000 times it is noticeably faster.

from collections import defaultdict
from scipy import sparse

red_dict = defaultdict(list)
user_idx = 0
with open("reddit_user_posting_behavior.csv", 'r') as f:
    for x in f:
        pieces = x.rstrip().split(",")
        user = pieces[0]
        reddits = pieces[1:]
        for r in reddits:
            red_dict[r] += [user_idx]
        user_idx += 1

print('done 2nd')
x = red_dict.values()
adj1 = sparse.lil_matrix((len(x), user_idx), dtype=int)
for i, j in enumerate(x):
    adj1[i, j] = 1
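If the result is then needed in CSR form, as in the question, the final conversion is cheap compared to the construction itself (a sketch):

adj1_csr = adj1.tocsr()
# should report the same number of stored entries as the other approaches
print(adj1_csr.shape, adj1_csr.nnz)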
