
Is it possible to translate this Python code to Cython?

I'm actually looking to speed up #2 of this code by as much as possible, so I thought that it might be useful to try Cython. However, I'm not sure how to implement a sparse matrix in Cython. Can somebody show how to / whether it's possible to wrap it in Cython, or perhaps Julia, to make it faster?

#1) This part computes the u_dict dictionary, filling it with unique strings and then enumerating them.

import scipy.sparse as sp
import numpy as np
from scipy.sparse import csr_matrix

full_dict = set(train1.values.ravel().tolist() + test1.values.ravel().tolist() + train2.values.ravel().tolist() + test2.values.ravel().tolist())
print(len(full_dict))
u_dict = dict()
for i, q in enumerate(full_dict):
    u_dict[q] = i


shape = (len(full_dict), len(full_dict))
H = sp.lil_matrix(shape, dtype=np.int8)


def load_sparse_csr(filename):
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])

#2) I need to speed up this part
# train_full is pandas dataframe with two columns w1 and w2 filled with strings

H = load_sparse_csr('matrix.npz')

correlation_train = []
for idx, row in train_full.iterrows():
    if idx % 1000 == 0: print(idx)
    id_1 = u_dict[row['w1']]
    id_2 = u_dict[row['w2']]
    a_vec = H[id_1].toarray() # these vectors are of length of < 3 mil.
    b_vec = H[id_2].toarray()
    correlation_train.append(np.corrcoef(a_vec, b_vec)[0][1])

While I contributed to How to properly pass a scipy.sparse CSR matrix to a cython function? quite some time ago, I doubt that cython is the way to go here, especially if you don't already have experience with numpy and cython. cython gives the biggest speedup when you replace iterative calculations with code that it can translate to C without calling numpy or other python code. Throw pandas into the mix and you have an even bigger learning curve.

And important parts of sparse code are already written with cython.

Without touching the cython issue, I see a couple of problems.

H is defined twice:

H = sp.lil_matrix(shape, dtype=np.int8)
H = load_sparse_csr('matrix.npz')

That's either an oversight, or a failure to understand how Python variables are created and assigned. The second assignment replaces the first; thus the first does nothing. In addition, the first just makes an empty lil matrix. Such a matrix could be filled iteratively; while not fast, that is the intended use of the lil format.
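For illustration, the intended incremental use of a lil matrix looks roughly like this (a minimal sketch with toy indices, not the questioner's data):

```python
import numpy as np
import scipy.sparse as sp

shape = (5, 5)
H = sp.lil_matrix(shape, dtype=np.int8)   # starts empty: all zeros, no stored values

# lil is the format designed for element-by-element assignment
H[0, 1] = 1
H[3, 2] = 2

# once filling is done, convert to csr for fast arithmetic and row access
H = H.tocsr()
```

If you never fill the matrix this way, the `sp.lil_matrix(...)` line can simply be deleted.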

The second expression creates a new matrix from data saved in an npz file. That involves the numpy npz file loader as well as the basic csr matrix creation code. And since the attributes are already in csr format, there's nothing for cython to touch.
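To make that concrete, here is a tiny sketch (toy values, not the saved file) of what the `csr_matrix((data, indices, indptr), shape=...)` call in `load_sparse_csr` reconstructs:

```python
import numpy as np
from scipy.sparse import csr_matrix

# csr triplet form: data holds the nonzero values, indices the column of each
# value, and indptr the boundaries of each row within data/indices
data = np.array([1, 2, 3], dtype=np.int8)
indices = np.array([0, 2, 1])
indptr = np.array([0, 2, 2, 3])   # row 0 has 2 values, row 1 none, row 2 one
M = csr_matrix((data, indices, indptr), shape=(3, 3))
```

Because the npz file stores exactly these three arrays, loading is just array reads plus this constructor; there is no per-element Python loop to speed up.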

You do have an iteration here - but over a Pandas dataframe:

for idx, row in train_full.iterrows():
    id_1 = u_dict[row['w1']]
    a_vec = H[id_1].toarray()

Looks like you are picking a particular row of H based on a dictionary look-up. Sparse matrix indexing is slow compared to dense matrix indexing. That is, if Ha = H.toarray() fits in your memory, then

a_vec = Ha[id_1,:]

will be a lot faster.
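A sketch of that one-time conversion, using a small random sparse matrix as a stand-in for H:

```python
import numpy as np
import scipy.sparse as sp

# stand-in for the real H loaded from matrix.npz
H = sp.random(1000, 1000, density=0.01, format='csr', random_state=0)

# pay the conversion cost once, outside the loop
Ha = H.toarray()

i = 42
row_sparse = H[i].toarray().ravel()  # slow path: sparse indexing + conversion per row
row_dense = Ha[i, :]                 # fast path: plain ndarray slice
assert np.allclose(row_sparse, row_dense)
```

The two lookups return the same values; only the cost per iteration differs, and inside a loop over hundreds of thousands of dataframe rows that difference dominates.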

Faster selection of rows (or columns) from a sparse matrix has been asked about before. If you could work directly with the sparse data of a row, I could recommend something more direct. But you want a dense array that you can pass to np.corrcoef, so we'd have to implement the toarray step as well.
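If memory rules out a full dense copy, one alternative (a sketch, not part of the original answer) is to skip toarray entirely and compute Pearson's r from the sparse row data using the raw-score form of the formula, since all the needed sums ignore the zeros anyway:

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_row_corr(H, i, j):
    """Pearson correlation of rows i and j of a sparse matrix,
    computed from the stored values alone (no per-row toarray)."""
    x, y = H.getrow(i), H.getrow(j)
    n = H.shape[1]
    sx, sy = x.sum(), y.sum()
    sxx = x.multiply(x).sum()
    syy = y.multiply(y).sum()
    sxy = x.multiply(y).sum()
    num = n * sxy - sx * sy
    den = np.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return num / den

# sanity check against the dense np.corrcoef on a toy matrix
H = csr_matrix(np.array([[1, 0, 2, 0], [0, 3, 0, 4]], dtype=float))
r = sparse_row_corr(H, 0, 1)
```

This trades one np.corrcoef call for a handful of sparse reductions per pair; whether it wins depends on row density, so it is worth timing on the real data.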

How to read/traverse/slice Scipy sparse matrices (LIL, CSR, COO, DOK) faster?
