python: Convert a list of (string) sets to a scipy csr_matrix

Question

Suppose I have this list of sets:

db = [{"bread", "butter", "milk"}, 
      {"eggs", "milk", "yogurt"},
      {"bread", "cheese", "eggs", "milk"}, 
      {"eggs", "milk", "yogurt"},
      {"cheese", "milk", "yogurt"}]

How do I convert this into a scipy sparse csr_matrix? The expected output is the following:

[[1., 1., 0., 0., 1., 0.],
 [0., 0., 0., 1., 1., 1.],
 [1., 0., 1., 1., 1., 0.],
 [0., 0., 0., 1., 1., 1.],
 [0., 0., 1., 0., 1., 1.]]

I tried hardcoding it so I could understand it better, but I can't figure it out. My code is:

indptr = np.array([0, 3, 6, 10, 13, 16])
data = np.array(["bread", "butter", "milk", "eggs", "milk", "yogurt",
                "bread", "cheese", "eggs", "milk","eggs", "milk", "yogurt",
                "cheese", "milk", "yogurt"])
indices = np.array([0, 1, 4, 3, 4, 5, 0, 2, 3, 4, 3, 4, 5, 2, 4, 5])
csr_matrix((data, indices, indptr), dtype=int).toarray()

I can't seem to make it work. Is there a better way of implementing this?

Answer 1

Setup:

import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix

db = [{"bread", "butter", "milk"}, 
      {"eggs", "milk", "yogurt"},
      {"bread", "cheese", "eggs", "milk"}, 
      {"eggs", "milk", "yogurt"},
      {"cheese", "milk", "yogurt"}]

# Collect every distinct product, then sort for a stable column order
all_products = set()
for products in db:
    all_products |= products
sorted_products = sorted(all_products)

Method 2 (no pandas):

First, build a translator that maps each product to a column index:

d = dict()
for i, prod in enumerate(sorted_products):
    d[prod] = i

{'bread': 0, 'butter': 1, 'cheese': 2, 'eggs': 3, 'milk': 4, 'yogurt': 5}

Then, create a full (dense) matrix and populate it:

template = np.zeros((len(db), len(all_products)), dtype=int)
for j, line in enumerate(db):
    for prod in line:
        template[j, d[prod]] = 1

array([[1, 1, 0, 0, 1, 0],
       [0, 0, 0, 1, 1, 1],
       [1, 0, 1, 1, 1, 0],
       [0, 0, 0, 1, 1, 1],
       [0, 0, 1, 0, 1, 1]])

Lastly, convert it to a sparse matrix:

matrix = csr_matrix(template)

  (0, 0)    1
  (0, 1)    1
  (0, 4)    1
  (1, 3)    1
  (1, 4)    1
  (1, 5)    1
  (2, 0)    1
  (2, 2)    1
  (2, 3)    1
  (2, 4)    1
  (3, 3)    1
  (3, 4)    1
  (3, 5)    1
  (4, 2)    1
  (4, 4)    1
  (4, 5)    1

#<5x6 sparse matrix of type '<class 'numpy.longlong'>'
#   with 16 stored elements in Compressed Sparse Row format>
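The attempt in the question fails because `data` holds the product names; in the CSR format, `data` must hold the numeric values of the non-zero entries (all ones here), while `indices` holds their column numbers and `indptr` marks where each row starts. A minimal sketch of building the matrix directly, without the dense intermediate array, could look like this (the name `col` and the per-row sorting of column indices are choices made for this illustration, not part of the original answer):

```python
import numpy as np
from scipy.sparse import csr_matrix

db = [{"bread", "butter", "milk"},
      {"eggs", "milk", "yogurt"},
      {"bread", "cheese", "eggs", "milk"},
      {"eggs", "milk", "yogurt"},
      {"cheese", "milk", "yogurt"}]

# Map each product to a column index, in sorted order
col = {prod: i for i, prod in enumerate(sorted(set().union(*db)))}

indices = []   # column index of every non-zero entry, row by row
indptr = [0]   # indptr[k] is where row k starts inside `indices`
for row in db:
    indices.extend(sorted(col[prod] for prod in row))
    indptr.append(len(indices))

indices = np.array(indices)
indptr = np.array(indptr)
data = np.ones(len(indices), dtype=int)  # the stored values are all 1

matrix = csr_matrix((data, indices, indptr), shape=(len(db), len(col)))
print(matrix.toarray())
```

For this `db`, `indptr` and `indices` come out the same as the hand-written arrays in the question; only `data` differs.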

Method 1 (pandas):

df = pd.DataFrame(index=sorted_products, columns=range(len(db)))
print(df)

This gives you an empty DataFrame:

          0       1       2       3       4
bread   NaN     NaN     NaN     NaN     NaN
butter  NaN     NaN     NaN     NaN     NaN
cheese  NaN     NaN     NaN     NaN     NaN
eggs    NaN     NaN     NaN     NaN     NaN
milk    NaN     NaN     NaN     NaN     NaN
yogurt  NaN     NaN     NaN     NaN     NaN

Then you add the sets, one column per set:

for i in range(len(db)):
    df[i] = pd.Series([1]*len(db[i]), index=list(db[i]))

          0       1       2       3       4
bread   1.0     NaN     1.0     NaN     NaN
butter  1.0     NaN     NaN     NaN     NaN
cheese  NaN     NaN     1.0     NaN     1.0
eggs    NaN     1.0     1.0     1.0     NaN
milk    1.0     1.0     1.0     1.0     1.0
yogurt  NaN     1.0     NaN     1.0     1.0

Next, fill the NaN values with zeros:

data = df.fillna(0)

Finally, convert it to a sparse matrix:

from scipy.sparse import csr_matrix
matrix = csr_matrix(data)
print(matrix)

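Output (products are the rows here, so the matrix is 6x5; pass data.T instead of data to get the 5x6 layout from the question):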

#<6x5 sparse matrix of type '<class 'numpy.float64'>'
#   with 16 stored elements in Compressed Sparse Row format>
  (0, 0)    1.0
  (0, 2)    1.0
  (1, 0)    1.0
  (2, 2)    1.0
  (2, 4)    1.0
  (3, 1)    1.0
  (3, 2)    1.0
  (3, 3)    1.0
  (4, 0)    1.0
  (4, 1)    1.0
  (4, 2)    1.0
  (4, 3)    1.0
  (4, 4)    1.0
  (5, 1)    1.0
  (5, 3)    1.0
  (5, 4)    1.0