SageMaker: from RecordIO back to a sparse matrix

Question

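While preparing training data for SageMaker's Factorization Machines implementation, I successfully used the function write_spmatrix_to_sparse_tensor (source) to convert my data from a sparse matrix into the recordio format that the implementation expects.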

I have limited the import statements to those needed by the example function:

import os
import io
import boto3
import sagemaker.amazon.common as smac

def write_recordio(array, y, prefix, f):
    # Convert to record protobuf
    buf = io.BytesIO()
    smac.write_spmatrix_to_sparse_tensor(array=array, file=buf, labels=y)
    buf.seek(0)

    fname = os.path.join(prefix, f)
    boto3.Session().resource('s3').Bucket('bucket_name').Object(fname).upload_fileobj(buf)

A sample snippet of the array argument, which holds the features:

   (0, 990290)  1.0
   (0, 1266265) 1.0
   (1, 560338)  1.0
   (1, 1266181) 1.0
   (2, 182872)  1.0
   (2, 1266205) 1.0
   ...

A sample of y, which holds my targets:

[1. 1. 1. ... 3. 1. 5.]

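write_spmatrix_to_sparse_tensor works as expected with the function and inputs above. After training my model, I used SageMaker batch transform and received a .out file containing many outputs of type <class 'record_pb2.Record'>.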

Examples:

One record from the output of write_spmatrix_to_sparse_tensor:

features {
  key: "values"
  value {
    float32_tensor {
      values: 1.0
      values: 1.0
      keys: 990290
      keys: 1266265
      shape: 1266394
    }
  }
}
label {
  key: "values"
  value {
    float32_tensor {
      values: 1.0
    }
  }
}
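
Reading the dump above: each record holds one row of the matrix. `keys` are the column indices of the nonzero entries, `values` are the entries themselves, and `shape` is the number of columns. The sketch below illustrates that mapping. It uses a hypothetical stand-in object (`SimpleNamespace`) in place of a real `Record` so it runs without the SageMaker SDK, and the helper name `row_entries` is illustrative rather than part of any library. It also assumes the matrix was written as float32, so the data sits in `float32_tensor`.

```python
from types import SimpleNamespace


def row_entries(record, row_idx):
    """Return (n_cols, [(row, col, value), ...]) for one sparse record."""
    tensor = record.features["values"].float32_tensor  # assumes float32 data
    n_cols = int(tensor.shape[0])
    entries = [
        (row_idx, int(col), float(val))
        for col, val in zip(tensor.keys, tensor.values)
    ]
    return n_cols, entries


# Hypothetical stand-in that mirrors the record printed above.
tensor = SimpleNamespace(
    values=[1.0, 1.0], keys=[990290, 1266265], shape=[1266394]
)
record = SimpleNamespace(
    features={"values": SimpleNamespace(float32_tensor=tensor)}
)

n_cols, entries = row_entries(record, row_idx=0)
print(n_cols)   # 1266394
print(entries)  # [(0, 990290, 1.0), (0, 1266265, 1.0)]
```

The two entries match the first two lines of the array snippet shown earlier, (0, 990290) and (0, 1266265).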

One record from the batch transform output (.out) file, which contains many such records:

label {
  key: "score"
  value {
    float32_tensor {
      values: 1.5246734619140625
    }
  }
}

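So now I have a file originally written with write_spmatrix_to_sparse_tensor and the output of transformer.transform, and I want to get from these files back to my original sparse matrix format. Essentially, if a function write_sparse_tensor_to_spmatrix existed, what would it look like?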

Answer 1

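There is probably a better way, but what I learned is to read the values from the output file, change their data type, and reshape them into the format you need. An example of reading the values: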

data.label['score'].float32_tensor.values

Here data is one record from the output file. The result is of type google.protobuf.pyext._message.RepeatedScalarContainer, but you can convert it to a Python list, a NumPy array, or whatever data type suits your model.
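
Building on the answer above, here is a rough sketch of what a reader in the spirit of `write_sparse_tensor_to_spmatrix` might look like. It rests on several assumptions: the records were written from a float32 matrix (so the data sits in `float32_tensor`), each record is one row, and the files are read with `smac.read_records`. All function names are hypothetical. `sagemaker` and `scipy` are imported lazily, so the pure-Python helpers can be tried without either package installed.

```python
def records_to_csr_parts(records):
    """Collect CSR arrays (data, indices, indptr) and the matrix shape."""
    data, indices, indptr = [], [], [0]
    n_cols = 0
    for rec in records:
        tensor = rec.features["values"].float32_tensor  # assumes float32 data
        indices.extend(int(k) for k in tensor.keys)      # column indices
        data.extend(float(v) for v in tensor.values)     # nonzero entries
        indptr.append(len(indices))                      # end of this row
        if len(tensor.shape) > 0:
            n_cols = max(n_cols, int(tensor.shape[0]))
    return data, indices, indptr, (len(indptr) - 1, n_cols)


def label_values(records, key):
    """First label value per record: key 'values' for targets, 'score' for output."""
    return [float(rec.label[key].float32_tensor.values[0]) for rec in records]


def read_spmatrix(path):
    """Read a RecordIO-protobuf file back into a scipy CSR matrix and its labels."""
    import sagemaker.amazon.common as smac
    from scipy.sparse import csr_matrix

    with open(path, "rb") as f:
        records = smac.read_records(f)
    data, indices, indptr, shape = records_to_csr_parts(records)
    matrix = csr_matrix((data, indices, indptr), shape=shape, dtype="float32")
    return matrix, label_values(records, "values")


def read_scores(path):
    """Read the 'score' of every record in a batch transform .out file."""
    import sagemaker.amazon.common as smac

    with open(path, "rb") as f:
        return label_values(smac.read_records(f), "score")
```

Pairing the scores with the rows of the recovered matrix additionally assumes that batch transform keeps the output records in the same order as the input rows, which is worth checking on your own data.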

Solution 1
1 2021-05-08 20:55:11
