AWS SageMaker | How to train on text data | Ticket classification

Question

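I am new to SageMaker and not sure how to classify text input in AWS SageMaker.

Suppose I have a DataFrame with two fields, 'Ticket' and 'Category', both of which are text. I want to split it into training and test sets and upload them to SageMaker to train a model.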

X_train, X_test, y_train, y_test = model_selection.train_test_split(fewRecords['Ticket'],fewRecords['Category'])

Next, since I want to run TF-IDF feature extraction to convert the text into numeric values, I do this:

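from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
# Fit on the ticket text from the training split, not on the Category labels
tfidf_vect.fit(X_train)
xtrain_tfidf = tfidf_vect.transform(X_train)
xvalid_tfidf = tfidf_vect.transform(X_test)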

When I want to upload this to SageMaker, my next step is something like:

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, xtrain_tfidf, y_train)
buf.seek(0)

I get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-36-8055e6cdbf34> in <module>()
      1 buf = io.BytesIO()
----> 2 smac.write_numpy_to_dense_tensor(buf, xtrain_tfidf, y_train)
      3 buf.seek(0)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/amazon/common.py in write_numpy_to_dense_tensor(file, array, labels)
     98             raise ValueError("Label shape {} not compatible with array shape {}".format(
     99                              labels.shape, array.shape))
--> 100         resolved_label_type = _resolve_type(labels.dtype)
    101     resolved_type = _resolve_type(array.dtype)
    102 

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/amazon/common.py in _resolve_type(dtype)
    205     elif dtype == np.dtype('float32'):
    206         return 'Float32'
--> 207     raise ValueError('Unsupported dtype {} on array'.format(dtype))

ValueError: Unsupported dtype object on array

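Apart from this exception, I am not sure whether this approach is correct at all, since TfidfVectorizer converts the Series into a matrix.

The code runs fine on my local machine, but I am not sure how to do the same thing on SageMaker. All of the examples I have found are very lengthy and, for someone who is still getting to grips with scikit-learn, not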

Answer 1

The output of TfidfVectorizer is a scipy sparse matrix, not a plain numpy array.
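A quick way to confirm this is to inspect the result directly. This is a minimal sketch; the sample ticket texts are made up for illustration and are not from the question.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical ticket texts, used only for illustration.
tickets = [
    "cannot log in to portal",
    "refund for cancelled flight",
    "password reset link expired",
]

tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
xtrain_tfidf = tfidf_vect.fit_transform(tickets)

# The transform result is a scipy sparse matrix ...
is_sparse = sparse.issparse(xtrain_tfidf)
print(type(xtrain_tfidf).__name__, is_sparse)

# ... and .toarray() turns it into a regular 2-D numpy array.
dense = xtrain_tfidf.toarray()
print(type(dense).__name__, dense.shape)
```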

So use a different function instead, for example:

write_spmatrix_to_sparse_tensor

"""Writes a sparse matrix to a sparse tensor"""

For more details, see this issue.
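A minimal sketch of the sparse route, assuming `write_spmatrix_to_sparse_tensor` takes the same `(file, array, labels)` arguments as the dense function in the traceback, and assuming the labels are already numeric. The sample data and the helper name `to_sparse_recordio` are hypothetical.

```python
import io

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical tickets and already-numeric labels, for illustration only.
tickets = [
    "cannot log in to portal",
    "refund for cancelled flight",
    "password reset link expired",
]
labels = np.array([0, 1, 0], dtype='float32')

tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
# Cast to float32, one of the dtypes that _resolve_type in the traceback accepts.
features = tfidf_vect.fit_transform(tickets).astype('float32')


def to_sparse_recordio(features, labels):
    """Serialize a scipy sparse matrix and its labels into an in-memory buffer."""
    # Imported here so the data preparation above runs without the SageMaker SDK.
    import sagemaker.amazon.common as smac

    buf = io.BytesIO()
    smac.write_spmatrix_to_sparse_tensor(buf, features, labels)
    buf.seek(0)
    return buf
```

On a notebook instance with the SageMaker SDK installed, `buf = to_sparse_recordio(features, labels)` should give a buffer that can be uploaded in the same way as the dense one.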

Alternatively, convert the TfidfVectorizer output to a dense numpy array first, and then use your code from above:

xtrain_tfidf =  tfidf_vect.transform(X_train).toarray()   
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, xtrain_tfidf, y_train)
...
...
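Note that the traceback in the question is raised from `_resolve_type(labels.dtype)`, which suggests the string `Category` labels (dtype `object`) are rejected as well, independently of the sparse/dense issue. A minimal sketch of encoding them as numbers, assuming integer-coded classes suit the algorithm you train with; the sample categories are made up.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for y_train: string categories stored with dtype object.
y_train = np.array(["billing", "login", "billing", "refund"], dtype=object)

encoder = LabelEncoder()
# fit_transform maps each distinct category to an integer code,
# then the codes are cast to float32 for the write function.
y_train_encoded = encoder.fit_transform(y_train).astype('float32')

print(y_train.dtype, '->', y_train_encoded.dtype)

# The encoder can map numeric codes back to the category names later,
# for example to interpret predictions.
recovered = encoder.inverse_transform(y_train_encoded.astype(int))
print(list(recovered))
```

Passing `y_train_encoded` instead of `y_train` as the labels argument should then avoid the `Unsupported dtype object` error.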


Solution 1
1 Accepted 2018-08-29 11:06:41
