
AWS SageMaker | How to train text data | For ticket classification

I am new to SageMaker and not sure how to classify text input in AWS SageMaker.

Suppose I have a DataFrame with two fields, 'Ticket' and 'Category'. Both are text. I want to split it into training and test sets and upload them to SageMaker to train a model.

X_train, X_test, y_train, y_test = model_selection.train_test_split(fewRecords['Ticket'],fewRecords['Category'])

Next I want to perform TF-IDF feature extraction to convert the text to numeric values, so I do this:

tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(fewRecords['Category'])
xtrain_tfidf =  tfidf_vect.transform(X_train)
xvalid_tfidf =  tfidf_vect.transform(X_test)

When I then try to prepare the data for upload to SageMaker with the next step, like this:

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, xtrain_tfidf, y_train)
buf.seek(0)

I get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-36-8055e6cdbf34> in <module>()
      1 buf = io.BytesIO()
----> 2 smac.write_numpy_to_dense_tensor(buf, xtrain_tfidf, y_train)
      3 buf.seek(0)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/amazon/common.py in write_numpy_to_dense_tensor(file, array, labels)
     98             raise ValueError("Label shape {} not compatible with array shape {}".format(
     99                              labels.shape, array.shape))
--> 100         resolved_label_type = _resolve_type(labels.dtype)
    101     resolved_type = _resolve_type(array.dtype)
    102 

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/amazon/common.py in _resolve_type(dtype)
    205     elif dtype == np.dtype('float32'):
    206         return 'Float32'
--> 207     raise ValueError('Unsupported dtype {} on array'.format(dtype))

ValueError: Unsupported dtype object on array

Apart from this exception, I am not sure whether this is the right approach, since TfidfVectorizer converts the Series to a matrix.

The code predicts fine on my local machine, but I am not sure how to do the same on SageMaker. All the examples there are too lengthy and not aimed at someone who has only got as far as scikit-learn.

The output of TfidfVectorizer is a scipy sparse matrix, not a simple numpy array.

So either use a different function, such as:

write_spmatrix_to_sparse_tensor

"""Writes a scipy sparse matrix to a sparse tensor"""

See this issue for more details.
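
A minimal sketch of the sparse route, assuming the SageMaker Python SDK exposes write_spmatrix_to_sparse_tensor in sagemaker.amazon.common (the module imported as smac in the question). The tickets and labels values are made-up sample data, and the numeric labels stand in for an already-encoded Category column:

```python
import io

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample data standing in for fewRecords['Ticket'] and an
# already-encoded Category column (numeric, not text).
tickets = ["cannot log in to portal", "invoice amount is wrong", "password reset fails"]
labels = np.array([0, 1, 0], dtype=np.float32)

tfidf_vect = TfidfVectorizer(analyzer="word", token_pattern=r"\w{1,}", max_features=5000)
# fit_transform returns a scipy sparse matrix; cast it to float32.
xtrain_tfidf = tfidf_vect.fit_transform(tickets).astype(np.float32)

buf = io.BytesIO()
try:
    import sagemaker.amazon.common as smac
except ImportError:
    smac = None  # SDK not installed; nothing is written to buf

if smac is not None:
    # Takes the sparse matrix directly, so no .toarray() is needed.
    smac.write_spmatrix_to_sparse_tensor(buf, xtrain_tfidf, labels)
    buf.seek(0)
```

Keeping the matrix sparse avoids materialising up to 5000 mostly-zero columns per ticket, which may matter once the dataset grows.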

Or first convert the output of TfidfVectorizer to a dense numpy array and then use your code above:

xtrain_tfidf =  tfidf_vect.transform(X_train).toarray()   
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, xtrain_tfidf, y_train)
...
...
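
Note that the traceback in the question is raised from _resolve_type(labels.dtype), so the text labels in y_train (dtype object) look like a problem as well. A minimal sketch, assuming scikit-learn's LabelEncoder is an acceptable way to turn the Category strings into numbers. The sample data is made up, the float32 casts are an assumption about what the training algorithm expects, and the vectorizer is fitted on the ticket text rather than on Category:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for X_train and y_train from the question.
X_train = ["cannot log in", "invoice is wrong", "login page is down"]
y_train = ["access", "billing", "access"]

tfidf_vect = TfidfVectorizer(analyzer="word", token_pattern=r"\w{1,}", max_features=5000)
tfidf_vect.fit(X_train)  # learn the vocabulary from the ticket text

# Dense float32 feature matrix.
xtrain_tfidf = tfidf_vect.transform(X_train).toarray().astype(np.float32)

# Map each category string to an integer id, then cast to float32.
label_encoder = LabelEncoder()
ytrain_encoded = label_encoder.fit_transform(y_train).astype(np.float32)

print(xtrain_tfidf.dtype, ytrain_encoded.dtype)  # float32 float32
print(label_encoder.classes_.tolist())           # ['access', 'billing']
```

ytrain_encoded can then be passed as the labels argument in place of y_train, and label_encoder.inverse_transform maps integer ids back to category names.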
