Preprocessing data for Sagemaker Inference Pipeline with Blazingtext

I'm trying to figure out the best way to preprocess my input data for my inference endpoint on AWS Sagemaker. I'm using the BlazingText algorithm.

I'm not really sure of the best way forward, and I'd be thankful for any pointers.

I currently train my model using a Jupyter notebook in Sagemaker, and that works wonderfully. The problem is that I use NLTK to clean my data (Swedish stopwords, stemming, etc.):

import nltk
nltk.download('punkt')
nltk.download('stopwords')
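
For context, the cleaning logic looks roughly like this (a minimal sketch; the tokenizer, stopword list, and stemmer choices shown here are illustrative assumptions, not my verbatim code):

from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

swedish_stopwords = set(stopwords.words('swedish'))
stemmer = SnowballStemmer('swedish')

def clean(text):
    # lowercase, tokenize with the Swedish punkt model,
    # drop non-alphabetic tokens and stopwords, then stem
    tokens = word_tokenize(text.lower(), language='swedish')
    tokens = [t for t in tokens if t.isalpha() and t not in swedish_stopwords]
    return ' '.join(stemmer.stem(t) for t in tokens)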

So the question really is: how do I get the same preprocessing logic into the inference endpoint?

I have a couple of thoughts about how to proceed:

  • Build a Docker container with the Python libs & data installed, with the sole purpose of preprocessing the data. Then use this container in the inference pipeline.

  • Supply the Python libs and script to an existing container, in the same way you can supply external libs to a notebook.

  • Build a custom fastText container with the libs I need and run it outside of Sagemaker.

  • Will probably work, but feels like a "hack": build a Lambda function that has the proper Python libs & data installed and calls the Sagemaker endpoint (see the sketch after this list). I'm worried about cold-start delays, as the prediction traffic volume will be low.
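
A minimal sketch of that Lambda idea, assuming a JSON event with a text field, a clean() helper like the one above bundled with the function, and a hypothetical endpoint name:

import json

import boto3

from preprocessing import clean  # hypothetical module holding the NLTK logic above

runtime = boto3.client('sagemaker-runtime')
ENDPOINT_NAME = 'blazingtext-endpoint'  # assumption: substitute your endpoint name

def handler(event, context):
    # apply the same NLTK cleaning used at training time
    cleaned = clean(event['text'])
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType='application/json',
        # BlazingText endpoints accept JSON of the form {"instances": [...]}
        Body=json.dumps({'instances': [cleaned]}),
    )
    return json.loads(response['Body'].read())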

I would like to go with the first option, but I'm struggling a bit to understand whether there is a Docker image that I could build from and add my dependencies to, or whether I need to build something from the ground up. For instance, would the image sagemaker-sparkml-serving:2.2 be a good candidate?

But maybe there is a better way all around?

Niclas

I would suggest you try out BentoML.

BentoML is an open-source framework for high-performance model serving. It turns an ML model into a production API endpoint with just a few lines of code.

For your use case:

In your notebook on Sagemaker, after you have trained your model, you can define a prediction service spec with BentoML. The following code builds a BentoService that expects a PickleArtifact, exposes an API endpoint called predict, and automatically includes pip dependencies:

%%writefile my_service.py

from bentoml import api, BentoService, artifacts, env
from bentoml.artifact import PickleArtifact
from bentoml.handlers import DataframeHandler

@env(auto_pip_dependencies=True)
@artifacts([PickleArtifact('my_model')])
class MyTextClassification(BentoService):
    def preprocess(self, raw_data):
        # same NLTK cleaning (Swedish stopwords, stemming) as used in training
        ...
        return processed_data

    @api(DataframeHandler)
    def predict(self, df):
        processed_data = self.preprocess(df)
        return self.artifacts.my_model.predict(processed_data)

In the next cell, load the class defined in the previous cell and save the model:

from my_service import MyTextClassification

service = MyTextClassification()
service.pack('my_model', trained_model)
service.save()
# could also save the model to an S3 bucket
# service.save('s3://my_bucket')
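
Before deploying, you can optionally sanity-check the saved service locally; the curl payload below is an assumption based on DataframeHandler accepting pandas-style JSON records:

!bentoml serve MyTextClassification:latest
# then, from another terminal:
# curl -X POST -H "Content-Type: application/json" \
#      -d '[{"text": "exempel på svensk text"}]' \
#      http://localhost:5000/predict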

You can easily deploy to Sagemaker from within the notebook:

!bentoml sagemaker deploy my-deployment -b {service.name}:{service.version} --api-name predict

Here is an example of that. It uses Keras to build a text classification model and then deploys the model to Sagemaker for model serving:

https://github.com/bentoml/gallery/blob/master/keras/text-classification/keras-text-classification.ipynb

Disclaimer: I am one of the authors of BentoML.
