
Is it possible to load a pretrained Pytorch model from a GCS bucket URL without first persisting locally?

I'm asking this in the context of Google Dataflow, but also generally.

Using PyTorch, I can reference a local directory containing multiple files that comprise a pretrained model. I happen to be working with a Roberta model, but the interface is the same for others.

ls some-directory/
    added_tokens.json
    config.json
    merges.txt
    pytorch_model.bin
    special_tokens_map.json
    vocab.json

from pytorch_transformers import RobertaModel

# this works
model = RobertaModel.from_pretrained('/path/to/some-directory/')

However, my pretrained model is stored in a GCS bucket. Let's call it gs://my-bucket/roberta/.

In the context of loading this model in Google Dataflow, I'm trying to remain stateless and avoid persisting to disk, so my preference would be to get this model straight from GCS. As I understand it, the PyTorch general interface method from_pretrained() can take the string representation of a local dir OR a URL. However, I can't seem to load the model from a GCS URL.

# this fails
model = RobertaModel.from_pretrained('gs://my-bucket/roberta/')
# ValueError: unable to parse gs://my-bucket/roberta/ as a URL or as a local path

If I try to use the public https URL of the directory blob, it will also fail, although that is likely due to lack of authentication since the credentials referenced in the python environment that can create clients don't translate to public requests to https://storage.googleapis如果我尝试使用目录 blob 的公共 https URL,它也会失败,尽管这可能是由于缺乏身份验证,因为在可以创建客户端的 python 环境中引用的凭据不会转换为对https://storage.googleapis公共请求https://storage.googleapis

# this fails, probably due to auth
from google.cloud import storage

gcs_client = storage.Client()
bucket = gcs_client.get_bucket('my-bucket')
directory_blob = bucket.blob('roberta')
model = RobertaModel.from_pretrained(directory_blob.public_url)
# ValueError: No JSON object could be decoded

# and for good measure, it also fails if I append a trailing /
model = RobertaModel.from_pretrained(directory_blob.public_url + '/')
# ValueError: No JSON object could be decoded

I understand that GCS doesn't actually have subdirectories; it's really just a flat namespace under the bucket name. However, it seems like I'm blocked by the necessity of authentication and by PyTorch not speaking gs://.

I can get around this by persisting the files locally first.

from pytorch_transformers import RobertaModel
from google.cloud import storage
import os
import tempfile

local_dir = tempfile.mkdtemp()
gcs = storage.Client()
bucket = gcs.get_bucket(bucket_name)
blobs = bucket.list_blobs(prefix=blob_prefix)
for blob in blobs:
    blob.download_to_filename(os.path.join(local_dir, os.path.basename(blob.name)))
model = RobertaModel.from_pretrained(local_dir)
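
A slightly tidier version of the same workaround (a sketch, using the same placeholder bucket_name and blob_prefix) scopes the download to a tempfile.TemporaryDirectory so the local copies are deleted as soon as the model has been loaded into memory:

from pytorch_transformers import RobertaModel
from google.cloud import storage
import os
import tempfile

def load_roberta_from_gcs(bucket_name, blob_prefix):
    gcs = storage.Client()
    bucket = gcs.get_bucket(bucket_name)
    with tempfile.TemporaryDirectory() as local_dir:
        for blob in bucket.list_blobs(prefix=blob_prefix):
            blob.download_to_filename(
                os.path.join(local_dir, os.path.basename(blob.name)))
        # from_pretrained reads everything it needs before the block exits,
        # so the temporary directory can be cleaned up immediately afterwards.
        return RobertaModel.from_pretrained(local_dir)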

But this seems like such a hack, and I keep thinking I must be missing something. Surely there's a way to stay stateless and not have to rely on disk persistence!

  • So is there a way to load a pretrained model stored in GCS?
  • Is there a way to authenticate when doing the public URL request in this context?
  • Even if there is a way to authenticate, will the non-existence of subdirectories still be an issue?

Thanks for the help! I'm also happy to be pointed to any duplicate questions 'cause I sure couldn't find any.


Edits and Clarifications

  • My Python session is already authenticated to GCS, which is why I'm able to download the blob files locally and then point to that local directory with from_pretrained()

  • from_pretrained() requires a directory reference because it needs all the files listed at the top of the question, not just pytorch_model.bin

  • To clarify question #2, I was wondering if there's some way of giving the PyTorch method a request URL that has encrypted credentials embedded, or something like that (see the signed-URL sketch after this list). Kind of a longshot, but I wanted to make sure I hadn't missed anything.

  • To clarify question #3 (in addition to the comment on one answer below), even if there's a way to embed credentials in the URL that I don't know about, I still need to reference a directory rather than a single blob, and I don't know whether the GCS "subdirectory" would be recognized as such, because (as the Google docs state) subdirectories in GCS are an illusion and don't represent a real directory structure. So I think this question is irrelevant, or at least blocked by question #2, but it's a thread I chased for a bit, so I'm still curious.
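
Regarding question #2, the closest thing I've found to "credentials embedded in the URL" is a signed URL: the storage client can generate a time-limited https URL for a single blob without making it public. A minimal sketch (note it only covers one object at a time, so it doesn't solve the directory problem from question #3, and signing requires credentials that include a private key, e.g. a service-account key file):

from datetime import timedelta
from google.cloud import storage

gcs = storage.Client()
bucket = gcs.get_bucket('my-bucket')
blob = bucket.blob('roberta/pytorch_model.bin')

# A time-limited URL that embeds a signature instead of requiring auth headers.
signed_url = blob.generate_signed_url(expiration=timedelta(hours=1))
# from_pretrained() still expects a directory (or a single archive URL),
# so this only gets you per-file access, not the whole model directory.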

MAJOR EDIT:

You can install wheel files on Dataflow workers, and you can also use worker temp storage to persist binary files locally!

It's true that (currently, as of Nov 2019) you can't do this by supplying a --requirements argument. Instead you have to use setup.py like this. Assume any constants IN CAPS are defined elsewhere.

from setuptools import setup, find_packages

REQUIRED_PACKAGES = [
    'torch==1.3.0',
    'pytorch-transformers==1.2.0',
]

setup(
    name='project_dir',
    version=VERSION,
    packages=find_packages(),
    install_requires=REQUIRED_PACKAGES)

Run script

python setup.py sdist

python project_dir/my_dataflow_job.py \
--runner DataflowRunner \
--project ${GCP_PROJECT} \
--extra_package dist/project_dir-0.1.0.tar.gz \
# SNIP custom args for your job and required Dataflow Temp and Staging buckets #

And within the job, here's downloading and using the model from GCS in the context of a custom Dataflow operator. For convenience we wrapped a few utility methods in a SEPARATE MODULE (important for getting around Dataflow dependency uploads) and imported them at the LOCAL SCOPE of the custom operator, not globally.

class AddColumn(beam.DoFn):
    PRETRAINED_MODEL = 'gs://my-bucket/blah/roberta-model-files/'

    def get_model_tokenizer_wrapper(self):
        import logging
        import shutil
        import tempfile
        import dataflow_util as util
        try:
            return self.model_tokenizer_wrapper
        except AttributeError:
            tmp_dir = tempfile.mkdtemp() + '/'
            util.download_tree(self.PRETRAINED_MODEL, tmp_dir)
            model, tokenizer = util.create_model_and_tokenizer(tmp_dir)
            model_tokenizer_wrapper = util.PretrainedPyTorchModelWrapper(
                model, tokenizer)
            shutil.rmtree(tmp_dir)
            self.model_tokenizer_wrapper = model_tokenizer_wrapper
            logging.info(
                'Successfully created PretrainedPyTorchModelWrapper')
            return self.model_tokenizer_wrapper

    def process(self, elem):
        model_tokenizer_wrapper = self.get_model_tokenizer_wrapper()

        # And now use that wrapper to process your elem however you need.
        # Note that when you read from BQ your elements are dictionaries
        # of the column names and values for each BQ row.

Utility functions live in a SEPARATE MODULE within the codebase. In our case this was in dataflow_util/__init__.py at the project root, but you don't have to do it that way.

from contextlib import closing
import logging

import apache_beam as beam
import numpy as np
from pytorch_transformers import RobertaModel, RobertaTokenizer
import torch

class PretrainedPyTorchModelWrapper():
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

def download_tree(gcs_dir, local_dir):
    gcs = beam.io.gcp.gcsio.GcsIO()
    assert gcs_dir.endswith('/')
    assert local_dir.endswith('/')
    for entry in gcs.list_prefix(gcs_dir):
        download_file(gcs, gcs_dir, local_dir, entry)


def download_file(gcs, gcs_dir, local_dir, entry):
    rel_path = entry[len(gcs_dir):]
    dest_path = local_dir + rel_path
    logging.info('Downloading %s', dest_path)
    with closing(gcs.open(entry)) as f_read:
        with open(dest_path, 'wb') as f_write:
            # Download the file in chunks to avoid requiring large amounts of
            # RAM when downloading large files.
            while True:
                file_data_chunk = f_read.read(
                    beam.io.gcp.gcsio.DEFAULT_READ_BUFFER_SIZE)
                if len(file_data_chunk):
                    f_write.write(file_data_chunk)
                else:
                    break


def create_model_and_tokenizer(local_model_path_str):
    """
    Instantiate transformer model and tokenizer

      :param local_model_path_str: string representation of the local path 
             to the directory containing the pretrained model
      :return: model, tokenizer
    """
    model_class, tokenizer_class = (RobertaModel, RobertaTokenizer)

    # Load the pretrained tokenizer and model
    tokenizer = tokenizer_class.from_pretrained(local_model_path_str)
    model = model_class.from_pretrained(local_model_path_str)

    return model, tokenizer

And there you have it, folks! More details can be found here: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/


What I had discovered earlier (before the MAJOR EDIT above) is that this whole chain of questioning was irrelevant, because Dataflow only allows you to install source distribution packages on workers, which means you can't actually install PyTorch.

When you supply a requirements.txt file, Dataflow installs with the --no-binary flag, which prevents installation of wheel (.whl) packages and only allows source distributions (.tar.gz). I decided that trying to roll my own source distribution for PyTorch, which is half C++, part CUDA, and part who-knows-what, was a fool's errand.

Thanks for the input along the way y'all.

I don't know much about PyTorch or the Roberta model, but I'll try to answer your inquiries referring to GCS:

1.- "So is there a way to load a pretrained model stored in GCS?" 1.-“那么有没有办法加载存储在 GCS 中的预训练模型?”

In case your model can load the Blob directly from binary:

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("bucket name")
blob = bucket.blob("path_to_blob/blob_name.ext")
data = blob.download_as_string() # this returns the blob's contents as bytes in memory

2.- "Is there a way to authenticate when doing the public URL request in this context?" 2.-“在这种情况下执行公共 URL 请求时有没有办法进行身份验证?”

Here's the tricky part: depending on the context in which you are running the script, it will be authenticated with a default service account. So when you are using the official GCP libs you can:

A.- Give permissions to that default service account to access your bucket/objects, for example as sketched below.
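
For option A, one way to grant that access is through the storage client's IAM helpers (a sketch; the service account email is a placeholder for your project's default service account):

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('bucket_name')

policy = bucket.get_iam_policy()
# 'roles/storage.objectViewer' is enough for read-only access to the objects.
policy['roles/storage.objectViewer'].add(
    'serviceAccount:your-default-sa@your-project.iam.gserviceaccount.com')
bucket.set_iam_policy(policy)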

B.- Create a new service account and authenticate with it inside the script (you will need to generate a key file for that service account as well):

from google.cloud import storage
from google.oauth2 import service_account

# use a real storage scope, e.g. devstorage.read_only or devstorage.full_control
STORAGE_SCOPES = ['https://www.googleapis.com/auth/devstorage.read_only']
SERVICE_ACCOUNT_FILE = 'key.json'

cred = service_account.Credentials.from_service_account_file(
    SERVICE_ACCOUNT_FILE, scopes=STORAGE_SCOPES)

client = storage.Client(credentials=cred)
bucket = client.get_bucket("bucket_name")
blob = bucket.blob("path/object.ext")
data = blob.download_as_string()

However, that works because the official libs handle authentication for the API calls in the background, which is why it doesn't work in the case of the from_pretrained() function.

So an alternative to that is making the object public, so you can access it when using the public URL.
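
For completeness, making a single object public can be done from the same client (a sketch; note this exposes the file to anyone with the URL, and it has to be done per object):

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("bucket_name")
blob = bucket.blob("path_to_blob/blob_name.ext")

# Grants allUsers read access to this one object and exposes its public URL.
blob.make_public()
print(blob.public_url)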

3.- "Even if there is a way to authenticate, will the non-existence of subdirectories still be an issue?" 3.-“即使有一种方法可以进行身份​​验证,子目录不存在仍然是一个问题吗?”

Not sure what you mean here; you can have folders inside your bucket.
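
Strictly speaking those "folders" are just object-name prefixes, but the client lets you treat them like directories when listing. A small sketch showing both views of gs://my-bucket/roberta/:

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('my-bucket')

# Flat view: every object whose name starts with the prefix.
for blob in bucket.list_blobs(prefix='roberta/'):
    print(blob.name)          # e.g. roberta/config.json, roberta/pytorch_model.bin, ...

# Directory-style view: a delimiter groups the "subdirectories" into prefixes.
iterator = bucket.list_blobs(prefix='', delimiter='/')
list(iterator)                # consume the iterator so .prefixes is populated
print(iterator.prefixes)      # e.g. {'roberta/'}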

Currently I'm not playing with Roberta, but with BERT for token classification for NER; I think it has the same mechanism though.

Below is my code:

import io
import os

import torch
from google.cloud import storage
from pytorch_transformers import BertForTokenClassification

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'your_gcs_auth.json'

# initiate storage
client = storage.Client()
en_bucket = client.get_bucket('your-gcs-bucketname')

# get blob
en_model_blob = en_bucket.get_blob('your-modelname-in-gcsbucket.bin')
en_model = en_model_blob.download_as_string()

# the model was downloaded as bytes, so wrap it in an in-memory buffer
buffer = io.BytesIO(en_model)

# prepare and load the model
# (main_config is a BertConfig instance built elsewhere -- see the note below)
state_dict = torch.load(buffer, map_location=torch.device('cpu'))
model = BertForTokenClassification.from_pretrained(
    pretrained_model_name_or_path=None, state_dict=state_dict, config=main_config)
model.load_state_dict(state_dict)

As far as I know, download_as_string() does not save the data to local disk (it just returns the bytes in memory), whereas in my experience download_to_filename() does download the model to a local file.

Also, if you modified the config for your transformer network (and you put this in GCS and need to load it as well), you need to modify the PretrainedConfig class too, so that it can handle the data produced by the download_as_string() function.
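
An alternative that avoids touching PretrainedConfig internals (a sketch, assuming the modified config.json also lives in the bucket; the blob path is a placeholder) is to download the JSON the same way and build the config from a dict:

import json

from google.cloud import storage
from pytorch_transformers import BertConfig

client = storage.Client()
bucket = client.get_bucket('your-gcs-bucketname')

config_blob = bucket.get_blob('path-to-your/config.json')
config_dict = json.loads(config_blob.download_as_string())

# BertConfig inherits from_dict() from PretrainedConfig, so no local file is needed.
main_config = BertConfig.from_dict(config_dict)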

cheers, hope it helps

As you correctly stated, it seems that out of the box pytorch-transformers does not support this, but mainly just because it does not recognize the file link as a URL.

After some searching, I found the corresponding error message in this source file, around lines 144-155.

Of course, you could try adding your 'gs' tag to line 144, and then interpret your connection to GCS as an http request (lines 269-272). If GCS accepts this, that should be the only thing required to change in order for it to work.
If this does not work, the only immediate fix would be to implement something analogous to the Amazon S3 bucket functions, but I don't know enough about S3 and GCS buckets to claim any meaningful judgement here.
