Multi-GPU training of AllenNLP coreference resolution

Question

I'm trying to replicate (or come close to) the results obtained by the End-to-end Neural Coreference Resolution paper on the CoNLL-2012 shared task. I intend to make some enhancements on top of this, so I decided to use AllenNLP's CoreferenceResolver. This is how I'm initialising and training the model:

import torch
from allennlp.common import Params
from allennlp.data import Vocabulary
from allennlp.data.dataset_readers import ConllCorefReader
from allennlp.data.dataset_readers.dataset_utils import Ontonotes
from allennlp.data.iterators import BasicIterator, MultiprocessIterator
from allennlp.data.token_indexers import SingleIdTokenIndexer, TokenCharactersIndexer
from allennlp.models import CoreferenceResolver
from allennlp.modules import Embedding, FeedForward
from allennlp.modules.seq2seq_encoders import PytorchSeq2SeqWrapper
from allennlp.modules.seq2vec_encoders import CnnEncoder
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.modules.token_embedders import TokenCharactersEncoder
from allennlp.training import Trainer
from allennlp.training.learning_rate_schedulers import LearningRateScheduler
from torch.nn import LSTM, ReLU
from torch.optim import Adam


def read_data(directory_path):
    data = []
    for file_path in Ontonotes().dataset_path_iterator(directory_path):
        data += dataset_reader.read(file_path)
    return data


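INPUT_FILE_PATH_TEMPLATE = "data/CoNLL-2012/v4/data/%s"
# Assumed value (not defined in the original snippet): 300 matches glove.840B.300d.txt,
# and 300 + the 100-dim character CNN output gives the LSTM input_size of 400 used below.
embeddings_dimension = 300
dataset_reader = ConllCorefReader(10, {"tokens": SingleIdTokenIndexer(),
                                       "token_characters": TokenCharactersIndexer()})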
training_data = read_data(INPUT_FILE_PATH_TEMPLATE % "train")
validation_data = read_data(INPUT_FILE_PATH_TEMPLATE % "development")

vocabulary = Vocabulary.from_instances(training_data + validation_data)
model = CoreferenceResolver(vocab=vocabulary,
                            text_field_embedder=BasicTextFieldEmbedder({"tokens": Embedding.from_params(vocabulary, Params({"embedding_dim": embeddings_dimension, "pretrained_file": "glove.840B.300d.txt"})),
                                                                        "token_characters": TokenCharactersEncoder(embedding=Embedding(num_embeddings=vocabulary.get_vocab_size("token_characters"), embedding_dim=8, vocab_namespace="token_characters"),
                                                                                                                   encoder=CnnEncoder(embedding_dim=8, num_filters=50, ngram_filter_sizes=(3, 4, 5), output_dim=100))}),
                            context_layer=PytorchSeq2SeqWrapper(LSTM(input_size=400, hidden_size=200, num_layers=1, dropout=0.2, bidirectional=True, batch_first=True)),
                            mention_feedforward=FeedForward(input_dim=1220, num_layers=2, hidden_dims=[150, 150], activations=[ReLU(), ReLU()], dropout=[0.2, 0.2]),
                            antecedent_feedforward=FeedForward(input_dim=3680, num_layers=2, hidden_dims=[150, 150], activations=[ReLU(), ReLU()], dropout=[0.2, 0.2]),
                            feature_size=20,
                            max_span_width=10,
                            spans_per_word=0.4,
                            max_antecedents=250,
                            lexical_dropout=0.5)

if torch.cuda.is_available():
    cuda_device = 0
    model = model.cuda(cuda_device)
else:
    cuda_device = -1

iterator = BasicIterator(batch_size=1)
iterator.index_with(vocabulary)
optimiser = Adam(model.parameters(), weight_decay=0.1)
Trainer(model=model,
        train_dataset=training_data,
        validation_dataset=validation_data,
        optimizer=optimiser,
        learning_rate_scheduler=LearningRateScheduler.from_params(optimiser, Params({"type": "step", "step_size": 100})),
        iterator=iterator,
        num_epochs=150,
        patience=1,
        cuda_device=cuda_device).train()

After reading the data I tried to train the model, but ran out of GPU memory: RuntimeError: CUDA out of memory. Tried to allocate 4.43 GiB (GPU 0; 11.17 GiB total capacity; 3.96 GiB already allocated; 3.40 GiB free; 3.47 GiB cached). I therefore attempted to use multiple GPUs to train this model. I'm using Tesla K80s (which have 12 GiB of memory).

I tried AllenNLP's MultiprocessIterator, initialising the iterator as MultiprocessIterator(BasicIterator(batch_size=1), num_workers=torch.cuda.device_count()). However, only 1 GPU is being used (judging by the memory usage reported by the nvidia-smi command), and I got the error below. I also tried fiddling with its parameters (increasing num_workers or decreasing output_queue_size) and with the ulimit (as mentioned in this PyTorch issue), to no avail.

Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/.local/lib/python3.6/site-packages/allennlp/data/iterators/multiprocess_iterator.py", line 32, in _create_tensor_dicts
    output_queue.put(tensor_dict)
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/.local/lib/python3.6/site-packages/allennlp/data/iterators/multiprocess_iterator.py", line 32, in _create_tensor_dicts
    output_queue.put(tensor_dict)
  File "<string>", line 2, in put
  File "<string>", line 2, in put
  File "/usr/lib/python3.6/multiprocessing/managers.py", line 772, in _callmethod
    raise convert_to_error(kind, result)
  File "/usr/lib/python3.6/multiprocessing/managers.py", line 772, in _callmethod
    raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError: 
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/managers.py", line 228, in serve_client
    request = recv()
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
  File "/home/user/.local/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 276, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
    return recvfds(s, 1)[0]
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 161, in recvfds
    len(ancdata))
RuntimeError: received 0 items of ancdata
---------------------------------------------------------------------------

I also tried achieving this through PyTorch's DataParallel, by wrapping the model's context_layer, mention_feedforward and antecedent_feedforward in a custom DataParallelWrapper (to provide compatibility with the class functions AllenNLP assumes). Still, only 1 GPU is used, and it eventually runs out of memory as before.

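from torch.nn import DataParallel


class DataParallelWrapper(DataParallel):
    def __init__(self, module):
        super().__init__(module)

    # AllenNLP queries these dimension getters on the wrapped modules.
    def get_output_dim(self):
        return self.module.get_output_dim()

    def get_input_dim(self):
        return self.module.get_input_dim()

    # Note: this override calls the wrapped module directly, so DataParallel's own
    # forward (scatter, replicate, parallel_apply, gather) never runs, and `inputs`
    # is passed as a single tuple rather than unpacked.
    def forward(self, *inputs):
        return self.module.forward(inputs)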

Answer 1

After some digging through the code I found out that AllenNLP does this under the hood, directly through its Trainer. The cuda_device can be either a single int (for single-GPU training) or a list of ints (for multi-GPU training):

cuda_device : Union[int, List[int]], optional (default = -1). An integer or list of integers specifying the CUDA device(s) to use. If -1, the CPU is used.

So all the GPU devices needed should be passed instead:

if torch.cuda.is_available():
    cuda_device = list(range(torch.cuda.device_count()))
    model = model.cuda(cuda_device[0])
else:
    cuda_device = -1

Note that the model still has to be moved to the GPU manually (via model.cuda(...)), as it would otherwise try to use multiple CPUs instead.
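The device selection above can be factored into small helpers so the logic can be checked without a GPU. This is only a sketch: select_cuda_device and primary_device are hypothetical names, not part of AllenNLP or PyTorch, and the availability flag and device count are passed in as plain arguments (in practice they would come from torch.cuda.is_available() and torch.cuda.device_count()).

```python
from typing import List, Union


def select_cuda_device(cuda_available: bool, device_count: int) -> Union[int, List[int]]:
    """Hypothetical helper: build a value for the Trainer's cuda_device argument.

    Returns a list of all device indices when GPUs are usable, else -1 (CPU).
    """
    if cuda_available and device_count > 0:
        return list(range(device_count))
    return -1


def primary_device(cuda_device: Union[int, List[int]]) -> int:
    """Hypothetical helper: the single device the model is moved to with model.cuda(...)."""
    if isinstance(cuda_device, list):
        return cuda_device[0]
    return cuda_device


# Example with 4 visible GPUs: all indices go to the Trainer, the model goes to the first.
devices = select_cuda_device(True, 4)
print(devices, primary_device(devices))
```

With these helpers, the snippet in the answer would amount to computing the device value once, calling model.cuda(primary_device(...)) when the result is not -1, and passing the full value as cuda_device to the Trainer.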
