![](/img/trans.png)
[英]TensorFlow MirroredStrategy() not working for multi-gpu training
[英]Multi-GPU training of AllenNLP coreference resolution
我正在嘗試復制(或接近於)有關CoNLL-2012共享任務 的端到端神經共治決議論文所獲得的結果。 我打算做在此之上的一些增強功能,所以我決定用AllenNLP的CoreferenceResolver
。 這就是我初始化和訓練模型的方式:
import torch
from allennlp.common import Params
from allennlp.data import Vocabulary
from allennlp.data.dataset_readers import ConllCorefReader
from allennlp.data.dataset_readers.dataset_utils import Ontonotes
from allennlp.data.iterators import BasicIterator, MultiprocessIterator
from allennlp.data.token_indexers import SingleIdTokenIndexer, TokenCharactersIndexer
from allennlp.models import CoreferenceResolver
from allennlp.modules import Embedding, FeedForward
from allennlp.modules.seq2seq_encoders import PytorchSeq2SeqWrapper
from allennlp.modules.seq2vec_encoders import CnnEncoder
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.modules.token_embedders import TokenCharactersEncoder
from allennlp.training import Trainer
from allennlp.training.learning_rate_schedulers import LearningRateScheduler
from torch.nn import LSTM, ReLU
from torch.optim import Adam
def read_data(directory_path):
data = []
for file_path in Ontonotes().dataset_path_iterator(directory_path):
data += dataset_reader.read(file_path)
return data
INPUT_FILE_PATH_TEMPLATE = "data/CoNLL-2012/v4/data/%s"
dataset_reader = ConllCorefReader(10, {"tokens": SingleIdTokenIndexer(),
"token_characters": TokenCharactersIndexer()})
training_data = read_data(INPUT_FILE_PATH_TEMPLATE % "train")
validation_data = read_data(INPUT_FILE_PATH_TEMPLATE % "development")
vocabulary = Vocabulary.from_instances(training_data + validation_data)
model = CoreferenceResolver(vocab=vocabulary,
text_field_embedder=BasicTextFieldEmbedder({"tokens": Embedding.from_params(vocabulary, Params({"embedding_dim": embeddings_dimension, "pretrained_file": "glove.840B.300d.txt"})),
"token_characters": TokenCharactersEncoder(embedding=Embedding(num_embeddings=vocabulary.get_vocab_size("token_characters"), embedding_dim=8, vocab_namespace="token_characters"),
encoder=CnnEncoder(embedding_dim=8, num_filters=50, ngram_filter_sizes=(3, 4, 5), output_dim=100))}),
context_layer=PytorchSeq2SeqWrapper(LSTM(input_size=400, hidden_size=200, num_layers=1, dropout=0.2, bidirectional=True, batch_first=True)),
mention_feedforward=FeedForward(input_dim=1220, num_layers=2, hidden_dims=[150, 150], activations=[ReLU(), ReLU()], dropout=[0.2, 0.2]),
antecedent_feedforward=FeedForward(input_dim=3680, num_layers=2, hidden_dims=[150, 150], activations=[ReLU(), ReLU()], dropout=[0.2, 0.2]),
feature_size=20,
max_span_width=10,
spans_per_word=0.4,
max_antecedents=250,
lexical_dropout=0.5)
if torch.cuda.is_available():
cuda_device = 0
model = model.cuda(cuda_device)
else:
cuda_device = -1
iterator = BasicIterator(batch_size=1)
iterator.index_with(vocabulary)
optimiser = Adam(model.parameters(), weight_decay=0.1)
Trainer(model=model,
train_dataset=training_data,
validation_dataset=validation_data,
optimizer=optimiser,
learning_rate_scheduler=LearningRateScheduler.from_params(optimiser, Params({"type": "step", "step_size": 100})),
iterator=iterator,
num_epochs=150,
patience=1,
cuda_device=cuda_device).train()
讀取數據后,我對模型進行了訓練,但是GPU內存不足: RuntimeError: CUDA out of memory. Tried to allocate 4.43 GiB (GPU 0; 11.17 GiB total capacity; 3.96 GiB already allocated; 3.40 GiB free; 3.47 GiB cached)
RuntimeError: CUDA out of memory. Tried to allocate 4.43 GiB (GPU 0; 11.17 GiB total capacity; 3.96 GiB already allocated; 3.40 GiB free; 3.47 GiB cached)
。 因此,我嘗試利用多個GPU訓練此模型。 我正在使用Tesla K80(具有12GiB內存)。
我試圖通過將iterator
為MultiprocessIterator(BasicIterator(batch_size=1), num_workers=torch.cuda.device_count())
嘗試利用AllenNLP的MultiprocessIterator
。 但是,僅使用了1個GPU(通過nvidia-smi
命令監視內存使用情況)並出現以下錯誤。 我還嘗試擺弄它的參數(增加num_workers
或減少output_queue_size
)和ulimit
(如本PyTorch問題所述 )無效。
Process Process-3:
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/user/.local/lib/python3.6/site-packages/allennlp/data/iterators/multiprocess_iterator.py", line 32, in _create_tensor_dicts
output_queue.put(tensor_dict)
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/user/.local/lib/python3.6/site-packages/allennlp/data/iterators/multiprocess_iterator.py", line 32, in _create_tensor_dicts
output_queue.put(tensor_dict)
File "<string>", line 2, in put
File "<string>", line 2, in put
File "/usr/lib/python3.6/multiprocessing/managers.py", line 772, in _callmethod
raise convert_to_error(kind, result)
File "/usr/lib/python3.6/multiprocessing/managers.py", line 772, in _callmethod
raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError:
---------------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/managers.py", line 228, in serve_client
request = recv()
File "/usr/lib/python3.6/multiprocessing/connection.py", line 251, in recv
return _ForkingPickler.loads(buf.getbuffer())
File "/home/user/.local/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 276, in rebuild_storage_fd
fd = df.detach()
File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File "/usr/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
return recvfds(s, 1)[0]
File "/usr/lib/python3.6/multiprocessing/reduction.py", line 161, in recvfds
len(ancdata))
RuntimeError: received 0 items of ancdata
---------------------------------------------------------------------------
我還嘗試通過PyTorch的DataParallel實現此目標 ,方法是使用自定義的DataParallelWrapper
包裝該模型的context_layer
, mention_feedforward
, antecedent_feedforward
(以提供與AllenNLP假定的類函數的兼容性)。 盡管如此,僅使用了1個GPU,它最終還是像以前一樣耗盡了內存。
class DataParallelWrapper(DataParallel):
def __init__(self, module):
super().__init__(module)
def get_output_dim(self):
return self.module.get_output_dim()
def get_input_dim(self):
return self.module.get_input_dim()
def forward(self, *inputs):
return self.module.forward(inputs)
在仔細研究了代碼之后,我發現AllenNLP直接通過其Trainer在后台進行此操作。 cuda_device
可以是單個int
(對於單處理)或int
list
(對於多處理):
cuda_device
:Union[int, List[int]]
,可選(默認= -1)一個整數或整數列表,指定要使用的CUDA設備。 如果為-1,則使用CPU。
因此,應該傳遞所有需要的GPU設備:
if torch.cuda.is_available():
cuda_device = list(range(torch.cuda.device_count()))
model = model.cuda(cuda_device[0])
else:
cuda_device = -1
請注意,仍然必須將model
手動移動到GPU(通過model.cuda(...)
),否則它將嘗試使用多個CPU。
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.