Multi-GPU training of AllenNLP coreference resolution
I'm trying to replicate (or come close to) the results obtained by the End-to-end Neural Coreference Resolution paper on the CoNLL-2012 shared task. I intend to make some enhancements on top of this, so I decided to use AllenNLP's CoreferenceResolver. This is how I'm initialising and training the model:
import torch
from allennlp.common import Params
from allennlp.data import Vocabulary
from allennlp.data.dataset_readers import ConllCorefReader
from allennlp.data.dataset_readers.dataset_utils import Ontonotes
from allennlp.data.iterators import BasicIterator, MultiprocessIterator
from allennlp.data.token_indexers import SingleIdTokenIndexer, TokenCharactersIndexer
from allennlp.models import CoreferenceResolver
from allennlp.modules import Embedding, FeedForward
from allennlp.modules.seq2seq_encoders import PytorchSeq2SeqWrapper
from allennlp.modules.seq2vec_encoders import CnnEncoder
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.modules.token_embedders import TokenCharactersEncoder
from allennlp.training import Trainer
from allennlp.training.learning_rate_schedulers import LearningRateScheduler
from torch.nn import LSTM, ReLU
from torch.optim import Adam
def read_data(directory_path):
    # Collect the instances from every file under the given directory.
    data = []
    for file_path in Ontonotes().dataset_path_iterator(directory_path):
        data += dataset_reader.read(file_path)
    return data
INPUT_FILE_PATH_TEMPLATE = "data/CoNLL-2012/v4/data/%s"
dataset_reader = ConllCorefReader(10, {"tokens": SingleIdTokenIndexer(),
                                       "token_characters": TokenCharactersIndexer()})
training_data = read_data(INPUT_FILE_PATH_TEMPLATE % "train")
validation_data = read_data(INPUT_FILE_PATH_TEMPLATE % "development")
vocabulary = Vocabulary.from_instances(training_data + validation_data)
# Not defined in the original snippet; 300 matches the glove.840B.300d vectors
# and, together with the 100-dimensional character encoder, the LSTM input_size of 400.
embeddings_dimension = 300
model = CoreferenceResolver(
    vocab=vocabulary,
    text_field_embedder=BasicTextFieldEmbedder({
        "tokens": Embedding.from_params(vocabulary, Params({"embedding_dim": embeddings_dimension, "pretrained_file": "glove.840B.300d.txt"})),
        "token_characters": TokenCharactersEncoder(
            embedding=Embedding(num_embeddings=vocabulary.get_vocab_size("token_characters"), embedding_dim=8, vocab_namespace="token_characters"),
            encoder=CnnEncoder(embedding_dim=8, num_filters=50, ngram_filter_sizes=(3, 4, 5), output_dim=100))}),
    context_layer=PytorchSeq2SeqWrapper(LSTM(input_size=400, hidden_size=200, num_layers=1, dropout=0.2, bidirectional=True, batch_first=True)),
    mention_feedforward=FeedForward(input_dim=1220, num_layers=2, hidden_dims=[150, 150], activations=[ReLU(), ReLU()], dropout=[0.2, 0.2]),
    antecedent_feedforward=FeedForward(input_dim=3680, num_layers=2, hidden_dims=[150, 150], activations=[ReLU(), ReLU()], dropout=[0.2, 0.2]),
    feature_size=20,
    max_span_width=10,
    spans_per_word=0.4,
    max_antecedents=250,
    lexical_dropout=0.5)
if torch.cuda.is_available():
    cuda_device = 0
    model = model.cuda(cuda_device)
else:
    cuda_device = -1
iterator = BasicIterator(batch_size=1)
iterator.index_with(vocabulary)
optimiser = Adam(model.parameters(), weight_decay=0.1)
Trainer(model=model,
        train_dataset=training_data,
        validation_dataset=validation_data,
        optimizer=optimiser,
        learning_rate_scheduler=LearningRateScheduler.from_params(optimiser, Params({"type": "step", "step_size": 100})),
        iterator=iterator,
        num_epochs=150,
        patience=1,
        cuda_device=cuda_device).train()
After reading the data I trained the model, but ran out of GPU memory:
RuntimeError: CUDA out of memory. Tried to allocate 4.43 GiB (GPU 0; 11.17 GiB total capacity; 3.96 GiB already allocated; 3.40 GiB free; 3.47 GiB cached)
Therefore, I attempted to make use of multiple GPUs to train this model. I'm using Tesla K80s (which have 12 GiB of memory).
I tried AllenNLP's MultiprocessIterator, by initialising the iterator as MultiprocessIterator(BasicIterator(batch_size=1), num_workers=torch.cuda.device_count()). However, only 1 GPU was used (as seen by monitoring the memory usage with the nvidia-smi command) and I got the error below. I also tried fiddling with its parameters (increasing num_workers or decreasing output_queue_size) and with the ulimit (as mentioned in this PyTorch issue), to no avail.
Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/.local/lib/python3.6/site-packages/allennlp/data/iterators/multiprocess_iterator.py", line 32, in _create_tensor_dicts
    output_queue.put(tensor_dict)
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/.local/lib/python3.6/site-packages/allennlp/data/iterators/multiprocess_iterator.py", line 32, in _create_tensor_dicts
    output_queue.put(tensor_dict)
  File "<string>", line 2, in put
  File "<string>", line 2, in put
  File "/usr/lib/python3.6/multiprocessing/managers.py", line 772, in _callmethod
    raise convert_to_error(kind, result)
  File "/usr/lib/python3.6/multiprocessing/managers.py", line 772, in _callmethod
    raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError:
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/managers.py", line 228, in serve_client
    request = recv()
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
  File "/home/user/.local/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 276, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
    return recvfds(s, 1)[0]
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 161, in recvfds
    len(ancdata))
RuntimeError: received 0 items of ancdata
---------------------------------------------------------------------------
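The open-file limit that `ulimit -n` controls can also be inspected, and its soft value raised, from inside the training script with Python's standard `resource` module (Unix only). The following is a minimal sketch: the helper name `raise_open_file_limit` and the target of 4096 are illustrative assumptions, not part of AllenNLP or PyTorch, and changing the limit did not resolve the error in this case.

```python
import resource


def raise_open_file_limit(target=4096):
    """Raise the soft open-file limit towards `target` and return the soft limit in effect.

    The soft limit is never lowered and never raised above the hard limit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to do
    # The hard limit caps how far the soft limit may be raised.
    wanted = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if wanted <= soft:
        return soft  # current limit is already at least as high
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    except (ValueError, OSError):
        # Some platforms refuse the change; keep the existing limit.
        return soft
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

Calling `raise_open_file_limit()` once, before the iterator's worker processes are started, would apply the new limit to the current process and to the children it forks.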
I also tried achieving this through PyTorch's DataParallel, by wrapping the model's context_layer, mention_feedforward and antecedent_feedforward with a custom DataParallelWrapper (to provide compatibility with the class functions that AllenNLP assumes). Still, only 1 GPU was used and it eventually ran out of memory as before.
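from torch.nn import DataParallel

class DataParallelWrapper(DataParallel):
    def __init__(self, module):
        super().__init__(module)

    def get_output_dim(self):
        return self.module.get_output_dim()

    def get_input_dim(self):
        return self.module.get_input_dim()

    def forward(self, *inputs):
        # Note: this calls the wrapped module directly (with `inputs` passed as a
        # single tuple), so DataParallel's own scatter/gather across GPUs is bypassed.
        return self.module.forward(inputs)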
After some digging through the code, I found out that AllenNLP does this under the hood directly through its Trainer. The cuda_device argument can be either a single int (in the case of single-processing) or a list of ints (in the case of multi-processing):
cuda_device : Union[int, List[int]], optional (default = -1)
    An integer or list of integers specifying the CUDA device(s) to use. If -1, the CPU is used.
So all the GPU devices needed should be passed instead:
if torch.cuda.is_available():
    cuda_device = list(range(torch.cuda.device_count()))
    model = model.cuda(cuda_device[0])
else:
    cuda_device = -1
Note that the model still has to be moved to the GPU manually (via model.cuda(...)), as it would otherwise try to use multiple CPUs instead.
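The device selection above can be wrapped in a small helper. This is a minimal sketch: `select_cuda_device` and `primary_device` are hypothetical names, and the device count is passed in as a plain integer (in practice, the value of `torch.cuda.device_count()`) so that the logic can be checked on a machine without a GPU.

```python
from typing import List, Union


def select_cuda_device(device_count: int) -> Union[int, List[int]]:
    """Return the value to pass as the Trainer's `cuda_device` argument.

    -1 selects the CPU; otherwise every visible GPU index is listed.
    """
    if device_count < 0:
        raise ValueError("device_count must be non-negative")
    if device_count == 0:
        return -1
    return list(range(device_count))


def primary_device(cuda_device: Union[int, List[int]]) -> int:
    """Return the device the model itself should be moved to (-1 means CPU)."""
    if isinstance(cuda_device, list):
        return cuda_device[0]
    return cuda_device


# Device counts are hard-coded here for illustration; with PyTorch installed,
# pass torch.cuda.device_count() instead.
for count in (0, 1, 4):
    device = select_cuda_device(count)
    print(count, device, primary_device(device))
```

With these helpers, the model would be moved with `model.cuda(primary_device(cuda_device))` whenever the result is not -1, and the full `cuda_device` value would be handed to the Trainer.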