Model performance not improving during federated learning training
I have followed this emnist tutorial to create an image classification experiment (7 classes) with the aim of training a classifier on 3 silos of data with the TFF framework.
Before training begins, I convert the model to a tf keras model using tff.learning.assign_weights_to_keras_model(model, state.model) to evaluate on my validation set. Regardless of the label, the model predicts only one class. This is to be expected, as the model has not been trained yet. However, I repeat this step after each federated averaging round and the problem persists: all validation images are assigned to a single class. I also save the tf keras model weights after each round and make predictions on the test set, with no change.
Some of the steps I have taken to check the source of the issue:
Model details:
The model uses XceptionNet as the base model with the weights unfrozen. This performs well on the classification task when all the training images are pooled into a global dataset. Our aim is to achieve comparable performance with FL.
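from tensorflow.keras.applications import Xception
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

# pooling=None keeps the 4D feature map that GlobalAveragePooling2D expects.
base_model = Xception(include_top=False,
                      weights=weights,
                      pooling=None,
                      input_shape=input_shape)
x = base_model.output  # output of the Xception base
x = GlobalAveragePooling2D()( x )
predictions = Dense( num_classes, activation='softmax' )( x )
model = Model( base_model.input, outputs=predictions )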
Here is my training code:
def fit(self):
    """Train FL model"""
    # self.load_data()
    summary_writer = tf.summary.create_file_writer(
        self.logs_dir
    )
    federated_averaging = self._construct_iterative_process()
    state = federated_averaging.initialize()
    tfkeras_model = self._convert_to_tfkeras_model( state )
    print( np.argmax( tfkeras_model.predict( self.val_data ), axis=-1 ) )
    val_loss, val_acc = tfkeras_model.evaluate( self.val_data, steps=100 )
    with summary_writer.as_default():
        for round_num in tqdm( range( 1, self.num_rounds ), ascii=True, desc="FedAvg Rounds" ):
            print( "Beginning fed avg round..." )
            # Round of federated averaging
            state, metrics = federated_averaging.next(
                state,
                self.training_data
            )
            print( "Fed avg round complete" )
            # Saving logs
            for name, value in metrics._asdict().items():
                tf.summary.scalar(
                    name,
                    value,
                    step=round_num
                )
            print( "round {:2d}, metrics={}".format( round_num, metrics ) )
            tff.learning.assign_weights_to_keras_model(
                tfkeras_model,
                state.model
            )
            # tfkeras_model = self._convert_to_tfkeras_model(
            #     state
            # )
            val_metrics = {}
            val_metrics["val_loss"], val_metrics["val_acc"] = tfkeras_model.evaluate(
                self.val_data,
                steps=100
            )
            for name, metric in val_metrics.items():
                tf.summary.scalar(
                    name=name,
                    data=metric,
                    step=round_num
                )
            self._checkpoint_tfkeras_model(
                tfkeras_model,
                round_num,
                self.checkpoint_dir
            )

def _checkpoint_tfkeras_model(self,
                              model,
                              round_number,
                              checkpoint_dir):
    # Obtaining model dir path
    model_dir = os.path.join(
        checkpoint_dir,
        f'round_{round_number}',
    )
    # Creating directory
    pathlib.Path(
        model_dir
    ).mkdir(
        parents=True
    )
    model_path = os.path.join(
        model_dir,
        f'model_file_round{round_number}.h5'
    )
    # Saving model
    model.save(
        model_path
    )

def _convert_to_tfkeras_model(self, state):
    """Converts the global TFF model to a TF Keras model

    Takes the weights of the global model
    and pushes them back into a standard
    Keras model

    Args:
        state: The state of the FL server
            containing the model and
            optimization state

    Returns:
        (model): TF Keras model
    """
    model = self._load_tf_keras_model()
    model.compile(
        loss=self.loss,
        metrics=self.metrics
    )
    tff.learning.assign_weights_to_keras_model(
        model,
        state.model
    )
    return model

def _load_tf_keras_model(self):
    """Loads tf keras models

    Raises:
        KeyError: A model name was not defined
            correctly

    Returns:
        (model): TF keras model object
    """
    model = create_models(
        model_type=self.model_type,
        input_shape=[self.img_h, self.img_w, 3],
        freeze_base_weights=self.freeze_weights,
        num_classes=self.num_classes,
        compile_model=False
    )
    return model

def _define_model(self):
    """Model creation function"""
    model = self._load_tf_keras_model()
    tff_model = tff.learning.from_keras_model(
        model,
        dummy_batch=self.sample_batch,
        loss=self.loss,
        # Using self.metrics throws an error
        metrics=[tf.keras.metrics.CategoricalAccuracy()] )
    return tff_model

def _construct_iterative_process(self):
    """Constructing federated averaging process"""
    iterative_process = tff.learning.build_federated_averaging_process(
        self._define_model,
        client_optimizer_fn=lambda: tf.keras.optimizers.SGD( learning_rate=0.02 ),
        server_optimizer_fn=lambda: tf.keras.optimizers.SGD( learning_rate=1.0 ) )
    return iterative_process
- Increased the number of rounds to 5...
Running only a few rounds of federated learning sounds insufficient. One of the earliest Federated Averaging papers (McMahan 2016) required hundreds of rounds when the MNIST data had non-IID splits. More recently, (Reddi 2020) required thousands of rounds for CIFAR-100. One thing to note is that each "round" is one "step" of the global model. That step may be larger with more client epochs, but the client updates are averaged, and diverging clients may reduce the magnitude of the global step.
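To make the "one round is one global step" point concrete, here is a minimal sketch of a single averaging round on flat weight lists. It is not TFF's implementation: the function name `federated_average_step`, the example-count weighting, and the toy numbers are all assumptions chosen for illustration, with a server learning rate of 1.0 to mirror the SGD server optimizer in the question.

```python
def federated_average_step(global_weights, client_weights, client_sizes, server_lr=1.0):
    """One illustrative FedAvg round on flat weight lists.

    The global model moves by the example-weighted mean of the client deltas
    (client weights after local training minus the global weights).
    """
    total = sum(client_sizes)
    new_weights = []
    for i, w in enumerate(global_weights):
        mean_delta = sum(
            (cw[i] - w) * n for cw, n in zip(client_weights, client_sizes)
        ) / total
        new_weights.append(w + server_lr * mean_delta)
    return new_weights


# Hypothetical two-parameter model and two equally sized clients.
global_w = [0.0, 0.0]
agreeing = federated_average_step(global_w, [[1.0, 1.0], [1.0, 1.0]], [10, 10])
diverging = federated_average_step(global_w, [[1.0, 1.0], [-1.0, 1.0]], [10, 10])

print(agreeing)   # [1.0, 1.0]: both clients pull the same way
print(diverging)  # [0.0, 1.0]: opposite updates cancel in the first coordinate
```

However many local epochs the clients run, the global model is updated once per round, and clients that pull in opposite directions shrink that update.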
"I also save the tf keras model weights after each round and make predictions on the test set - no changes."
This can be concerning. It would be easier to debug if you could share the code used in the FL training loop.
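One way to narrow this down is to check whether the weights move at all between rounds. The helper below is a sketch under assumptions: `weight_change_norm` is a hypothetical name, and each snapshot is assumed to be a list of flattened tensors, for example obtained from Keras `model.get_weights()` with each array converted via `.ravel().tolist()`.

```python
import math


def weight_change_norm(old_weights, new_weights):
    """L2 norm of the difference between two weight snapshots.

    Each snapshot is a list of flat lists of floats, one per tensor.
    """
    if len(old_weights) != len(new_weights):
        raise ValueError("snapshots must contain the same number of tensors")
    squared = 0.0
    for old, new in zip(old_weights, new_weights):
        if len(old) != len(new):
            raise ValueError("matching tensors must have the same size")
        squared += sum((b - a) ** 2 for a, b in zip(old, new))
    return math.sqrt(squared)


# Hypothetical snapshots taken before and after a round.
before = [[0.0, 0.0], [1.0]]
unchanged = [[0.0, 0.0], [1.0]]
updated = [[3.0, 4.0], [1.0]]

print(weight_change_norm(before, unchanged))  # 0.0
print(weight_change_norm(before, updated))    # 5.0
```

A norm of exactly zero after a round would suggest the evaluated model never receives the updated weights, whereas a small non-zero norm would point towards too few rounds or too small a step.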
Not sure this is an answer; it is more of a related observation.
I've been trying to characterize the learning process (accuracy and loss) on the Federated Learning for Image Classification notebook tutorial with TFF.
I'm seeing major improvements in the speed of convergence by modifying the epoch hyperparameter, changing epochs from 5 to 10, 20, etc. But I'm also seeing a major increase in training accuracy. I suspect overfitting is occurring, though when I evaluate on the test set, accuracy is still high.
Wondering what is going on?
My understanding is that the epoch parameter controls the number of forward/backward passes on each client per round of training. Is this correct? So, for example, 10 rounds of training on 10 clients with 10 epochs would be 10 epochs x 10 clients x 10 rounds. I realise a larger range of clients is needed, etc., but I was expecting to see poorer accuracy on the test set.
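The arithmetic can be sketched as follows. This assumes every client holds the same number of examples and that the client dataset is repeated `epochs` times and then batched; the function names, the 100 examples per client, and the batch size of 20 are hypothetical values chosen for illustration.

```python
import math


def local_steps_per_round(examples_per_client, epochs, batch_size):
    """Gradient steps one client takes in one round.

    Assumes the client dataset is repeated `epochs` times and then batched.
    """
    return math.ceil(examples_per_client * epochs / batch_size)


def total_local_steps(rounds, clients_per_round, examples_per_client, epochs, batch_size):
    """Local gradient steps summed over all clients and all rounds."""
    per_client = local_steps_per_round(examples_per_client, epochs, batch_size)
    return rounds * clients_per_round * per_client


# Hypothetical setup: 10 rounds, 10 clients, 10 epochs, 100 examples each, batch size 20.
per_round = local_steps_per_round(examples_per_client=100, epochs=10, batch_size=20)
overall = total_local_steps(rounds=10, clients_per_round=10,
                            examples_per_client=100, epochs=10, batch_size=20)

print(per_round)  # 50 local steps per client per round
print(overall)    # 5000 local steps in total
```

Under these assumptions the clients take 5000 local steps in total, but the global model is still updated only 10 times, once per round.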
What can I do to see what's going on? Could I use the evaluation check with something like learning curves to see if overfitting is occurring?
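A learning-curve check could look like the sketch below: record training and test accuracy after every round and watch the gap between them. The helper name `generalization_gaps` and the accuracy values are made up for illustration and are not results from the tutorial.

```python
def generalization_gaps(train_accuracy, test_accuracy):
    """Per-round gap between training and test accuracy."""
    return [round(tr - te, 4) for tr, te in zip(train_accuracy, test_accuracy)]


# Hypothetical accuracies recorded after each of four rounds.
train_curve = [0.60, 0.80, 0.95, 0.99]
test_curve = [0.58, 0.75, 0.80, 0.78]

gaps = generalization_gaps(train_curve, test_curve)
for round_num, gap in enumerate(gaps, start=1):
    print(f"round {round_num}: train - test = {gap:.2f}")
```

A gap that keeps widening while test accuracy stalls or falls is the usual sign of overfitting; if both curves rise together, the extra epochs are probably just speeding up convergence.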
test_metrics = evaluation(state.model, federated_test_data) only appears to give a single data point. How can I get the individual test accuracy for each test example?
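One possible route to per-example results, sketched under assumptions, is to copy the global weights into a plain Keras model (as the question above does), call `predict` on the test examples, and compare each prediction with its label. The helper `per_example_correct` and the toy probabilities below are hypothetical stand-ins for those predictions.

```python
def per_example_correct(probabilities, labels):
    """Return 1 for each example whose argmax prediction matches its label, else 0."""
    predicted = [max(range(len(row)), key=row.__getitem__) for row in probabilities]
    return [int(p == y) for p, y in zip(predicted, labels)]


# Hypothetical class probabilities for three test examples and their integer labels.
example_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
example_labels = [0, 1, 1]

correct = per_example_correct(example_probs, example_labels)
print(correct)                      # [1, 1, 0]
print(sum(correct) / len(correct))  # aggregate accuracy over the three examples
```

The per-example list can then be grouped by client or by class to see where the model fails, which the single aggregate metric hides.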
Appreciate any thoughts you may have on the matter, Colin.