How to Save and Restore a TensorFlow Session Model as a Checkpoint
In this post, we are going to talk about checkpoints: what they are, how to save them, and how to restore them to resume training.
Checkpoints
A checkpoint saves the values of all the parameters (variables) used by a model. Checkpoints do not contain a description of the computation defined by the model; they are binary files in a proprietary format that map variable names to tensor values. The best way to examine the contents of a checkpoint is to load it using a Saver.
Saver
The tf.train.Saver class helps you save variables and restore them later. Savers can automatically number checkpoint filenames with a counter you provide through the global_step argument. This lets you keep multiple checkpoints from different steps while training a model; for example, you can number the checkpoint filenames with the training step number. To avoid filling up the disk, Savers also manage checkpoint files automatically: they can keep only the N most recent files, or one checkpoint for every N hours of training.
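Here is a minimal sketch of these options. The variable, directory, and step values are only illustrative, not part of the model used later in this post.

import os
import tensorflow as tf

weights = tf.Variable(tf.zeros([10]), name="weights")

checkpoint_dir = "./checkpoints"
os.makedirs(checkpoint_dir, exist_ok=True)

# Keep only the 5 most recent checkpoints, plus one every 2 hours of training.
saver = tf.train.Saver(max_to_keep=5, keep_checkpoint_every_n_hours=2)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(0, 300, 100):
        # ... run a training step here ...
        # global_step appends the step number to the filename,
        # e.g. model.ckpt-0, model.ckpt-100, model.ckpt-200
        saver.save(sess, os.path.join(checkpoint_dir, "model.ckpt"), global_step=step)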
Save
import collections
import tensorflow as tf

class CFModel(object):
  def __init__(self, embedding_vars, loss, metrics=None):
    self._embedding_vars = embedding_vars
    self._loss = loss
    self._metrics = metrics
    self._embeddings = {k: None for k in embedding_vars}
    self._session = None
    # Directory where checkpoints will be written; storeid is assumed to be
    # defined elsewhere in your application.
    self.checkpoint_dir = "./checkpoints" + storeid + "/"
    # The Saver captures the variables that exist in the graph at this point.
    self.saver = tf.train.Saver()

  def train(self, num_iterations, train_op, local_init_op):
    # train_op and local_init_op are assumed to be built when the loss and
    # optimizer are set up, and self._session is assumed to already be open.
    with self._session.as_default():
      local_init_op.run()
      iterations = []
      metrics = self._metrics or ({},)
      metrics_vals = [collections.defaultdict(list) for _ in metrics]

      # Train and append results.
      for i in range(num_iterations + 1):
        _, results = self._session.run((train_op, metrics))
        if (i % 10 == 0) or i == num_iterations:
          print("\r iteration %d: " % i + ", ".join(
              ["%s=%f" % (k, v) for r in results for k, v in r.items()]),
              end='')
          iterations.append(i)
          for metric_val, result in zip(metrics_vals, results):
            for k, v in result.items():
              metric_val[k].append(v)

      # Save the trained variables as a checkpoint.
      self.saver.save(self._session, self.checkpoint_dir + 'model.ckpt')
After running this code, you will find the saved checkpoint files in the directory you specified. Note that tf.train.Saver() must be created after the model's variables are defined, because it captures the variables that exist in the graph at construction time. Each checkpoint consists of three kinds of files (a quick way to list what a checkpoint contains is sketched after this list):
.data: Contains variable values
.meta: Contains graph structure
.index: Maps variable names to their locations in the .data files
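As a quick check on what a saved checkpoint actually contains, you can use the standard tf.train.list_variables utility; the directory name below is just an example.

import tensorflow as tf

# Prints each variable name stored in the checkpoint and its shape.
for name, shape in tf.train.list_variables("./checkpoints"):
    print(name, shape)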
Restore
Once the variables have been saved to disk, you can load them back into a session using saver.restore(). The function below looks up the latest checkpoint in a directory and restores it; call it with an active session object.
def _load_(self, sess, checkpoint_dir=None):
    if checkpoint_dir:
        self.checkpoint_dir = checkpoint_dir
    print("loading a session")
    # Look up the most recent checkpoint recorded in the directory.
    ckpt = tf.train.get_checkpoint_state(self.checkpoint_dir)
    if ckpt and ckpt.model_checkpoint_path:
        # Restore the saved variable values into the given session.
        self.saver.restore(sess, ckpt.model_checkpoint_path)
        # for i, var in enumerate(self.saver._var_list):
        #     print('Variables {}: {}'.format(i, var))
    else:
        print("no checkpoint found")
    return
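As a usage sketch, assuming model is a CFModel instance whose session is already open (as in the save example above), you would call it like this:

# Restore the latest checkpoint into the model's existing session.
model._load_(model._session)

# Or point it at a different checkpoint directory explicitly.
model._load_(model._session, checkpoint_dir="./checkpoints/")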
In addition, you can print the variables tracked by the Saver with the two lines below (the same ones commented out in the restore function above). Note that _var_list is a private attribute, so this is only useful for quick debugging.
for i, var in enumerate(self.saver._var_list):
    print('Variables {}: {}'.format(i, var))
After that, you can visualize the saved checkpoint through TensorBoard. Go to the directory that contains the checkpoints, open a terminal, and run this command:
tensorboard --logdir=checkpoints |
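TensorBoard reads event files rather than the checkpoint binaries themselves, so if nothing shows up you may first need to write the graph into the same directory. A minimal sketch, assuming the checkpoints directory used above:

import tensorflow as tf

with tf.Session() as sess:
    # Write the current graph into the checkpoints directory so TensorBoard
    # has an event file to display alongside the saved checkpoints.
    writer = tf.summary.FileWriter("./checkpoints", sess.graph)
    writer.close()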
I hope this blog helps you save and restore checkpoints in a session. Feel free to comment with any problems or suggestions. You can also follow me here for more blogs. Thanks for reading.