How to Save and Restore a TensorFlow Session Model as a Checkpoint
In this post, we are going to talk about checkpoints: what they are, how to save them, and how to restore them to resume training.
Checkpoints
A checkpoint saves the values of all the parameters (variables) used by a model. Checkpoints do not contain a description of the computation defined by the model; they are binary files in a proprietary format that map variable names to tensor values. The best way to examine the contents of a checkpoint is to load it using a Saver.
Saver
The tf.train.Saver class helps you save variables and restore them later. Savers can automatically number checkpoint filenames with a counter you provide through the global_step argument. This lets you keep multiple checkpoints from different steps while training a model; for example, you can number the checkpoint filenames with the training step number. To avoid filling up the disk, Savers also manage checkpoint files automatically: they can keep only the N most recent files, or one checkpoint for every N hours of training.
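Here is a minimal sketch of these options. The variable, directory, and step values are only illustrative, not part of the model used later in this post.

import os
import tensorflow as tf

weights = tf.Variable(tf.zeros([10]), name="weights")

checkpoint_dir = "./checkpoints"
os.makedirs(checkpoint_dir, exist_ok=True)

# Keep only the 5 most recent checkpoints, plus one every 2 hours of training.
saver = tf.train.Saver(max_to_keep=5, keep_checkpoint_every_n_hours=2)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(0, 300, 100):
        # ... run a training step here ...
        # global_step appends the step number to the filename,
        # e.g. model.ckpt-0, model.ckpt-100, model.ckpt-200
        saver.save(sess, os.path.join(checkpoint_dir, "model.ckpt"), global_step=step)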
Save
import collections
import tensorflow as tf

class CFModel(object):
  def __init__(self, embedding_vars, loss, metrics=None):
    self._embedding_vars = embedding_vars
    self._loss = loss
    self._metrics = metrics
    self._embeddings = {k: None for k in embedding_vars}
    self._session = None
    # Directory where checkpoints will be written; storeid is assumed to be
    # defined elsewhere in your application.
    self.checkpoint_dir = "./checkpoints" + storeid + "/"
    # The Saver captures the variables that exist in the graph at this point.
    self.saver = tf.train.Saver()

  def train(self, num_iterations, train_op, local_init_op):
    # train_op and local_init_op are assumed to be built when the loss and
    # optimizer are set up, and self._session is assumed to already be open.
    with self._session.as_default():
      local_init_op.run()
      iterations = []
      metrics = self._metrics or ({},)
      metrics_vals = [collections.defaultdict(list) for _ in metrics]

      # Train and append results.
      for i in range(num_iterations + 1):
        _, results = self._session.run((train_op, metrics))
        if (i % 10 == 0) or i == num_iterations:
          print("\r iteration %d: " % i + ", ".join(
              ["%s=%f" % (k, v) for r in results for k, v in r.items()]),
              end='')
          iterations.append(i)
          for metric_val, result in zip(metrics_vals, results):
            for k, v in result.items():
              metric_val[k].append(v)

      # Save the trained variables as a checkpoint.
      self.saver.save(self._session, self.checkpoint_dir + 'model.ckpt')
After running this code, you will find the saved checkpoint files in the directory you specified. Note that tf.train.Saver() must be created after the model's variables are defined, because it captures the variables that exist in the graph at construction time. Each checkpoint consists of three kinds of files (a quick way to list what a checkpoint contains is sketched after this list):
.data: Contains variable values
.meta: Contains graph structure
.index: Maps variable names to their locations in the .data files
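As a quick check on what a saved checkpoint actually contains, you can use the standard tf.train.list_variables utility; the directory name below is just an example.

import tensorflow as tf

# Prints each variable name stored in the checkpoint and its shape.
for name, shape in tf.train.list_variables("./checkpoints"):
    print(name, shape)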
Restore
Once the variables have been saved to disk, you can load them back into a session using saver.restore(). The function below looks up the latest checkpoint in a directory and restores it; call it with an active session object.
def _load_(self, sess, checkpoint_dir=None):
    if checkpoint_dir:
        self.checkpoint_dir = checkpoint_dir
    print("loading a session")
    # Look up the most recent checkpoint recorded in the directory.
    ckpt = tf.train.get_checkpoint_state(self.checkpoint_dir)
    if ckpt and ckpt.model_checkpoint_path:
        # Restore the saved variable values into the given session.
        self.saver.restore(sess, ckpt.model_checkpoint_path)
        # for i, var in enumerate(self.saver._var_list):
        #     print('Variables {}: {}'.format(i, var))
    else:
        print("no checkpoint found")
    return
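As a usage sketch, assuming model is a CFModel instance whose session is already open (as in the save example above), you would call it like this:

# Restore the latest checkpoint into the model's existing session.
model._load_(model._session)

# Or point it at a different checkpoint directory explicitly.
model._load_(model._session, checkpoint_dir="./checkpoints/")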
In addition, you can print the variables tracked by the Saver with the two lines below (the same ones commented out in the restore function above). Note that _var_list is a private attribute, so this is only useful for quick debugging.
for i, var in enumerate(self.saver._var_list):
    print('Variables {}: {}'.format(i, var))
After that, you can visualize the saved checkpoint through TensorBoard. Go to the directory that contains the checkpoints, open a terminal, and run this command:
tensorboard --logdir=checkpoints |
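TensorBoard reads event files rather than the checkpoint binaries themselves, so if nothing shows up you may first need to write the graph into the same directory. A minimal sketch, assuming the checkpoints directory used above:

import tensorflow as tf

with tf.Session() as sess:
    # Write the current graph into the checkpoints directory so TensorBoard
    # has an event file to display alongside the saved checkpoints.
    writer = tf.summary.FileWriter("./checkpoints", sess.graph)
    writer.close()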
I hope this blog helps you save and restore checkpoints in a session. Feel free to comment with any problems or suggestions. You can also follow me here for more blogs. Thanks for reading.