Generation Clock

A monotonically increasing number indicating the generation of the server.

04 August 2020

aka: Term, Epoch, and Generation

Problem

In Leader and Followers setup, there is a possibility of the leader being temporarily disconnected from the followers. There might be a garbage collection pause in the leader process, or a temporary network disruption which disconnects the leader from the follower. In this case the leader process is still running, and after the pause or the network disruption is over, it will try sending replication requests to the followers. This is dangerous, as meanwhile the rest of the cluster might have selected a new leader and accepted requests from the client. It is important for the rest of the cluster to detect any requests from the old leader. The old leader itself should also be able to detect that it was temporarily disconnected from the cluster and take necessary corrective action to step down from leadership.

Solution

Maintain a monotonically increasing number indicating the generation of the server. Every time a new leader election happens, it should be marked by increasing the generation. The generation needs to be available beyond a server reboot, so it is stored with every entry in the Write-Ahead Log. As discussed in High-Water Mark, followers use this information to find conflicting entries in their log.

At startup, the server reads the last known generation from the log.

class ReplicationModule…

  this.replicationState = new ReplicationState(config, wal.getLastLogEntryGeneration());

With Leader and Followers servers increment the generation every time there's a new leader election.

class ReplicationModule…

  private void startLeaderElection() {
      replicationState.setGeneration(replicationState.getGeneration() + 1);
      registerSelfVote();
      requestVoteFrom(followers);
  }

The servers send the generation to other servers as part of the vote requests. This way, after a successful leader election, all the servers have the same generation. Once the leader is elected, followers are told about the new generation

follower (class ReplicationModule...)

  private void becomeFollower(Long generation) {
      replicationState.setGeneration(generation);
      transitionTo(ServerRole.FOLLOWING);
  }

Thereafter, the leader includes the generation in each request it sends to the followers. It includes it in every HeartBeat message as well as the replication requests sent to followers.

Leader persists the generation along with every entry in its Write-Ahead Log

leader (class ReplicationModule...)

  Long appendToLocalLog(byte[] data) {
      var logEntryId = wal.getLastLogEntryId() + 1;
      var logEntry = new WALEntry(logEntryId, data, EntryType.DATA, replicationState.getGeneration());
      return wal.writeEntry(logEntry);
  }

This way, it is also persisted in the follower log as part of the replication mechanism of Leader and Followers

If a follower gets a message from a deposed leader, the follower can tell because its generation is too low. The follower then replies with a failure response.

follower (class ReplicationModule...)

  Long currentGeneration = replicationState.getGeneration();
  if (currentGeneration > replicationRequest.getGeneration()) {
      return new ReplicationResponse(FAILED, serverId(), currentGeneration, wal.getLastLogEntryId());
  }

When a leader gets such a failure response, it becomes a follower and expects communication from the new leader.

Old leader (class ReplicationModule...)

  if (!response.isSucceeded()) {
      stepDownIfHigherGenerationResponse(response);
      return;
  }

  private void stepDownIfHigherGenerationResponse(ReplicationResponse replicationResponse) {
      if (replicationResponse.getGeneration() > replicationState.getGeneration()) {
          becomeFollower(replicationResponse.getGeneration());
      }
  }

Consider the following example. In the three server cluster, leader1 is the existing leader. All the servers in the cluster have the generation as 1. Leader1 sends continuous heartbeats to the followers. Leader1 has a long garbage collection pause, for say 5 seconds. The followers did not get a heartbeat, and timeout to elect a new leader. The new leader increments the generation to 2. After the garbage collection pause is over, leader1 continues sending the requests to other servers. The followers and the new leader which are at generation 2, reject the request and send a failure response with generation 2. leader1 handles the failure response and steps down to be a follower, with generation updated to 2.

Figure 1: Generation

Examples

Raft

Raft uses the concept of a Term for marking the leader generation.

Zab

In Zookeeper, an epoch number is maintained as part of every transaction id. So every transaction persisted in Zookeeper has a generation marked by epoch.

Cassandra

In Cassandra each server stores a generation number which is incremented every time a server restarts. The generation information is persisted in the system keyspace and propagated as part of the gossip messages to other servers. The servers receiving the gossip message can then compare the generation value it knows about and the generation value in the gossip message. If the generation in the gossip message is higher, it knows that the server was restarted and then discards all the state it has maintained for that server and asks for the new state.

Epoch's in Kafka

In Kafka an epoch number is created and stored in Zookeeper every time a new Controller is elected for a kafka cluster. The epoch is included in every request that is sent from controller to other servers in the cluster. Another epoch called LeaderEpoch is maintained to know if the followers a partition are lagging behind in their High-Water Mark.

Significant Revisions