⚙️ Software✍️ Khoa📅 20/04/2026☕ 9 phút đọc

Distributed Systems: Raft và Leader Election

5. Raft — Consensus dễ hiểu

Diego Ongaro & John Ousterhout (2014). "In Search of an Understandable Consensus Algorithm."

5.1 Thiết kế Philosophy

Raft được thiết kế để dễ hiểu, không chỉ đúng:
  Decompose consensus thành 3 sub-problems:
    1. Leader Election
    2. Log Replication
    3. Safety

  Strong leader: tất cả decisions đi qua leader
  → Simpler flow, dễ reason about
  → Trade-off: leader là bottleneck

5.2 Terms — Logical Clock của Raft

Term = logical time, monotonically increasing integer

  Term 1        Term 2     Term 3
  [Leader A]   [election]  [Leader B]
  ──────────────────────────────────────► time

  Mỗi term bắt đầu bằng election
  Nếu election thất bại (split vote) → new term ngay

  Term number được gửi trong mọi RPC
  Nếu node nhận RPC với term > own term:
    → Update own term
    → Chuyển về Follower state
  Nếu node nhận RPC với term < own term:
    → Reject RPC (stale message)

  Term phục vụ như "epoch" để detect stale leaders

5.3 Leader Election — Chi tiết

Node States:
  Follower   → nhận và forward requests
  Candidate  → đang cố gắng become leader
  Leader     → handle all client requests

State Machine:
  Follower ──(election timeout)──► Candidate ──(win majority)──► Leader
     ▲                                │                              │
     └──────────────────────────────── ┴──────────────────────────────┘
     (discover higher term)     (discover higher term / another leader)

Election Timeout:
  Mỗi Follower có randomized timeout: 150-300ms (Raft paper)
  Nếu không nhận heartbeat từ Leader trong timeout:
    → Increment term
    → Chuyển thành Candidate
    → Vote cho chính mình
    → Gửi RequestVote đến tất cả nodes

RequestVote RPC:
  Arguments:  term, candidateId, lastLogIndex, lastLogTerm
  Response:   term, voteGranted (true/false)

  Acceptor grant vote NẾU:
    1. candidate.term >= own.term (không vote cho outdated candidate)
    2. Chưa vote cho ai trong term này (hoặc đã vote cho candidate này)
    3. Candidate's log >= own log (Election Restriction!):
       candidate.lastLogTerm > own.lastLogTerm
       OR (candidate.lastLogTerm == own.lastLogTerm
           AND candidate.lastLogIndex >= own.lastLogIndex)
       → Đảm bảo leader có tất cả committed entries!

Split Vote:
  Nếu không ai đạt majority → timeout → new election (new term)
  Randomized timeout → hiếm khi split vote liên tục

5.4 Log Replication — Chi tiết

Log Entry structure:
  ┌─────────────────────────────────────────────┐
  │  index | term | command                     │
  │    1   |  1   | SET x=1                     │
  │    2   |  1   | SET y=2                     │
  │    3   |  2   | DEL x                       │
  │    4   |  2   | SET z=3 (uncommitted)        │
  └─────────────────────────────────────────────┘

Normal Operation (Leader L, Followers F1, F2):

Step 1 — Client request:
  Client → L: "SET x=5"
  L: append entry (index=5, term=2, cmd="SET x=5") to local log

Step 2 — AppendEntries RPC (heartbeat + log replication):
  L → F1, F2: AppendEntries {
    term: 2,
    leaderId: L,
    prevLogIndex: 4,      ← index của entry trước
    prevLogTerm: 2,       ← term của entry trước
    entries: [(5, 2, "SET x=5")],
    leaderCommit: 4       ← leader's commitIndex
  }

Step 3 — Followers append:
  F1 kiểm tra: prevLogIndex=4, prevLogTerm=2 match? ✅
  F1: append entry vào log
  F1 → L: AppendEntries Response { term: 2, success: true }

Step 4 — Commit:
  L nhận success từ majority (F1 đủ → 2/3 nodes)
  L: commitIndex = 5
  L: apply entry to state machine
  L → Client: OK

Step 5 — Followers commit:
  Next AppendEntries: leaderCommit = 5
  F1, F2: commitIndex = min(leaderCommit, last log index) = 5
  F1, F2: apply to state machine

5.5 Log Consistency — Raft Guarantee

Log Matching Property:
  Nếu 2 logs có entry với cùng index và term
  → Tất cả entries trước đó PHẢI giống nhau

Tại sao? AppendEntries consistency check:
  prevLogIndex, prevLogTerm phải match
  → Nếu match at index i → match at i-1, i-2, ...
  → Inductive proof

Phát hiện inconsistency:
  F nhận AppendEntries với prevLogIndex=4, prevLogTerm=2
  F's log tại index 4 có term=1 (khác!) → fail!
  → L decrements nextIndex[F] → thử prevLogIndex=3
  → Repeat cho đến khi match
  → Overwrite conflicting entries

  Leader's log là source of truth
  Followers phải match leader's log

5.6 Safety — Committed Entries Never Lost

Raft Safety Property:
  Nếu entry được committed tại term T
  → Tất cả future leaders PHẢI có entry đó

Proof bằng Election Restriction:
  Entry (index=5, term=2) committed = accepted bởi majority M

  New leader election:
  → Candidate phải nhận votes từ majority M'
  → M và M' có overlap (ít nhất 1 node chung)
  → Node chung có log với entry committed
  → Candidate không thể có log "less up-to-date"
     (Election Restriction từ chối vote)
  → Candidate PHẢI có entry committed ✅

Vì vậy:
  Leader có tất cả committed entries
  → Committed entries không bao giờ mất
  → Followers chỉ có thể bị overwrite bởi leader
     (bằng AppendEntries consistency check)

5.7 Membership Changes — Joint Consensus

Vấn đề: Thêm node vào cluster trong khi đang chạy

Naive approach: thay đổi config ngay → NGUY HIỂM!
  Cluster cũ: {A, B, C} — majority = 2
  Cluster mới: {A, B, C, D, E} — majority = 3

  Trong transition:
    A, B có new config: majority(new) = D+E needed
    C vẫn có old config: majority(old) = just A or B
    → A và C có thể cùng elect leaders!
    → SPLIT BRAIN!

Joint Consensus (Raft paper approach):
  Phase 1: Commit C_old,new (union của cả hai configs)
    → Mọi decisions cần majority của CŨNG C_old VÀ C_new
    → Không thể split brain: cả hai majorities cần agree

  Phase 2: Commit C_new
    → Cluster fully transitioned

Single-server changes (simpler, widely used):
  Chỉ thêm/bớt 1 node mỗi lần
  → Không thể split thành 2 equal halves
  → Safe without joint consensus
  → etcd, CockroachDB dùng approach này

5.8 Log Compaction — Snapshots

Vấn đề: Log grow mãi → không thể replay từ đầu mỗi khi restart

Snapshot:
  Tại một commitIndex → serialize entire state machine
  Lưu snapshot + (lastIncludedIndex, lastIncludedTerm)
  Xóa log entries trước lastIncludedIndex

  ┌──────────────────────────────────────────────────┐
  │ Snapshot (state at index 100)                    │
  │ lastIncludedIndex=100, lastIncludedTerm=3         │
  │ state: {x:5, y:10, z:3, ...}                     │
  └──────────────────────────────────────────────────┘
  [101: SET a=1][102: SET b=2][103: DEL z]  ← remaining log

InstallSnapshot RPC:
  Leader gửi snapshot đến slow follower
  (follower quá chậm, leader đã discard log entries follower cần)
  Follower: replace entire state với snapshot

6. So sánh Paxos vs Raft

                      Paxos (Multi-Paxos)      Raft
─────────────────────────────────────────────────────────────────
Thiết kế mục tiêu     Correctness              Understandability
Leader                Implicit (any proposer)  Explicit strong leader
Log holes             Có thể (out-of-order)    Không (sequential)
Membership change     Không specify            Joint consensus / single-server
Log compaction        Không specify            Snapshots (built-in)
Term/Epoch            Ballot number            Term (explicit)
Election              Không specify riêng      Randomized timeout
Complexity            Cao (nhiều variants)     Thấp hơn
Implementations       Chubby (Google), Zab     etcd, CockroachDB, TiKV, Consul
Throughput            Có thể cao hơn           Slightly lower (strong leader)

Multi-Paxos variants:
  Classic Paxos      (Lamport)
  Fast Paxos         (skip phase 1 in some cases)
  Cheap Paxos        (k+1 nodes để tolerate k failures, k active)
  Byzantine Paxos    (tolerate Byzantine failures)
  EPaxos             (Egalitarian Paxos — leaderless, commutative commands)

Raft variants:
  Standard Raft      (Ongaro)
  Pre-vote           (ngăn disruptive elections)
  Leader stickiness  (prefer current leader)
  MultiRaft          (nhiều Raft groups — TiKV)

7. Leader Election

7.1 Properties cần đảm bảo

Safety (quan trọng hơn):
  Tại bất kỳ thời điểm nào → tối đa 1 leader trong cluster
  (có thể 0 leader trong khi đang elect)
  Vi phạm → SPLIT BRAIN → data corruption!

Liveness:
  Cuối cùng phải elect được 1 leader
  (FLP impossibility → chỉ guarantee probabilistically)

7.2 Bully Algorithm

Mỗi node có unique ID. Node ID lớn nhất = leader.

Khi node P phát hiện leader down:
  P gửi ELECTION message đến tất cả nodes có ID > P

  Nếu không ai trả lời:
    P tự declare làm leader
    P gửi COORDINATOR message đến tất cả nodes có ID < P

  Nếu node Q (ID > P) trả lời:
    Q tiếp quản election
    P dừng (đã có node "mạnh hơn" handle)

Ví dụ: nodes {1, 2, 3, 4, 5}, node 5 (leader) crash

Node 3 detect leader down:
  3 → ELECTION → {4, 5}
  4 → ELECTION → {5}
  4: 5 không trả lời → 4 wins
  4 → COORDINATOR → {1, 2, 3}
  4 trở thành leader

Vấn đề:
  O(n²) messages
  Không phù hợp với large clusters
  "Bully" vì node lớn hơn luôn thắng

7.3 Ring Election (Chang-Roberts Algorithm)

Nodes được tổ chức thành logical ring
Mỗi node chỉ communicate với neighbor

Khi election bắt đầu:
  Node gửi ELECTION(id) theo chiều kim đồng hồ

  Nhận ELECTION(id):
    IF id > own_id: forward ELECTION(id) (node mạnh hơn đang compete)
    IF id < own_id: replace với ELECTION(own_id) (tôi mạnh hơn)
    IF id == own_id: TÔI LÀ LEADER! gửi COORDINATOR

O(n) messages average, O(n²) worst case

7.4 ZooKeeper Leader Election

ZooKeeper cung cấp primitives để build leader election:

Approach 1 — Ephemeral Node:
  Tất cả candidates cố gắng CREATE /election/leader (ephemeral, sequential=false)
  Node tạo thành công → LEADER
  Nodes khác WATCH /election/leader
  Khi leader crash → /election/leader bị delete (ephemeral)
  → Watch triggered → tất cả retry create → new leader elected

  Vấn đề: Herd effect — N-1 nodes wake up cùng lúc → thundering herd

Approach 2 — Sequential Ephemeral (recommended):
  Mỗi candidate CREATE /election/node- (ephemeral, SEQUENTIAL)
  → ZooKeeper assign: /election/node-0000000001, -0000000002, ...

  Node có số nhỏ nhất → LEADER
  Node không phải leader → WATCH node ngay trước mình
  (không watch leader trực tiếp → tránh herd effect)

  Nếu /election/node-0000000001 (leader) die:
    → Chỉ node-0000000002 nhận notification
    → node-0000000002 trở thành leader (smallest)
    → O(1) notifications thay vì O(n)

7.5 Epoch-based Election

Dùng trong Raft, Kafka, Elasticsearch:

Concept:
  Mỗi "era" của leadership = epoch (hay term, generation)
  Epoch là monotonically increasing integer
  Leader mới = epoch mới

  Nodes reject messages từ epoch cũ:
    "Tôi đang ở epoch 5, sao anh gửi message epoch 3?"
    → Reject (stale leader)

Kafka Controller Election:
  ZooKeeper /controller: ephemeral node với current controller ID
  Candidate race to create /controller
  Winner → controller cho epoch này
  KRaft mode (Kafka 3.x): ZooKeeper-free, dùng Raft built-in

Elasticsearch Master Election:
  Minimum master nodes: phải có ≥ ceil(N/2)+1 eligible masters
  Prevent split-brain: minority partition không thể elect master
  Voting: dựa trên node với highest cluster state version