⚙️ Software✍️ Khoa📅 20/04/2026☕ 7 phút đọc

Distributed Systems: Fundamentals và Consensus Basics

1. Fundamentals — Tại sao Distributed Systems khó

1.1 The Eight Fallacies of Distributed Computing

Peter Deutsch (Sun Microsystems, 1994) — 8 giả định SAI khi lập trình distributed systems:

1. The network is reliable
   → Packets bị drop, connections timeout, NIC fail

2. Latency is zero
   → Cross-datacenter: ~100ms, cross-continent: ~200ms

3. Bandwidth is infinite
   → Saturated links, backpressure, throttling

4. The network is secure
   → MITM, eavesdropping, DDoS

5. Topology doesn't change
   → IPs thay đổi, nodes added/removed, DNS changes

6. There is one administrator
   → Multiple teams, cloud providers, third-party services

7. Transport cost is zero
   → Serialization, encryption, network cost

8. The network is homogeneous
   → Mix của TCP, UDP, HTTP/1.1, HTTP/2, gRPC...

1.2 Hai vấn đề cốt lõi

Problem 1 — Partial Failure:
  Trong single machine: fail → toàn bộ fail (rõ ràng)
  Trong distributed system: một số nodes fail, số khác chạy bình thường
  → Không biết remote node đang fail hay chỉ là network chậm!

  "Is the server down, or just the network between us?"
  → Impossible to distinguish without timeout

Problem 2 — Unreliable Timing:
  Mỗi machine có clock riêng
  Clock drift: hardware clock không chính xác tuyệt đối
  → 2 events xảy ra "cùng lúc" ở 2 nodes
  → Không có global "now"
  → Không thể dùng timestamp để order events

1.3 Network Partitions trong thực tế

Không phải chỉ "all nodes down":
  - Partial partition: A ↔ B OK, A ↔ C OK, nhưng B ↔ C FAIL
  - Asymmetric partition: A→B OK, nhưng B→A FAIL (firewalls!)
  - Slow network: không phải fail, nhưng quá chậm → timeout

Kết quả:
  - Node B nghĩ C đã down
  - Node C nghĩ B đã down
  - Cả hai cố gắng promote mình làm leader
  → Split-brain!

Real-world examples:
  AWS us-east-1 outage 2011: EBS + EC2 partition
  GitHub 2012: MySQL master/slave partition → dual masters
  Jepsen tests: nhiều "ACID" databases thực ra không ACID dưới partition

2. Failure Models

Crash-Stop (Fail-Stop):
  Node fail → dừng hoàn toàn, không gửi messages nữa
  Dễ detect (heartbeat timeout)
  Assumption của hầu hết consensus protocols

Crash-Recovery:
  Node fail → có thể restart sau đó
  Phải handle state recovery từ stable storage (disk)
  Raft, Paxos assume crash-recovery

Omission Failure:
  Node không gửi/nhận một số messages
  Send omission: gửi nhưng message không đến
  Receive omission: message đến nhưng không xử lý

Timing Failure:
  Node trả lời nhưng quá chậm
  Violates timing assumptions

Byzantine Failure (worst case):
  Node có thể gửi bất kỳ message nào (kể cả incorrect/malicious)
  Crash + lie + collude với nodes khác
  → Cần Byzantine Fault Tolerant (BFT) algorithms
  → PBFT, Tendermint (blockchain)
  → Cần ≥ 3f+1 nodes để tolerate f Byzantine nodes
  → Thực tế: hầu hết internal systems assume non-Byzantine

Fault Tolerance:
  Crash-Stop:  cần ≥ 2f+1 nodes để tolerate f failures (majority)
  Byzantine:   cần ≥ 3f+1 nodes để tolerate f failures

3. FLP Impossibility

Fischer, Lynch, Paterson (1985) — Bài báo nổi tiếng nhất CS lý thuyết:

Không có deterministic consensus algorithm nào có thể đảm bảo terminate trong asynchronous system với ngay cả 1 node có thể fail.

3.1 Giải thích trực quan

Asynchronous system = không có timing guarantees
  Không biết message bị delay hay node đã crash
  Không có timeout hợp lý (vì không có upper bound trên latency)

Consensus yêu cầu:
  Agreement:   tất cả non-faulty nodes quyết định cùng giá trị
  Validity:    giá trị được quyết định phải được propose bởi một node
  Termination: tất cả non-faulty nodes eventually quyết định

FLP: trong asynchronous system + 1 crash failure → không thể đảm bảo cả 3

Tại sao?
  Luôn tồn tại execution path mà system "bivalent" mãi mãi
  (có thể đi đến 0 hoặc 1) → không thể commit quyết định

  Nếu bị delay một message đủ lâu → giống như node crash
  → System không thể phân biệt → không thể proceed safely

3.2 Practical Implications

Paxos/Raft KHÔNG vi phạm FLP vì:
  Chúng dùng randomization hoặc partial synchrony
  → "Probabilistically terminate" không phải "always terminate"
  → Liveness (termination) là probabilistic, không guaranteed

  Raft: dùng randomized election timeout
  → Rất khó (nhưng không impossible) để không bao giờ elect leader

Thực tế: "Asynchronous" là worst case model
  Internet thực tế có upper bounds trên latency (partial synchrony)
  → Paxos/Raft hoạt động tốt trong thực tế

Lesson: Trade-off giữa Safety (correctness) và Liveness (progress)
  Raft/Paxos chọn Safety: thà không progress còn hơn sai
  → "It is better to be safe than live"

4. Paxos — Consensus gốc

Leslie Lamport (1989, published 1998). Foundation của hầu hết consensus protocols.

4.1 Roles trong Paxos

Proposer:  Đề xuất values, drive consensus
Acceptor:  Vote và accept proposals (quorum-based)
Learner:   Học kết quả cuối cùng để apply

Trong thực tế: mỗi node đóng cả 3 roles

4.2 Basic Paxos (Single-decree)

Chỉ đạt consensus về một value duy nhất.

Phase 1A — Prepare:
  Proposer chọn proposal number n (globally unique, monotonic)
  Proposer → tất cả Acceptors: Prepare(n)

Phase 1B — Promise:
  Acceptor nhận Prepare(n):
    IF n > highest_prepare_seen:
      highest_prepare_seen = n
      → Promise(n, accepted_value, accepted_n)
         (accepted_value = value đã accept trước đó, nếu có)
    ELSE:
      → Nack (ignore hoặc reject)

Phase 2A — Accept:
  Proposer nhận Promise từ majority (quorum):
    IF bất kỳ Promise nào có accepted_value:
      → value = accepted_value với accepted_n cao nhất
         (PHẢI dùng value này, không thể tự chọn!)
    ELSE:
      → value = proposer's own value (tự chọn)
    Proposer → tất cả Acceptors: Accept(n, value)

Phase 2B — Accepted:
  Acceptor nhận Accept(n, value):
    IF n >= highest_prepare_seen:
      accepted_n = n
      accepted_value = value
      → Accepted(n, value) → Proposers và Learners
    ELSE:
      → Nack

Consensus đạt được:
  Khi Learner nhận Accepted(n, value) từ majority
  → value đã được chosen!

4.3 Ví dụ cụ thể

3 nodes: A, B, C. Proposer P muốn propose value="X"

--- Phase 1: Prepare ---
P → A, B, C: Prepare(n=1)
A → P: Promise(n=1, accepted_value=nil)
B → P: Promise(n=1, accepted_value=nil)
(C không trả lời — down)

P nhận majority (A+B) → proceed

--- Phase 2: Accept ---
Không có prior accepted_value → P dùng "X"
P → A, B, C: Accept(n=1, value="X")
A → P: Accepted(n=1, value="X")
B → P: Accepted(n=1, value="X")

Consensus: value="X" ✅

--- Concurrent Proposer Q muốn propose "Y" ---
Q → A, B, C: Prepare(n=2)  (n=2 > n=1)
A → Q: Promise(n=2, accepted_value="X", accepted_n=1)
B → Q: Promise(n=2, accepted_value="X", accepted_n=1)

Q nhận accepted_value="X" → PHẢI propose "X" (không thể propose "Y"!)
→ Paxos đảm bảo safety: đã chosen "X" → không thể thay đổi

4.4 Multi-Paxos

Basic Paxos: consensus về 1 value → không đủ cho replicated log.

Multi-Paxos = nhiều instances của Paxos, mỗi instance = 1 log slot

Optimization: elect stable Leader → skip Phase 1 cho mọi instances

Normal operation với stable leader:
  Leader → Acceptors: Accept(slot=42, n=current_term, value="SET x=1")
  Acceptors → Leader: Accepted
  → Phase 1 chỉ cần 1 lần khi leader mới elected!
  → Sau đó: chỉ 1 RTT per log entry (Phase 2 only)

Slots:
  slot 1: "SET x=1"   (chosen)
  slot 2: "SET y=2"   (chosen)
  slot 3: "DEL x"     (in-progress)
  slot 4: "SET z=3"   (in-progress)

Gaps:
  Nếu slot 3 không được chosen → slot 4 có thể chosen trước
  → "Holes" trong log
  → Phải fill holes trước khi apply

4.5 Tại sao Paxos khó?

1. Multi-Paxos không fully specified (Lamport để nhiều thứ mở)
   → Mỗi implementation phải tự decide nhiều details

2. Leader election không được define
   → Phải build separately

3. Log compaction không được define

4. Membership changes không được define
   (thêm/bớt nodes trong cluster)

5. Liveness không guaranteed (Dueling proposers):
   Proposer 1: Prepare(n=1)
   Proposer 2: Prepare(n=2) → preempts P1
   Proposer 1: Prepare(n=3) → preempts P2
   → Cứ thế mãi, không ai commit được
   Giải pháp: chỉ cho phép 1 leader propose (Multi-Paxos)
              randomized backoff