⚙️ Software · ✍️ Khoa · 📅 20/04/2026 · ☕ 12 min read

Software Architecture: Interview & Big Picture

Introduction

This part collects the big-picture questions that come up most often in system design interviews and in the tech lead role, along with a framework for answering each one.

Focus: not reviewing theory, but thinking like an architect: weigh the trade-offs, make a decision, and document the reasoning.


Question 1: Monolith vs Microservices

Context: "Should we build the system as a monolith or as microservices?"

When to Choose Monolith ✅

Conditions:

  • Team < 5 people
  • Domain < 3 major contexts
  • Deploy frequency: weekly or less often
  • Low latency requirement
  • Single database transaction OK

Advantages:

  • ✓ Simple to develop, debug, deploy
  • ✓ Single deployment unit → less operational overhead
  • ✓ Easier to ensure ACID consistency
  • ✓ Better performance (no network hop)

Disadvantages:

  • ✗ Tight coupling as codebase grows
  • ✗ Hard to scale individual modules
  • ✗ Locked into one tech stack (a team that wants a different framework faces a costly migration)
  • ✗ One bug can take the whole system down

Example: an MVP, or a startup in its first 6 months → monolith is the right call.

Monolith
┌──────────────────────────────┐
│ Order Service                │
│ Payment Service              │
│ Inventory Service            │
│ User Service                 │
│                              │
│ Shared: Database, Cache      │
└──────────────────────────────┘

When to Choose Microservices 🎯

Conditions:

  • Team >= 5 (ideally, 2-pizza teams per service)
  • Domain complexity >= 3-4 bounded contexts
  • Different scaling needs per service (Payment hot, Reporting cold)
  • Decoupled deployment + independent release cadence
  • Multiple tech stacks needed

Advantages:

  • ✓ Independent scaling, deployment
  • ✓ Tech diversity: use Go for Order, Python for ML, Node for API
  • ✓ Loose coupling: team autonomy
  • ✓ Resilience: one service down ≠ whole system down

Disadvantages:

  • ✗ Distributed system complexity: network calls, eventual consistency
  • ✗ Operational overhead: service discovery, logging, monitoring
  • ✗ Data consistency harder (no single transaction)
  • ✗ Need strong team culture (else: chaos)

Example: an organization at Uber's scale, where many teams need to ship and scale their services independently.

Microservices
┌─────────────────┐    ┌─────────────────┐
│ Order Service   │    │ Payment Service │
└─────────────────┘    └─────────────────┘
        ↓                      ↓
   Order DB              Payment DB
        ↓                      ↓
    [Event Bus / Message Queue]
        ↓
┌───────────────────┐    ┌───────────────────┐
│ Inventory Service │    │ Reporting Service │
└───────────────────┘    └───────────────────┘

The Monolith-to-Microservices Journey

Red flags to consider breaking to microservices:

  1. Deploy time > 20 min
  2. Different scaling needs: some services need 100 replicas, others 1
  3. Team friction: teams stepping on each other's toes
  4. Business need: time to market matters more than simplicity

Wrong approach: "We're scaling, so let's break into microservices now" → Often results in distributed monolith (worst of both worlds)

Right approach: Stay monolith as long as possible. When pain is obvious, break strategically:

  1. Identify candidate: Service with different scaling needs or separate team
  2. Create seam: Extract to library first, then separate service
  3. Async first: Use event-driven for inter-service comms
// Phase 1: Monolith with clear boundary
myapp/
├── order/          // Order context
├── payment/        // Payment context (candidate for extraction)
└── shared/

// Phase 2: Extract payment as library
payment-lib/       // Shared library
myapp/
├── order/
└── payment-client/ // Consumes payment-lib

// Phase 3: Separate service
payment-service/   // Independent service
myapp/
├── order/
└── payment-client/ // Makes gRPC calls to payment-service

// Note: a monolith rarely splits this cleanly. Expect the extraction to be messy.
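
A minimal Go sketch of the "create seam" step, under stated assumptions: `PaymentClient`, `inProcessPayment`, and `PlaceOrder` are hypothetical names, not part of any real codebase. The idea it illustrates is that order code depends only on an interface, so the in-process implementation from Phase 1 and 2 could later be replaced by a client that makes network calls.

```go
package main

import (
	"errors"
	"fmt"
)

// PaymentClient is the seam: order code depends only on this interface.
// (Hypothetical name and signature, for illustration.)
type PaymentClient interface {
	Charge(orderID string, amountCents int64) error
}

// inProcessPayment is the Phase 1/2 implementation: a plain function call
// inside the monolith.
type inProcessPayment struct {
	charged map[string]int64
}

func (p *inProcessPayment) Charge(orderID string, amountCents int64) error {
	if amountCents <= 0 {
		return errors.New("amount must be positive")
	}
	p.charged[orderID] = amountCents
	return nil
}

// PlaceOrder never learns which implementation it talks to, so Phase 3 can
// swap in a network-backed client without touching this function.
func PlaceOrder(pc PaymentClient, orderID string, amountCents int64) string {
	if err := pc.Charge(orderID, amountCents); err != nil {
		return "FAILED"
	}
	return "CONFIRMED"
}

func main() {
	pc := &inProcessPayment{charged: map[string]int64{}}
	fmt.Println(PlaceOrder(pc, "order123", 10000))
	fmt.Println(PlaceOrder(pc, "order124", 0))
}
```

A network-backed implementation would also need timeouts and retries, which is where most of the "messy" part tends to show up.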

Question 2: How to Evaluate Architecture?

Context: "Is our current architecture good?"

Evaluation Framework

1. Scalability

Question: Can we handle 10x traffic without rewrite?
---
Monolith: Horizontal scale is coarse (must replicate the whole thing, even if only one module is hot)
Microservices: Can scale hot services (Payment) independently

Metric: Response time / CPU under peak load
Red flag: CPU 100%, response time degrades non-linearly

2. Operational Complexity

Question: How hard is it to deploy, monitor, recover from failure?
---
Monolith: Easy (one process, one database)
Microservices: Hard (N databases, N services, N failure modes)

Metric: Mean time to recovery (MTTR)
Red flag: Average incident = 2 hours to resolve

3. Team Productivity

Question: Can each team work independently?
---
Monolith: Bottleneck (all teams in same repo)
Microservices: Autonomy (team owns service)

Metric: Deployment frequency, deployment size
Red flag: Can only deploy Friday afternoon, deploys affect multiple teams

4. Cost

Question: How much $ to run this system?
---
Monolith: 1 DB, 3 instances, 1 cache = low cost
Microservices: 5 DBs, 15 instances, 5 caches, load balancers, service mesh = often several times the cost

Metric: Infrastructure cost per request / per user
Red flag: Cost grows faster than revenue

5. Technology Flexibility

Question: Can we adopt new tech without rewrite?
---
Monolith: Stuck with first framework choice
Microservices: Each service can be different

Metric: Can we experiment with new language/framework?
Red flag: Forced to use a 10-year-old framework because the system is too coupled

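Scoring (illustrative; 10 = best, so a high score on Operational Complexity or Cost means simple or cheap to run):

| Dimension              | Monolith | Microservices |
|------------------------|----------|---------------|
| Scalability            | 3/10     | 9/10          |
| Operational Complexity | 9/10     | 3/10          |
| Team Productivity      | 5/10     | 9/10          |
| Cost                   | 9/10     | 3/10          |
| Tech Flexibility       | 2/10     | 9/10          |

Total (unweighted): Monolith = 28/50, Microservices = 33/50 → The totals are close, so choose based on your priorities.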
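
One way to make "based on your priorities" concrete is to weight each dimension before summing. The sketch below reuses the scores from the table; the weights are hypothetical (a small startup that cares mostly about operations and cost) and `WeightedScore` is an illustrative helper, not an established method.

```go
package main

import "fmt"

// Dimension order: Scalability, Operational Complexity, Team Productivity,
// Cost, Tech Flexibility. Scores are the ones from the table (10 = best).
var (
	monolith      = []float64{3, 9, 5, 9, 2}
	microservices = []float64{9, 3, 9, 3, 9}
)

// WeightedScore multiplies each score by its weight and sums the result.
// Weights are expected to sum to 1, so the result stays on the 0-10 scale.
func WeightedScore(scores, weights []float64) float64 {
	total := 0.0
	for i, s := range scores {
		total += s * weights[i]
	}
	return total
}

func main() {
	// Hypothetical weights for a small startup: operations and cost dominate.
	startup := []float64{0.1, 0.35, 0.1, 0.35, 0.1}
	fmt.Printf("monolith=%.2f microservices=%.2f\n",
		WeightedScore(monolith, startup), WeightedScore(microservices, startup))
}
```

With these weights the monolith scores about 7.3 against 4.8, reversing the unweighted result. A team that weights scalability and team productivity most heavily would see the opposite.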


Question 3: Architecture Decision Records (ADR)

Context: "Why did we choose X? Can we change it later?"

What is ADR?

ADR is a simple template to document architectural decision:

# ADR-001: Monolith for MVP

## Status
Accepted

## Context
- Team: 3 engineers
- Timeline: 6 months to MVP
- Budget: Limited

## Decision
We will start with monolithic architecture in Go.

## Rationale
1. Simplicity: Monolith faster to develop, deploy, debug
2. Team size: <5 devs, no need for independent scaling yet
3. Consistency: ACID transactions important for order correctness
4. Operational: Single deployment, single database

## Consequences
- Good: Developer productivity high, operations simple
- Bad: Scaling limited to horizontal (add instances), will need refactor later
- Ugly: If domain grows, may become tightly coupled

## Alternatives considered
- Microservices: Overkill at this stage, high operational overhead
- Serverless: Good for certain functions, but complex for persistent data

## Revisit date
2025-Q2 (when traffic > 1000 req/sec or team > 5)

Benefits:

  • Document "why" for future self
  • Justify decision to stakeholders
  • Clear revisit trigger
  • Team alignment

Store ADRs:

  • Repo: docs/adr/ folder
  • Format: Markdown
  • Naming: ADR-001-monolith.md, ADR-002-postgres-vs-mongodb.md

Question 4: Common Architecture Interview Questions

Q: "Design a high-scale notification system"

Framework to answer:

1. Clarify requirements

User → Ask: Peak QPS? Delivery latency? Reliability requirement?
Assume:
- 10M notifications / day (about 116/sec on average)
- 100K QPS peak (very bursty, e.g. a broadcast push to all users at once)
- < 5 sec delivery
- 99.9% reliability
- Channels: Email, SMS, Push

2. Identify bottlenecks

Naive: Service receives notification request → immediately send
Problem: If Email API is slow, blocks other notifications

Solution: Async queue

3. Design flow

Request
  ↓
Validation
  ↓
Enqueue (to RabbitMQ / Kafka)
  ↓ (async)
  Workers (Email, SMS, Push)
  ↓
Retry logic (if fails)
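
The flow above can be sketched in Go with a buffered channel standing in for the broker. This is a simplified, in-process illustration: `Notification`, `Sender`, and `RunWorkers` are hypothetical names, and a real system would consume from RabbitMQ or Kafka rather than a channel.

```go
package main

import (
	"fmt"
	"sync"
)

type Notification struct {
	ID      string
	Channel string // "email", "sms", or "push"
}

// Sender delivers one notification; real code would call a provider API here.
type Sender func(n Notification) error

// RunWorkers drains the queue with `workers` goroutines and returns how many
// notifications were delivered successfully.
func RunWorkers(queue <-chan Notification, workers int, send Sender) int {
	var (
		wg        sync.WaitGroup
		mu        sync.Mutex
		delivered int
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := range queue {
				if err := send(n); err == nil {
					mu.Lock()
					delivered++
					mu.Unlock()
				}
			}
		}()
	}
	wg.Wait()
	return delivered
}

func main() {
	queue := make(chan Notification, 100) // stands in for RabbitMQ / Kafka
	for i := 0; i < 10; i++ {
		queue <- Notification{ID: fmt.Sprintf("n%d", i), Channel: "email"}
	}
	close(queue)
	n := RunWorkers(queue, 3, func(Notification) error { return nil })
	fmt.Println("delivered:", n)
}
```

Because the request path only enqueues, a slow email provider slows down its own workers instead of blocking SMS and push.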

4. Handle edge cases

- Notification failed: retry (exponential backoff)
- Duplicate: idempotency key (notification ID), so a repeated send is ignored
- Rate limit: token bucket
- Persistence: store in DB before sending (for replay)

5. Scale

- Multi-region: route each send through a region-local queue
- Partitioning: by user ID
- Circuit breaker: if Email API down, fail fast

Q: "Design payment processing system"

Framework:

1. Core flow

User → Select payment method → Charge → Verify → Confirm
       ↓           ↓              ↓        ↓        ↓
     Validate   Store method  API call   Check    DB save
               temporarily

2. Consistency

Must ensure: Order confirmed ⟺ Payment captured
Solution: Saga pattern or 2-phase commit

Saga (recommended):
  1. Order service: create order (PENDING)
  2. Payment service: charge card (via async event)
  3. Order service: if success, change to CONFIRMED; else FAILED
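  4. Compensating transaction: undo completed steps when a later one fails (charge fails → release the order; confirmation fails after a successful charge → refund)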

3. Idempotency

If the network fails mid-charge, the client cannot tell whether the charge went through, so it retries.
The retry must not double-charge.

Solution: Idempotency key (orderID as key)
- First attempt: charge($100, key=order123) → success
- Retry: charge($100, key=order123) → returns same result (idempotent)
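
A minimal sketch of the server side of this idea: the gateway stores each result under its idempotency key and replays it on retry. `Gateway` and `ChargeResult` are hypothetical in-memory types; a real gateway would persist the keys and expire them after some period.

```go
package main

import (
	"fmt"
	"sync"
)

type ChargeResult struct {
	ChargeID    string
	AmountCents int64
}

// Gateway is a hypothetical in-memory payment gateway that stores the result
// of each charge under its idempotency key.
type Gateway struct {
	mu      sync.Mutex
	results map[string]ChargeResult
	calls   int // number of times money actually moved
}

func NewGateway() *Gateway { return &Gateway{results: map[string]ChargeResult{}} }

// Charge performs the charge once per key; a retry with the same key returns
// the stored result instead of charging again.
func (g *Gateway) Charge(key string, amountCents int64) ChargeResult {
	g.mu.Lock()
	defer g.mu.Unlock()
	if r, ok := g.results[key]; ok {
		return r
	}
	g.calls++
	r := ChargeResult{ChargeID: fmt.Sprintf("ch_%d", g.calls), AmountCents: amountCents}
	g.results[key] = r
	return r
}

func main() {
	g := NewGateway()
	first := g.Charge("order123", 10000)
	retry := g.Charge("order123", 10000)
	fmt.Println(first == retry, g.calls)
}
```

The lock matters: without it, two concurrent retries could both miss the lookup and both charge.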

4. Security

- Never store raw card details yourself (it pulls your systems into PCI DSS scope)
- Tokenize: Card → Payment Gateway → Token
- Pass token in subsequent requests

Q: "Design a collaborative document editor (like Google Docs)"

Framework:

1. Core: Real-time sync

User A types "hello"  → Broadcast to User B (< 100ms)
User B types "world"  → Broadcast to User A

Challenge: Concurrent edits

2. Handling concurrency

Without coordination: User A changes position 0-5 while User B changes position 0-3
→ Conflict! Who wins?

Solution: Operational Transform (OT) or CRDT (Conflict-free Replicated Data Type)

CRDT (simpler to understand):
- Each character gets a unique ID (user_id + lamport_clock)
- Insert: character("h", id=user1.001)
- On sync, every client orders concurrent inserts by ID, so all replicas end up with the same text
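
The convergence property can be shown with a deliberately simplified sketch. It only demonstrates that replicas holding the same set of IDs render the same text; it is not a full sequence CRDT (algorithms such as RGA also track where each character was inserted). `ID`, `Char`, and `Replica` are hypothetical types for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// ID is a Lamport timestamp plus the user that created the character.
type ID struct {
	Clock int
	User  string
}

// Less defines a total order: lower clock first, user ID breaks ties.
func (a ID) Less(b ID) bool {
	if a.Clock != b.Clock {
		return a.Clock < b.Clock
	}
	return a.User < b.User
}

type Char struct {
	ID    ID
	Value rune
}

// Replica holds the set of characters a client has seen, keyed by ID.
type Replica map[ID]Char

// Merge takes the union of both sets. Union is commutative, associative and
// idempotent, which is what makes replicas converge.
func (r Replica) Merge(other Replica) {
	for id, c := range other {
		r[id] = c
	}
}

// Text renders the characters in ID order.
func (r Replica) Text() string {
	chars := make([]Char, 0, len(r))
	for _, c := range r {
		chars = append(chars, c)
	}
	sort.Slice(chars, func(i, j int) bool { return chars[i].ID.Less(chars[j].ID) })
	out := make([]rune, len(chars))
	for i, c := range chars {
		out[i] = c.Value
	}
	return string(out)
}

func main() {
	a, b := Replica{}, Replica{}
	a[ID{1, "user1"}] = Char{ID{1, "user1"}, 'h'}
	a[ID{2, "user1"}] = Char{ID{2, "user1"}, 'i'}
	b[ID{1, "user2"}] = Char{ID{1, "user2"}, 'y'}
	a.Merge(b)
	b.Merge(a)
	fmt.Println(a.Text(), b.Text(), a.Text() == b.Text())
}
```

Both replicas print the same string regardless of merge order, which is the guarantee the ID-based ordering is meant to provide.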

3. Architecture

┌──────────────────┐    ┌──────────────────┐
│ Browser (User A) │    │ Browser (User B) │
└────────┬─────────┘    └────────┬─────────┘
         │                       │
         └───────────┬───────────┘
                     ↓
          WebSocket Connection
                    ↓
          ┌──────────────────┐
          │ Collaboration    │
          │ Server (Node.js) │
          └────────┬─────────┘
                   ↓
           ┌───────────────┐
           │ Redis (OT ops)│ (for real-time)
           │ Postgres (doc)│ (for persistence)
           └───────────────┘

4. Edge cases

- User offline: queue ops locally, sync when online
- Concurrent editing: the OT or CRDT merge logic handles it
- Persistence: save snapshot + ops log
- Undo/Redo: easy with op log

Question 5: Data Consistency Models

Consistency Spectrum

Strong Consistency ←→ Eventual Consistency

Strong: All reads see the latest write (typically via ACID transactions on a single DB)
        E.g., Bank transfer: money visible immediately everywhere

Eventual: Reads may be stale, but converge to latest (BASE)
         E.g., Instagram likes: count may lag by few seconds

Choose based on domain, not just preference.

When Strong Consistency:

- Financial systems: money must be accurate always
- Booking system: seat must be allocated atomically
- Inventory: stock count must be accurate
- Technique: Single DB with transactions, or distributed consensus (Raft)

When Eventual Consistency:

- Social media: likes, comments can lag
- Recommendations: data can be stale
- Reporting: data doesn't need to be real-time
- Technique: Event-driven, async processing, caching

Example:

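// Strong consistency (monolith + single DB)
BEGIN TRANSACTION
  INSERT INTO orders ...
  UPDATE inventory SET qty = qty - 1 WHERE product_id = ? AND qty > 0
  // If the UPDATE affects 0 rows (out of stock), ROLLBACK instead of COMMIT
COMMIT

// Every reader sees the order and the inventory update together, or neither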

// Eventual consistency (event-driven)
1. Order service: INSERT INTO orders → publish OrderCreated event
2. Inventory service: subscribes to OrderCreated → updates inventory async
   (If inventory write fails, retry later)

// Short window where order exists but inventory not updated
// But eventually converges
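
The stale window in the event-driven case can be simulated in a few lines of Go. Everything here is a hypothetical in-memory stand-in: `System` plays both services, and a buffered channel plays the message broker.

```go
package main

import "fmt"

type OrderCreated struct {
	OrderID   string
	ProductID string
	Qty       int
}

// System bundles both services for the sake of a single-file sketch.
type System struct {
	Orders    map[string]bool
	Inventory map[string]int
	bus       chan OrderCreated // stand-in for a message broker
}

func NewSystem(stock map[string]int) *System {
	return &System{Orders: map[string]bool{}, Inventory: stock, bus: make(chan OrderCreated, 16)}
}

// CreateOrder writes the order and publishes an event; inventory is untouched.
// It blocks if more than 16 events are pending.
func (s *System) CreateOrder(e OrderCreated) {
	s.Orders[e.OrderID] = true
	s.bus <- e
}

// DrainEvents plays the inventory consumer: it applies every pending event.
func (s *System) DrainEvents() {
	for {
		select {
		case e := <-s.bus:
			s.Inventory[e.ProductID] -= e.Qty
		default:
			return
		}
	}
}

func main() {
	s := NewSystem(map[string]int{"widget": 10})
	s.CreateOrder(OrderCreated{OrderID: "o1", ProductID: "widget", Qty: 1})
	fmt.Println("before consumer runs:", s.Inventory["widget"]) // stale window
	s.DrainEvents()
	fmt.Println("after consumer runs:", s.Inventory["widget"]) // converged
}
```

Between `CreateOrder` and `DrainEvents` the order exists while the stock count is still the old value, which is exactly the window a reader has to tolerate.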

Question 6: When to Refactor Architecture

Red flags to refactor:

| Flag                  | Indicator                                                    |
|-----------------------|--------------------------------------------------------------|
| Complexity            | A new feature takes 3x longer than before                    |
| Reliability           | Cascading failures (one service dies → everything goes down) |
| Scaling               | Can't handle peak load even with horizontal scaling          |
| Operational Nightmare | Deploy takes 1+ hour, deployments are scary                  |
| Team Churn            | Engineers leaving because the code is too messy              |
| Data Consistency      | Data inconsistency bugs increasing                           |

Refactor strategy (incremental, not big-bang):

Phase 1: Identify pain
  → Which service most problematic?
  → Which bounded context should be separate?

Phase 2: Extract seam
  → Create clear interface between this service and rest
  → Add integration tests

Phase 3: Dual-write
  → New code writes to both old + new service
  → New code reads from new service (fallback to old if fails)

Phase 4: Migrate traffic
  → Route 5% traffic to new service
  → Monitor, increase to 100%

Phase 5: Decommission
  → After stable, remove old service + dual-write

ANTI-PATTERN: "Let's rewrite everything from scratch" → Big-bang rewrites fail far more often than they succeed. Incremental is safer.


Question 7: Technical Debt

Defining: Shortcuts taken today that cost tomorrow.

Technical Debt ≈ Financial Debt
- You borrow now (ship fast)
- You pay interest later (every change is harder)
- Eventually, you can't pay (system unmaintainable)

Types

1. Quick fix

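// WRONG: Hardcoded logic to ship faster
// (Order is assumed to have a Total float64 field)
func ProcessOrder(order *Order) float64 {
    discount := 0.0
    if order.Total > 100 {
        discount = 0.1
    } else if order.Total > 50 {
        discount = 0.05
    }
    // A couple of months later, business says "loyalty members always get 20%"
    // Now the code is wrong and must be refactored
    return order.Total * (1 - discount)
}

// RIGHT: Config-driven
type DiscountPolicy struct {
    // Thresholds must be sorted by MinAmount, highest first
    Thresholds []struct {
        MinAmount float64
        Discount  float64
    }
}

// DiscountFor returns the discount of the first threshold the total exceeds
func (p DiscountPolicy) DiscountFor(total float64) float64 {
    for _, t := range p.Thresholds {
        if total > t.MinAmount {
            return t.Discount
        }
    }
    return 0
}

// To change: update config, not code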

2. Missing tests

Debt: Code works, but unverified
Cost: Next person touches it, breaks something → cascading failures

3. Lack of documentation

Debt: Code exists, but the "why" is unknown
Cost: Next person spends 2 days reverse-engineering

Managing Tech Debt

✓ Healthy approach:

  • 10-20% dev time = refactoring / tech debt
  • Document debt: "ADR-XXX: We hardcoded discount, revisit in 2 months"
  • Review debt quarterly: is it getting paid down, or accumulating?

✗ Unhealthy:

  • Zero tech debt: impossible, paralyzes development
  • Unlimited tech debt: system becomes legacy nightmare

Summary Framework for Architecture Decisions

When facing architecture choice:

1. Clarify Constraints
   - Team size?
   - Timeline?
   - Scale?
   - Budget?

2. List Options
   - Monolith / Microservices
   - SQL / NoSQL
   - Sync / Async
   - Single DC / Multi-region

3. Evaluate Trade-offs
   - Complexity?
   - Cost?
   - Scalability?
   - Operational burden?

4. Make Decision
   - Choose option with best trade-off for constraints
   - Document in ADR (why, not just what)
   - Set revisit date

5. Revisit
   - Quarter / Half-year: is decision still valid?
   - Constraints changed? Refactor incrementally

Summary

| Question                  | Framework                                                                             |
|---------------------------|---------------------------------------------------------------------------------------|
| Monolith vs Microservices | Check team size, domain complexity, scaling needs                                     |
| How to Evaluate           | Score on: Scalability, Operational Complexity, Productivity, Cost, Tech Flexibility   |
| Why Design Decision       | Use ADR template: context, decision, rationale, consequences                          |
| Interview Design Question | Clarify → identify bottleneck → design → edge cases → scale                           |
| Data Consistency          | Know strong vs eventual, pick based on domain                                         |
| When Refactor             | Red flags: complexity, reliability, scaling limit, operations nightmare               |
| Tech Debt                 | Borrow now, pay later; manage with 10-20% refactoring time                            |

Next steps

  • Distributed Systems: Deep dive into consistency, consensus, failure modes
  • System Design: Practice designing large-scale systems
  • Leadership: How to make architecture decisions as a team