Software Architecture: Interview & Big Picture
Introduction
This section collects the big-picture questions that commonly come up in system design interviews and in tech lead roles, along with a framework for answering each.
Focus: this is not a theory review. It is about thinking like an architect: weighing trade-offs, making decisions, and documenting the reasons.
Question 1: Monolith vs Microservices
Context: "Should we build the system as a monolith or as microservices?"
When to Choose Monolith ✅
Conditions:
- Team < 5 people
- Domain < 3 major contexts
- Deploy frequency: weekly or less often
- Low latency requirement
- Single database transaction OK
Advantages:
- ✓ Simple to develop, debug, deploy
- ✓ Single deployment unit → less operational overhead
- ✓ Easier to ensure ACID consistency
- ✓ Better performance (no network hop)
Disadvantages:
- ✗ Tight coupling as codebase grows
- ✗ Hard to scale individual modules
- ✗ Tech stack lock-in (letting each team use a different framework is costly)
- ✗ A single bug (e.g. a crash or memory leak) can take the whole system down
Example: an MVP, or a startup in its first 6 months → a monolith is the right call.
Monolith
┌──────────────────────────────┐
│ Order Service                │
│ Payment Service              │
│ Inventory Service            │
│ User Service                 │
│                              │
│ Shared: Database, Cache      │
└──────────────────────────────┘
When to Choose Microservices 🎯
Conditions:
- Team >= 5 (ideally, 2-pizza teams per service)
- Domain complexity >= 3-4 bounded contexts
- Different scaling needs per service (Payment hot, Reporting cold)
- Decoupled deployment + independent release cadence
- Multiple tech stacks needed
Advantages:
- ✓ Independent scaling, deployment
- ✓ Tech diversity: use Go for Order, Python for ML, Node for API
- ✓ Loose coupling: team autonomy
- ✓ Resilience: one service down ≠ whole system down
Disadvantages:
- ✗ Distributed system complexity: network calls, eventual consistency
- ✗ Operational overhead: service discovery, logging, monitoring
- ✗ Data consistency harder (no single transaction)
- ✗ Need strong team culture (else: chaos)
Example: a company operating at Uber's scale, with many teams and services that scale very differently.
Microservices
┌─────────────────┐   ┌─────────────────┐
│  Order Service  │   │ Payment Service │
└─────────────────┘   └─────────────────┘
         ↓                     ↓
      Order DB             Payment DB
         ↓                     ↓
      [Event Bus / Message Queue]
         ↓                     ↓
┌─────────────────┐   ┌─────────────────┐
│Inventory Service│   │Reporting Service│
└─────────────────┘   └─────────────────┘
The Monolith-to-Microservices Journey
Red flags that suggest it is time to consider breaking out microservices:
- Deploy time > 20 min
- Different scaling needs: some services need 100 replicas, others 1
- Team friction: teams stepping on each other's toes
- Business need: time to market matters more than simplicity
Wrong approach: "We're scaling, so let's break into microservices now" → often results in a distributed monolith (the worst of both worlds)
Right approach: Stay with the monolith as long as possible. When the pain is obvious, break it up strategically:
- Identify candidate: Service with different scaling needs or separate team
- Create seam: Extract to library first, then separate service
- Async first: Use event-driven for inter-service comms
// Phase 1: Monolith with clear boundary
myapp/
├── order/ // Order context
├── payment/ // Payment context (candidate for extraction)
└── shared/
// Phase 2: Extract payment as library
payment-lib/ // Shared library
myapp/
├── order/
└── payment-client/ // Consumes payment-lib
// Phase 3: Separate service
payment-service/ // Independent service
myapp/
├── order/
└── payment-client/ // Makes gRPC calls to payment-service
// Note: a monolith rarely splits this cleanly. Expect the extraction to be messy.
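The "create seam" step can be sketched in Go as an interface that the order code depends on, so the payment implementation can later move out of process without touching callers. This is a minimal sketch, not a prescribed layout: `PaymentGateway`, `inProcessPayments`, and `PlaceOrder` are hypothetical names introduced for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// PaymentGateway is the seam: order code depends only on this interface,
// so the implementation can move from in-process to a remote service later.
// All names in this sketch are hypothetical.
type PaymentGateway interface {
	Charge(orderID string, amountCents int64) error
}

// inProcessPayments is the Phase 1/2 implementation living inside the monolith.
type inProcessPayments struct {
	charged map[string]int64
}

func (p *inProcessPayments) Charge(orderID string, amountCents int64) error {
	if amountCents <= 0 {
		return errors.New("amount must be positive")
	}
	p.charged[orderID] = amountCents
	return nil
}

// PlaceOrder knows nothing about how payments are implemented.
// In Phase 3 a gRPC-backed PaymentGateway could be passed in instead.
func PlaceOrder(gw PaymentGateway, orderID string, amountCents int64) (string, error) {
	if err := gw.Charge(orderID, amountCents); err != nil {
		return "FAILED", fmt.Errorf("charge order %s: %w", orderID, err)
	}
	return "CONFIRMED", nil
}

func main() {
	gw := &inProcessPayments{charged: map[string]int64{}}
	status, err := PlaceOrder(gw, "order123", 10000)
	fmt.Println(status, err) // CONFIRMED <nil>
}
```

Because `PlaceOrder` only sees the interface, swapping the in-process implementation for a remote client is a change at the wiring point rather than in the order logic.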
Question 2: How to Evaluate Architecture?
Context: "Is our current architecture good?"
Evaluation Framework
1. Scalability
Question: Can we handle 10x traffic without rewrite?
---
Monolith: Scaling is coarse-grained (every new instance carries the whole application, even if only one module is hot)
Microservices: Can scale hot services (Payment) independently
Metric: Response time / CPU under peak load
Red flag: CPU 100%, response time degrades non-linearly
2. Operational Complexity
Question: How hard is it to deploy, monitor, recover from failure?
---
Monolith: Easy (one process, one database)
Microservices: Hard (N databases, N services, N failure modes)
Metric: Mean time to recovery (MTTR)
Red flag: The average incident takes 2+ hours to resolve
3. Team Productivity
Question: Can each team work independently?
---
Monolith: Bottleneck (all teams in same repo)
Microservices: Autonomy (team owns service)
Metric: Deployment frequency, deployment size
Red flag: Deploys only happen in a fixed window (e.g. Friday afternoon), and each deploy affects multiple teams
4. Cost
Question: How much $ to run this system?
---
Monolith: 1 DB, 3 instances, 1 cache = low cost
Microservices: 5 DBs, 15 instances, 5 caches, load balancers, service mesh = on the order of 10x the cost in this illustrative setup
Metric: Infrastructure cost per request / per user
Red flag: Cost grows faster than revenue
5. Technology Flexibility
Question: Can we adopt new tech without rewrite?
---
Monolith: Stuck with first framework choice
Microservices: Each service can be different
Metric: Can we experiment with new language/framework?
Red flag: Forced to stay on a 10-year-old framework because the system is too coupled
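Scoring (illustrative; higher is better, so a high Operational Complexity or Cost score means simpler to operate or cheaper to run):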
| Dimension | Monolith | Microservices |
|---|---|---|
| Scalability | 3/10 | 9/10 |
| Operational Complexity | 9/10 | 3/10 |
| Team Productivity | 5/10 | 9/10 |
| Cost | 9/10 | 3/10 |
| Tech Flexibility | 2/10 | 9/10 |
Total: Monolith = 28/50, Microservices = 33/50 → Choose based on your priorities.
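"Choose based on your priorities" can be made concrete by weighting each dimension before summing. The sketch below assumes the scores from the table above; the weights (and the "startup" profile) are hypothetical values chosen only to show how priorities can flip the result.

```go
package main

import "fmt"

// Scores copied from the table above, in the order:
// Scalability, Operational Complexity, Team Productivity, Cost, Tech Flexibility.
var (
	monolith      = []float64{3, 9, 5, 9, 2}
	microservices = []float64{9, 3, 9, 3, 9}
)

// WeightedScore multiplies each dimension score by its weight and sums the result.
// The weights are an assumption of this sketch; the text only gives raw scores.
func WeightedScore(scores, weights []float64) float64 {
	total := 0.0
	for i, s := range scores {
		total += s * weights[i]
	}
	return total
}

func main() {
	equal := []float64{1, 1, 1, 1, 1}
	// Hypothetical early-stage startup: operations and cost matter most.
	startup := []float64{0.5, 2, 1, 2, 0.5}

	fmt.Println(WeightedScore(monolith, equal), WeightedScore(microservices, equal))     // 28 33
	fmt.Println(WeightedScore(monolith, startup), WeightedScore(microservices, startup)) // 43.5 30
}
```

With equal weights the totals match the table (28 vs 33); with the hypothetical startup weights the monolith comes out ahead, which is the point of scoring against your own priorities.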
Question 3: Architecture Decision Records (ADR)
Context: "Why did we choose X? Can we change it later?"
What is ADR?
An ADR is a short document, written from a simple template, that records one architectural decision:
# ADR-001: Monolith for MVP
## Status
Accepted
## Context
- Team: 3 engineers
- Timeline: 6 months to MVP
- Budget: Limited
## Decision
We will start with monolithic architecture in Go.
## Rationale
1. Simplicity: A monolith is faster to develop, deploy, and debug
2. Team size: <5 devs, no need for independent scaling yet
3. Consistency: ACID transactions important for order correctness
4. Operational: Single deployment, single database
## Consequences
- Good: Developer productivity high, operations simple
- Bad: Scaling is limited to adding whole-application instances; will need refactoring later
- Ugly: If domain grows, may become tightly coupled
## Alternatives considered
- Microservices: Overkill at this stage, high operational overhead
- Serverless: Good for certain functions, but complex for persistent data
## Revisit date
2025-Q2, or earlier if traffic exceeds 1000 req/sec or the team grows beyond 5
Benefits:
- Document "why" for future self
- Justify decision to stakeholders
- Clear revisit trigger
- Team alignment
Store ADRs:
- Repo: docs/adr/ folder
- Format: Markdown
- Naming: ADR-001-monolith.md, ADR-002-postgres-vs-mongodb.md
Question 4: Common Architecture Interview Questions
Q: "Design a high-scale notification system"
Framework to answer:
1. Clarify requirements
Ask the interviewer: Peak QPS? Delivery latency? Reliability requirement?
Assume:
- 10M notifications / day (≈ 116/sec on average)
- 100K QPS peak (far above the average, so the design must absorb large bursts)
- < 5 sec delivery
- 99.9% reliability
- Channels: Email, SMS, Push
2. Identify bottlenecks
Naive: The service receives a notification request → sends it immediately
Problem: If the Email API is slow, it blocks other notifications
Solution: Put an async queue between accepting and sending
3. Design flow
Request
↓
Validation
↓
Enqueue (to RabbitMQ / Kafka)
↓ (async)
Workers (Email, SMS, Push)
↓
Retry logic (if fails)
4. Handle edge cases
- Notification failed: retry (exponential backoff)
- Duplicate: idempotency key (notification ID), so a re-delivered message is sent only once
- Rate limit: token bucket
- Persistence: store in DB before sending (for replay)
5. Scale
- Multi-region: send through a region-local queue
- Partitioning: by user ID
- Circuit breaker: if the Email API is down, fail fast
Q: "Design payment processing system"
Framework:
1. Core flow
User     → Select payment method → Charge   → Verify → Confirm
  ↓               ↓                  ↓          ↓         ↓
Validate    Store method          API call    Check    DB save
            temporarily
2. Consistency
Must ensure: Order confirmed ⟺ Payment captured
Solution: Saga pattern or 2-phase commit
Saga (recommended):
1. Order service: create order (PENDING)
2. Payment service: charge card (via async event)
3. Order service: if success, change to CONFIRMED; else FAILED
4. Compensating transaction: undo the steps already completed when a later one fails, e.g. if the charge fails, release the order and anything it reserved
3. Idempotency
If the network fails mid-charge, the client may retry.
The retry must not double-charge.
Solution: Idempotency key (orderID as key)
- First attempt: charge($100, key=order123) → success
- Retry: charge($100, key=order123) → returns same result (idempotent)
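The idempotency-key behaviour above can be sketched with an in-memory map from key to stored result. This is only an illustration: `IdempotentCharger` and `ChargeResult` are hypothetical, and a real payment service would keep the keys in a durable, shared store rather than in process memory.

```go
package main

import (
	"fmt"
	"sync"
)

// ChargeResult is what the (hypothetical) payment gateway returned for a key.
type ChargeResult struct {
	ChargeID    string
	AmountCents int64
}

// IdempotentCharger remembers the result per idempotency key, so a retry
// with the same key returns the stored result instead of charging again.
type IdempotentCharger struct {
	mu      sync.Mutex
	results map[string]ChargeResult
	charges int // how many real charges were performed
}

func NewIdempotentCharger() *IdempotentCharger {
	return &IdempotentCharger{results: make(map[string]ChargeResult)}
}

// Charge performs the charge once per key and replays the result afterwards.
func (c *IdempotentCharger) Charge(key string, amountCents int64) ChargeResult {
	c.mu.Lock()
	defer c.mu.Unlock()
	if r, ok := c.results[key]; ok {
		return r // retry: replay the first result
	}
	c.charges++
	r := ChargeResult{ChargeID: fmt.Sprintf("ch_%d", c.charges), AmountCents: amountCents}
	c.results[key] = r
	return r
}

func main() {
	c := NewIdempotentCharger()
	first := c.Charge("order123", 10000)
	retry := c.Charge("order123", 10000)
	fmt.Println(first == retry, c.charges) // true 1
}
```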
4. Security
- Never store raw card details yourself (doing so brings the system into scope for heavy PCI DSS compliance)
- Tokenize: Card → Payment Gateway → Token
- Pass token in subsequent requests
Q: "Design a collaborative document editor (like Google Docs)"
Framework:
1. Core: Real-time sync
User A types "hello" → Broadcast to User B (< 100ms)
User B types "world" → Broadcast to User A
Challenge: Concurrent edits
2. Handling concurrency
Without coordination: User A changes positions 0-5 while User B changes positions 0-3
→ Conflict! Whose edit wins?
Solution: Operational Transform (OT) or CRDT (Conflict-free Replicated Data Type)
CRDT (simpler to understand):
- Each character gets unique ID (user_id + lamport_clock)
- Insert: character("h", id=user1.001)
- When syncing, order characters by ID (so every client ends up with the same order)
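The ID-ordering idea above can be sketched as a set of characters keyed by (Lamport clock, user ID) and rendered in ID order. This is a deliberately simplified illustration of convergence only: it interleaves concurrent edits rather than preserving each user's intended position, which real sequence CRDTs handle with position-aware identifiers. `ID`, `Char`, and `Replica` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// ID orders characters by Lamport clock first, then by user ID as a tie-breaker.
type ID struct {
	Clock int
	User  string
}

// Char is one inserted character together with its unique ID.
type Char struct {
	ID    ID
	Value rune
}

// Replica holds the set of characters it has seen, keyed by ID.
type Replica map[ID]rune

// Apply adds a character; applying the same character twice is a no-op.
func (r Replica) Apply(c Char) { r[c.ID] = c.Value }

// Text renders the characters sorted by ID, so every replica that has seen
// the same set of characters produces the same string.
func (r Replica) Text() string {
	ids := make([]ID, 0, len(r))
	for id := range r {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool {
		if ids[i].Clock != ids[j].Clock {
			return ids[i].Clock < ids[j].Clock
		}
		return ids[i].User < ids[j].User
	})
	out := make([]rune, 0, len(ids))
	for _, id := range ids {
		out = append(out, r[id])
	}
	return string(out)
}

func main() {
	ops := []Char{
		{ID{1, "user1"}, 'h'}, {ID{2, "user1"}, 'i'},
		{ID{1, "user2"}, 'y'}, {ID{2, "user2"}, 'o'},
	}
	a, b := Replica{}, Replica{}
	for _, c := range ops { // replica A receives the ops in order
		a.Apply(c)
	}
	for i := len(ops) - 1; i >= 0; i-- { // replica B receives them reversed
		b.Apply(ops[i])
	}
	fmt.Println(a.Text(), b.Text(), a.Text() == b.Text()) // hyio hyio true
}
```

Both replicas converge to the same text regardless of delivery order, which is the property that removes the need for a central "who wins" decision.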
3. Architecture
┌──────────────────┐      ┌──────────────────┐
│ Browser (User A) │      │ Browser (User B) │
└────────┬─────────┘      └────────┬─────────┘
         │                         │
         └────────────┬────────────┘
                      ↓
            WebSocket Connection
                      ↓
             ┌──────────────────┐
             │ Collaboration    │
             │ Server (Node.js) │
             └────────┬─────────┘
                      ↓
             ┌──────────────────┐
             │ Redis (OT ops)   │  (for real-time)
             │ Postgres (doc)   │  (for persistence)
             └──────────────────┘
4. Edge cases
- User offline: queue ops locally, sync when back online
- Concurrent editing: handled by the OT or CRDT algorithm chosen above
- Persistence: save a snapshot plus the ops log
- Undo/Redo: built on the op log (invert the user's own ops)
Question 5: Data Consistency Models
Consistency Spectrum
Strong Consistency ←→ Eventual Consistency
Strong: All reads see latest write (ACID)
E.g., Bank transfer: money visible immediately everywhere
Eventual: Reads may be stale, but converge to latest (BASE)
E.g., Instagram likes: count may lag by a few seconds
Choose based on domain, not just preference.
When Strong Consistency:
- Financial systems: money must be accurate always
- Booking system: seat must be allocated atomically
- Inventory: stock count must be accurate
- Technique: Single DB with transactions, or distributed consensus (Raft)
When Eventual Consistency:
- Social media: likes, comments can lag
- Recommendations: data can be stale
- Reporting: data doesn't need to be real-time
- Technique: Event-driven, async processing, caching
Example:
// Strong consistency (monolith + single DB)
BEGIN TRANSACTION
INSERT INTO orders ...
UPDATE inventory SET qty = qty - 1 WHERE product_id = ?
COMMIT
// All readers see the order and the inventory update at the same time, or neither
// Eventual consistency (event-driven)
1. Order service: INSERT INTO orders → publish OrderCreated event
2. Inventory service: subscribes to OrderCreated → updates inventory async
(If inventory write fails, retry later)
// Short window where order exists but inventory not updated
// But eventually converges
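The eventual-consistency window above can be shown with a small in-memory simulation, where a buffered channel stands in for the event bus. This is a sketch for illustration only; `OrderCreated`, `Inventory`, and `Consume` are hypothetical names, and the consumer is started late on purpose to make the stale read deterministic.

```go
package main

import (
	"fmt"
	"sync"
)

// OrderCreated is a hypothetical event published by the order service.
type OrderCreated struct {
	ProductID string
	Qty       int
}

// Inventory is the inventory service's own state, updated only by its consumer.
type Inventory struct {
	mu    sync.Mutex
	stock map[string]int
}

// Get returns the stock as currently known to the inventory service.
func (inv *Inventory) Get(productID string) int {
	inv.mu.Lock()
	defer inv.mu.Unlock()
	return inv.stock[productID]
}

// Consume applies events until the channel is closed, then signals done.
func (inv *Inventory) Consume(events <-chan OrderCreated, done chan<- struct{}) {
	for e := range events {
		inv.mu.Lock()
		inv.stock[e.ProductID] -= e.Qty
		inv.mu.Unlock()
	}
	close(done)
}

func main() {
	inv := &Inventory{stock: map[string]int{"sku-1": 10}}
	events := make(chan OrderCreated, 16) // stand-in for the event bus
	done := make(chan struct{})

	// Order service: the order is "saved", then the event is published.
	events <- OrderCreated{ProductID: "sku-1", Qty: 1}
	// No consumer has run yet, so a read here still sees the old stock.
	fmt.Println("before consumer runs:", inv.Get("sku-1")) // 10

	go inv.Consume(events, done)
	close(events)
	<-done
	// After the consumer has drained the bus, the state has converged.
	fmt.Println("after convergence:", inv.Get("sku-1")) // 9
}
```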
Question 6: When to Refactor Architecture
Red flags to refactor:
| Flag | Indicator |
|---|---|
| Complexity | New feature takes 3x longer than before |
| Reliability | Cascading failures (one service dies → everything goes down) |
| Scaling | Can't handle peak load even with horizontal scale |
| Operational Nightmare | Deploy takes 1+ hour, deployments are scary |
| Team Churn | Engineers leaving because the code is too messy |
| Data Consistency | Data inconsistency bugs increasing |
Refactor strategy (incremental, not big-bang):
Phase 1: Identify pain
→ Which service most problematic?
→ Which bounded context should be separate?
Phase 2: Extract seam
→ Create clear interface between this service and rest
→ Add integration tests
Phase 3: Dual-write
→ New code writes to both old + new service
→ New code reads from new service (fallback to old if fails)
Phase 4: Migrate traffic
→ Route 5% traffic to new service
→ Monitor, increase to 100%
Phase 5: Decommission
→ After stable, remove old service + dual-write
ANTI-PATTERN: "Let's rewrite everything from scratch" → big-bang rewrites fail far more often than they succeed. Incremental is safer.
Question 7: Technical Debt
Definition: Shortcuts taken today that cost you tomorrow.
Technical Debt ≈ Financial Debt
- You borrow now (ship fast)
- You pay interest later (every change is harder)
- Eventually, you can't pay (system unmaintainable)
Types
1. Quick fix
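// WRONG: thresholds hardcoded to ship faster
type Order struct{ Total float64 }

func ProcessOrder(order *Order) float64 {
	discount := 0.0
	if order.Total > 100 {
		discount = 0.1
	} else if order.Total > 50 {
		discount = 0.05
	}
	// A couple of months later the business says "loyalty members always get 20%".
	// The rule is now wrong, and every change means editing and redeploying code.
	return order.Total * (1 - discount)
}

// BETTER: config-driven
type DiscountPolicy struct {
	// Thresholds are loaded from config rather than written into the code.
	Thresholds []struct {
		MinAmount float64 // order total must exceed this amount
		Discount  float64 // fraction taken off, e.g. 0.1 = 10%
	}
}
// To change a threshold: update config, not code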
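A config-driven policy could be loaded and applied as sketched below. The JSON field names (`thresholds`, `min_amount`, `discount`) and the `LoadPolicy` / `DiscountFor` helpers are assumptions made for this example, not part of the original snippet.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Threshold says: totals above MinAmount get Discount (a fraction, 0.1 = 10%).
type Threshold struct {
	MinAmount float64 `json:"min_amount"`
	Discount  float64 `json:"discount"`
}

// DiscountPolicy is the whole policy as stored in config.
type DiscountPolicy struct {
	Thresholds []Threshold `json:"thresholds"`
}

// LoadPolicy parses a policy from JSON; the field names are assumptions of this sketch.
func LoadPolicy(raw []byte) (DiscountPolicy, error) {
	var p DiscountPolicy
	err := json.Unmarshal(raw, &p)
	return p, err
}

// DiscountFor returns the largest discount whose MinAmount the total exceeds.
func (p DiscountPolicy) DiscountFor(total float64) float64 {
	best := 0.0
	for _, t := range p.Thresholds {
		if total > t.MinAmount && t.Discount > best {
			best = t.Discount
		}
	}
	return best
}

func main() {
	raw := []byte(`{"thresholds":[{"min_amount":50,"discount":0.05},{"min_amount":100,"discount":0.1}]}`)
	p, err := LoadPolicy(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(p.DiscountFor(120), p.DiscountFor(75), p.DiscountFor(20)) // 0.1 0.05 0
}
```

Changing a threshold is now a config edit; a genuinely new kind of rule (such as the loyalty example) would still need code, so the policy shape is itself a design decision worth recording.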
2. Missing tests
Debt: Code works, but unverified
Cost: The next person who touches it breaks something without noticing → cascading failures
3. Lack of documentation
Debt: Code exists, but "why" unknown
Cost: Next person spends 2 days reverse-engineering
Managing Tech Debt
✓ Healthy approach:
- 10-20% dev time = refactoring / tech debt
- Document debt: "ADR-XXX: We hardcoded discount, revisit in 2 months"
- Review debt quarterly: is it getting paid down, or accumulating?
✗ Unhealthy:
- Aiming for zero tech debt: impossible, and the attempt paralyzes development
- Unlimited tech debt: system becomes legacy nightmare
Summary Framework for Architecture Decisions
When facing architecture choice:
1. Clarify Constraints
- Team size?
- Timeline?
- Scale?
- Budget?
2. List Options
- Monolith / Microservices
- SQL / NoSQL
- Sync / Async
- Single DC / Multi-region
3. Evaluate Trade-offs
- Complexity?
- Cost?
- Scalability?
- Operational burden?
4. Make Decision
- Choose option with best trade-off for constraints
- Document in ADR (why, not just what)
- Set revisit date
5. Revisit
- Quarterly / half-yearly: is the decision still valid?
- Constraints changed? Refactor incrementally
Summary
| Question | Framework |
|---|---|
| Monolith vs Microservices | Check team size, domain complexity, scaling needs |
| How to Evaluate | Score on: Scalability, Operational Complexity, Productivity, Cost, Tech Flexibility |
| Why This Design Decision (ADR) | Use ADR template: context, decision, rationale, consequences |
| Interview Design Question | Clarify → identify bottleneck → design → edge cases → scale |
| Data Consistency | Know strong vs eventual, pick based on domain |
| When Refactor | Red flags: complexity, reliability, scaling limit, operations nightmare |
| Tech Debt | Borrow now, pay later; manage with 10-20% refactoring time |
Next steps
- Distributed Systems: Deep dive into consistency, consensus, failure modes
- System Design: Practice designing large-scale systems
- Leadership: How to make architecture decisions as a team