What is denormalization? When should you denormalize a database?

Denormalization deliberately reverses normalization: you accept duplicated data to avoid expensive JOINs, trading faster reads for slower writes, more storage, and a risk of inconsistent data. Do it only after measurements show a need, not as premature optimization.

A few common patterns:
- Duplicated column: store orders.user_email even though users.email exists → reports skip the JOIN.
- Cached computed column: users.post_count updated on post insert/delete instead of COUNT(*) every time.
- Materialized view: a precomputed result set that is refreshed periodically, e.g. in PostgreSQL with CREATE MATERIALIZED VIEW + REFRESH ... CONCURRENTLY (the concurrent refresh requires a unique index on the view).
- Snapshot: embed the address into the order at purchase time (order history must show the address as it was then, not the customer's current address).
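
As a rough illustration of the cached computed column pattern above, here is a minimal sketch using Python's built-in sqlite3 module. The schema and the helper names add_post and delete_post are assumptions made up for this example; the point is only that the counter is updated in the same transaction as the post itself, so the two cannot drift apart on a failed write.

```python
import sqlite3

# In-memory database; the schema below is hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id         INTEGER PRIMARY KEY,
        email      TEXT NOT NULL,
        post_count INTEGER NOT NULL DEFAULT 0  -- cached computed column
    );
    CREATE TABLE posts (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        title   TEXT NOT NULL
    );
""")


def add_post(conn, user_id, title):
    """Insert a post and bump the cached counter in one transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        cur = conn.execute(
            "INSERT INTO posts (user_id, title) VALUES (?, ?)", (user_id, title)
        )
        conn.execute(
            "UPDATE users SET post_count = post_count + 1 WHERE id = ?", (user_id,)
        )
    return cur.lastrowid


def delete_post(conn, post_id):
    """Delete a post and decrement the owner's cached counter."""
    with conn:
        row = conn.execute(
            "SELECT user_id FROM posts WHERE id = ?", (post_id,)
        ).fetchone()
        if row is None:
            return False  # nothing to delete, counter untouched
        conn.execute("DELETE FROM posts WHERE id = ?", (post_id,))
        conn.execute(
            "UPDATE users SET post_count = post_count - 1 WHERE id = ?", (row[0],)
        )
    return True


conn.execute("INSERT INTO users (id, email) VALUES (1, 'a@example.com')")
first = add_post(conn, 1, "hello")
add_post(conn, 1, "world")
delete_post(conn, first)

# Cheap read of the cached value vs. the COUNT(*) it replaces.
cached = conn.execute("SELECT post_count FROM users WHERE id = 1").fetchone()[0]
actual = conn.execute("SELECT COUNT(*) FROM posts WHERE user_id = 1").fetchone()[0]
```

Comparing cached against actual, as in the last two lines, is also a simple way to detect drift in a periodic consistency check.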

Key point: every write to the source data must also update its duplicated copies (via triggers or application code). Use denormalization only when the read/write ratio is very high (e.g. > 100:1), report queries involve many JOINs, and adding indexes has still not met the response-time requirement.
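
The trigger route for re-syncing can be sketched as follows, again with Python's sqlite3 module so that it runs as-is. The table layout and the trigger name sync_order_email are assumptions for this example, and trigger syntax differs between database engines, so treat it as an illustration rather than a drop-in solution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL
    );
    CREATE TABLE orders (
        id         INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL REFERENCES users(id),
        user_email TEXT NOT NULL,  -- duplicated from users.email
        total      REAL NOT NULL
    );
    -- Re-sync the duplicated column whenever the source value changes.
    CREATE TRIGGER sync_order_email
    AFTER UPDATE OF email ON users
    FOR EACH ROW
    BEGIN
        UPDATE orders SET user_email = NEW.email WHERE user_id = NEW.id;
    END;
""")

conn.execute("INSERT INTO users (id, email) VALUES (1, 'old@example.com')")
conn.execute(
    "INSERT INTO orders (user_id, user_email, total) "
    "VALUES (1, 'old@example.com', 42.0)"
)

# A single write to users now costs an extra write to every matching order.
conn.execute("UPDATE users SET email = 'new@example.com' WHERE id = 1")

# The report reads orders alone, with no JOIN against users.
report = conn.execute("SELECT id, user_email, total FROM orders").fetchall()
```

This is where the "slower writes" part of the trade-off shows up: one email change fans out to all of that user's orders. A snapshot column, by contrast, should deliberately have no such trigger, because its purpose is to keep the value as it was at purchase time.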
