A Data Warehouse is a repository that stores structured, processed data organized under a predefined schema (typically a star or snowflake schema) for business intelligence and SQL analytics. Data goes through ETL (Extract, Transform, Load) before it is loaded, so the schema is enforced at write time (schema-on-write).
- Examples: Amazon Redshift, Google BigQuery, Snowflake.
- A Data Lake is a repository that stores raw data in any format (structured, semi-structured, or unstructured) at massive scale. The schema is applied when the data is read (schema-on-read) rather than when it is written.
- Examples: AWS S3 + Glue + Athena, Azure Data Lake, Hadoop HDFS.
- Use a Data Warehouse when: you are building BI dashboards, generating regular business reports, giving data analysts straightforward SQL access, or working where data quality is critical.
- Use a Data Lake when: data scientists and ML engineers need raw data, you want to keep all data for future analysis (the use cases are not known upfront), or you are ingesting log files and clickstream data.
- The Data Lakehouse is a newer architecture (built on table formats such as Databricks Delta Lake and Apache Iceberg) that combines both: data stays in object storage, while a table layer on top adds ACID transactions, schema enforcement, and query performance close to that of a warehouse.
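
The difference between schema-on-write and schema-on-read described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: an in-memory SQLite table stands in for a warehouse, a plain list of JSON strings stands in for files in a lake, and the names `RAW_EVENTS`, `load_warehouse`, and `read_lake` are hypothetical, not part of any product mentioned here.

```python
import json
import sqlite3

# Hypothetical raw events, as an application might emit them.
RAW_EVENTS = [
    '{"user_id": 1, "amount": 19.99}',
    '{"user_id": 2, "amount": "n/a"}',                      # malformed amount
    '{"user_id": 3, "amount": 5.0, "coupon": "SPRING"}',    # extra field
]


def load_warehouse(raw_events):
    """Schema-on-write: the table schema is enforced when rows are loaded."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
    conn.execute(
        "CREATE TABLE sales ("
        " user_id INTEGER NOT NULL,"
        " amount REAL NOT NULL CHECK (typeof(amount) = 'real'))"
    )
    rejected = []
    for line in raw_events:
        record = json.loads(line)
        try:
            # Only the columns in the schema are kept; extras are dropped.
            conn.execute(
                "INSERT INTO sales (user_id, amount) VALUES (?, ?)",
                (record["user_id"], record["amount"]),
            )
        except sqlite3.IntegrityError:
            # The row violates the schema, so it never reaches the table.
            rejected.append(line)
    return conn, rejected


def read_lake(raw_events, fields):
    """Schema-on-read: raw lines are kept as-is; each reader picks its fields."""
    rows = []
    for line in raw_events:
        record = json.loads(line)
        # Missing fields become None instead of blocking ingestion.
        rows.append({name: record.get(name) for name in fields})
    return rows


if __name__ == "__main__":
    conn, rejected = load_warehouse(RAW_EVENTS)
    print(conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0], "rows loaded")
    print(len(rejected), "row(s) rejected at write time")
    print(read_lake(RAW_EVENTS, ["user_id", "coupon"]))
```

In this sketch the warehouse path rejects the malformed record and drops the `coupon` field at load time, which keeps queries simple and data quality high. The lake path keeps every raw line, so a later reader can still ask for `coupon` even though nobody planned for it at ingestion.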