What is a database

A database is a storage location that houses structured data.

Databases are mostly used for transaction processing. They perform well for both reads and writes, but mainly on relatively small amounts of data.
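To make "used for transactions" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the names, and the transfer helper are hypothetical and only illustrate the idea: a group of writes either all succeed or are all rolled back.

```python
import sqlite3

# In-memory database; the table and account names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; return True on success."""
    try:
        with conn:  # commits if the block succeeds, rolls back on an exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; both updates are undone.
        return False


ok = transfer(conn, "alice", "bob", 30)
failed = transfer(conn, "alice", "bob", 500)  # would overdraw, so it is rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the second call fails, the credit already applied to bob inside that transaction is undone as well, which is the property transactional workloads rely on.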

Popular databases are:

  • Oracle
  • PostgreSQL
  • MongoDB
  • Redis
  • Elasticsearch
  • Apache Cassandra

Learn more about the key differences between databases: SQL vs NoSQL (coming later).

What is a data warehouse (DWH)

Data warehouses are large storage locations for data that you accumulate from a wide range of sources. More specifically, the process of creating a DWH can be seen as moving raw data input via Extract-Transform-Load (ETL) actions into a consolidated storage system to be used for analysis.
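The Extract-Transform-Load flow described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the two raw sources, their field names, and the sales table are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import json
import sqlite3

# Two hypothetical raw sources with different shapes and units.
CRM_CSV = "customer,amount_usd\nalice,10.50\nbob,4.00\n"
SHOP_JSON = '[{"user": "Alice", "total_cents": 250}, {"user": "carol", "total_cents": 999}]'


def extract():
    """Read raw records from each source as-is."""
    crm_rows = list(csv.DictReader(io.StringIO(CRM_CSV)))
    shop_rows = json.loads(SHOP_JSON)
    return crm_rows, shop_rows


def transform(crm_rows, shop_rows):
    """Map both shapes onto one schema: (source, customer, amount_cents)."""
    out = []
    for r in crm_rows:
        cents = round(float(r["amount_usd"]) * 100)
        out.append(("crm", r["customer"].lower(), cents))
    for r in shop_rows:
        out.append(("shop", r["user"].lower(), int(r["total_cents"])))
    return out


def load(conn, rows):
    """Write the consolidated rows into the analysis table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(source TEXT, customer TEXT, amount_cents INTEGER)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(warehouse, transform(*extract()))
totals = dict(
    warehouse.execute("SELECT customer, SUM(amount_cents) FROM sales GROUP BY customer")
)
```

Once both sources share one schema, a single query can answer questions that span them, such as total spend per customer.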

Data warehouses are mostly used for data analysis. They are optimized for reads rather than writes and handle large amounts of data well. They often denormalize data, accepting redundancy in exchange for faster queries.
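As a rough illustration of denormalization, the sketch below builds the same data twice in SQLite: once normalized (reads need a join) and once as a wide table with customer attributes copied onto every order row. The table and column names are hypothetical, and a tiny in-memory example cannot show the speed difference itself, only the trade of redundancy for simpler reads.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Normalized: each fact is stored once, so reads need a join.
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'alice', 'DE'), (2, 'bob', 'FR');
    INSERT INTO orders VALUES (10, 1, 30), (11, 1, 20), (12, 2, 5);

    -- Denormalized: customer attributes are copied onto every order row.
    CREATE TABLE orders_wide AS
    SELECT o.id AS order_id, o.amount, c.name AS customer_name, c.country
    FROM orders o JOIN customers c ON c.id = o.customer_id;
    """
)

# Same question, two layouts: revenue per country.
normalized = conn.execute(
    "SELECT c.country, SUM(o.amount) FROM orders o "
    "JOIN customers c ON c.id = o.customer_id "
    "GROUP BY c.country ORDER BY c.country"
).fetchall()
denormalized = conn.execute(
    "SELECT country, SUM(amount) FROM orders_wide "
    "GROUP BY country ORDER BY country"
).fetchall()

# The redundancy: alice's name and country now appear once per order.
alice_copies = conn.execute(
    "SELECT COUNT(*) FROM orders_wide WHERE customer_name = 'alice'"
).fetchone()[0]
```

Both queries return the same result, but the second one never touches the customers table. The cost is that any change to a customer's country would have to be applied to every copied row.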

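A data warehouse also typically spreads its data across multiple instances (nodes) so that a query can run on them in parallel, with rows distributed by a key such as the primary key.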
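One common way to distribute rows across instances is to hash a key and take the result modulo the number of nodes. The sketch below assumes that scheme; the node count, the sample rows, and the node_for helper are hypothetical, and real warehouses each have their own distribution options.

```python
import zlib
from collections import defaultdict

NUM_NODES = 3  # hypothetical cluster size


def node_for(key: str) -> int:
    """Pick a node from a stable hash of the distribution key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_NODES


# Hypothetical rows: (customer, amount), distributed by customer.
rows = [("alice", 30), ("bob", 5), ("alice", 20), ("carol", 7)]

nodes = defaultdict(list)
for customer, amount in rows:
    nodes[node_for(customer)].append((customer, amount))

# Each node aggregates its own slice; a coordinator merges the partial results.
partials = [sum(amount for _, amount in part) for part in nodes.values()]
total = sum(partials)
```

Because the hash is stable, all rows for the same key land on the same node, so per-key aggregates can be computed locally without moving data between nodes.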

Popular data warehouses are:

  • Snowflake
  • Yellowbrick
  • Teradata
  • Amazon Redshift

What is a data lake

A data lake is a large storage repository that holds a huge amount of raw data in its original format until you need it.
The use cases for data lakes are generally limited to data science research and testing, so the primary users of data lakes are data scientists and engineers. For a company that actually builds data warehouses, for instance, the data lake is a place to dump and temporarily store all the data until the data warehouse is up and running.
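Holding raw data "in its original format until you need it" is often described as applying the schema at read time. The sketch below assumes that reading: a temporary directory stands in for the lake's storage, and the file names, fields, and read_clicks helper are hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path

lake = Path(tempfile.mkdtemp())  # stand-in for the lake's storage

# Ingest: store each payload unchanged; no schema is enforced on write.
(lake / "events.json").write_text('[{"user": "alice", "clicks": 3}]')
(lake / "export.csv").write_text("user,clicks\nbob,5\n")


def read_clicks(path: Path):
    """Apply a schema only at read time, based on the file's format."""
    if path.suffix == ".json":
        records = json.loads(path.read_text())
    elif path.suffix == ".csv":
        with path.open(newline="") as f:
            records = list(csv.DictReader(f))
    else:
        return []  # unknown formats stay in the lake untouched
    return [(r["user"], int(r["clicks"])) for r in records]


clicks = sorted(row for p in lake.iterdir() for row in read_clicks(p))
```

Writing is cheap because nothing is validated or reshaped up front; the parsing work is deferred to whoever reads the data later.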

A data lake is essentially a “do-it-yourself” version of a data warehouse.

Data lakes are often built with a combination of open source and closed source technologies, making them easy to customize and able to handle increasingly complex workflows. Image courtesy of Lior Gavish/Monte Carlo.