The Great Data Shift: OLTP, OLAP, Data Warehouses, and the Rise of Data Lakes
Back in the day, when businesses needed systems to handle transactional data such as purchases and sales, they began using online transaction processing (OLTP). OLTP workloads are highly concurrent: thousands of transactions may run simultaneously, with isolation preventing them from conflicting with one another. Transactions are executed immediately, and updates are applied in real time. Each transaction is either fully completed (committed) or entirely rolled back on failure, ensuring atomicity. Once a transaction is committed, it is permanently stored, guaranteeing durability. And the database remains in a valid state before and after each transaction, maintaining consistency.
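Here is a minimal sketch of those guarantees in action, using Python's built-in sqlite3 module; the accounts table and the transfer amounts are made up purely for illustration.

```python
import sqlite3

# Hypothetical accounts table, used only to illustrate a transaction.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts (id, balance) VALUES (1, 100.0), (2, 50.0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
        # If either UPDATE raised an error, neither change would persist (atomicity).
except sqlite3.Error as exc:
    print("Transaction rolled back:", exc)

print(conn.execute("SELECT id, balance FROM accounts").fetchall())
```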
These systems incorporate backup mechanisms and replication strategies to ensure high availability and fault tolerance, preventing data loss and enabling recovery from failures. Data in OLTP systems is normalized to eliminate redundancy and protect integrity: each fact is stored exactly once, and the system is optimized for storage efficiency and consistency. OLTP systems are designed for low latency, ensuring real-time updates and fast transaction processing.
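To make normalization concrete, here is a small sketch with hypothetical customers and orders tables: customer details live in one place, and orders reference them by key instead of repeating them on every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    email       TEXT NOT NULL UNIQUE
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL NOT NULL,
    created_at  TEXT NOT NULL
);
""")
# Customer details are stored exactly once; every order points back to them
# via customer_id rather than duplicating the name and email on each row.
```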
As businesses grew, the demand for business intelligence and data analytics increased. OLTP was not designed for analytical queries such as aggregation, summarization, or joining data from multiple sources. This led to the emergence of online analytical processing (OLAP). OLAP is optimized for large-scale queries through denormalized data structures (e.g., star or snowflake schemas), which minimize the need for complex joins between tables and make queries more efficient. Denormalization stores data so that aggregates and dimensions are precomputed or easily accessible, leading to faster retrieval. OLAP is read-heavy: it handles complex, large-scale analytical queries in which data is primarily read and aggregated rather than frequently written. Together, these optimizations let OLAP systems efficiently perform extensive aggregations, summaries, and multi-dimensional analysis.
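The sketch below shows what such a query looks like against a tiny, hypothetical star schema (a fact_sales table joined to a dim_product dimension), again using sqlite3 only so the example stays self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical star schema: one fact table surrounded by dimension tables.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales  (sale_id INTEGER PRIMARY KEY,
                          product_id INTEGER REFERENCES dim_product(product_id),
                          sale_date TEXT,
                          revenue REAL);
INSERT INTO dim_product VALUES (1, 'books'), (2, 'electronics');
INSERT INTO fact_sales  VALUES (1, 1, '2024-01-05', 20.0),
                               (2, 2, '2024-01-06', 300.0),
                               (3, 2, '2024-02-01', 150.0);
""")

# A typical OLAP-style query: join the fact table to a dimension,
# then aggregate revenue by category and month.
rows = conn.execute("""
    SELECT p.category,
           substr(f.sale_date, 1, 7) AS month,
           SUM(f.revenue)            AS total_revenue
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.category, month
    ORDER BY p.category, month
""").fetchall()
print(rows)
```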
To support this new need, data warehouses were introduced. Data was cleaned, transformed, and structured before being loaded (the classic ETL pattern), and techniques such as indexing, partitioning, and columnar storage were employed to optimize queries. This became a paradise for data analysts.
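As a rough illustration of that cleaning-and-loading step, here is a sketch using pandas and Parquet (a columnar format); the raw orders data, the warehouse/orders path, and the partitioning column are all assumptions made up for this example.

```python
import pandas as pd

# Hypothetical raw OLTP extract: a handful of messy order rows.
raw = pd.DataFrame({
    "order_id": [1, 2, 3],
    "country":  [" us", "DE ", "us"],
    "amount":   ["19.99", "250.00", "5.50"],
    "order_ts": ["2024-01-05 10:00", "2024-01-06 12:30", "2024-02-01 09:15"],
})

# Transform: clean values and types before loading (the "T" in ETL).
clean = raw.assign(
    country=raw["country"].str.strip().str.upper(),
    amount=raw["amount"].astype(float),
    order_ts=pd.to_datetime(raw["order_ts"]),
)
clean["order_month"] = clean["order_ts"].dt.to_period("M").astype(str)

# Load into columnar, partitioned storage (Parquet, pyarrow engine),
# the kind of layout warehouse-style queries benefit from.
clean.to_parquet("warehouse/orders", partition_cols=["order_month"])
```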
However, data sources and raw data soon became more diverse—not just OLTP data. Information began flowing in from social media, log files, and sensor data. The days of easily manageable data were long gone!
The big data era had begun, and data warehouses struggled to handle raw, unstructured data and the massive volumes being generated. Traditional data warehouses were costly in terms of storage and compute resources. They required predefined schemas, making them inflexible when data structures evolved. ETL pipelines for data warehouses were time-consuming, and the overhead cost businesses a fortune. And what about real-time data? Traditional data warehouses simply weren’t designed for that.
Then came data lakes to the rescue. However, they quickly became disorganized. Managing access control, compliance, and security became increasingly difficult. The biggest challenge of all was preventing data lakes from turning into data swamps due to a lack of proper data governance. Later, technologies like Delta Lake emerged to introduce structure, reliability, and ACID transactions to data lakes, addressing many of these challenges.
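As a rough sketch of how that structure is added, the example below uses the deltalake Python package (the delta-rs bindings); the lake/events path and the events schema are hypothetical. Delta Lake on Spark exposes the same idea through df.write.format("delta").

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Hypothetical raw events landing in a lake path; the schema is made up.
events = pd.DataFrame({
    "event_id": [1, 2, 3],
    "source":   ["web", "sensor", "web"],
    "payload":  ['{"x": 1}', '{"temp": 21.5}', '{"x": 2}'],
})

# Each write is an ACID commit that produces a new table version.
write_deltalake("lake/events", events)                 # version 0
write_deltalake("lake/events", events, mode="append")  # version 1

table = DeltaTable("lake/events")
print(table.version())         # 1
print(len(table.to_pandas()))  # 6 rows: both commits are visible in the latest snapshot
```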