Storage and Retrieval

Turning Persistent Data into Accessible Knowledge

Databases exist because information is valuable only if it can be stored and retrieved when needed. As organizations grow, the volume of information increases, and simply writing data to disk is no longer sufficient. The challenge is no longer merely preserving information, but finding the right information quickly enough to support decisions and operations.

The simplest way to store data is to append every change to a log. This approach is simple and efficient for writing because new information is always added to the end. However, retrieving a particular record becomes increasingly expensive because the entire history must be scanned to locate it. As data grows, this linear search becomes impractical.

To make retrieval faster, databases introduce indexes. An index is an additional structure derived from the underlying data that acts like a table of contents, allowing records to be located without scanning everything. Indexes improve read performance dramatically, but they introduce an important trade-off. Every write operation must now update both the data and its indexes, making writes slower. There is no free performance; improving reads always comes at some cost to writes.

Because different applications have different access patterns, storage engines have evolved around different optimization goals. Transactional systems, which support applications such as banking, e-commerce, and social media, prioritize fast responses for many small reads and writes. Analytical systems, which support reporting and business intelligence, prioritize scanning enormous amounts of data efficiently.

Within transactional systems, two major design philosophies emerged. One approach is to update data in place using structures such as B-Trees. B-Trees organize data into fixed-size pages and maintain ordered indexes that provide predictable and efficient reads. Since modifying pages directly risks corruption during failures, these systems rely on write-ahead logs to ensure durability. Their strength lies in delivering consistent and low-latency query performance.

The alternative approach is based on logs and immutable files. Log-Structured Merge Trees first collect writes in memory and periodically flush them to disk as sorted files. Over time, these files are merged and obsolete versions are removed. Because writes are largely sequential, LSM-Trees achieve extremely high write throughput. Structures such as Bloom filters help avoid unnecessary disk lookups and improve read efficiency. In essence, B-Trees continuously maintain order, whereas LSM-Trees postpone the work of organization and perform it later through merging.

As workloads shifted from transactions to analytics, another challenge emerged. Analytical queries often examine billions of rows but only a few columns. Storing entire rows together wastes resources because most of the data being read is irrelevant. Column-oriented storage addresses this problem by storing each column separately. This allows queries to read only the required attributes, significantly reducing disk access.

Columnar layouts also enable powerful compression because values within the same column are often similar. Techniques such as bitmap encoding and run-length encoding reduce storage requirements and improve performance. Modern analytical engines further accelerate processing by operating on batches of values or compiling queries into machine code, minimizing computational overhead.

Not all information is one-dimensional. Some problems involve multiple dimensions or more complex notions of similarity. Geospatial applications require searching by latitude and longitude simultaneously, leading to multidimensional indexes such as R-Trees. Text retrieval systems focus on keywords and employ inverted indexes to locate documents efficiently. More recently, semantic search systems have emerged, where documents are represented as vector embeddings and similarity is determined by distance in high-dimensional space. Specialized vector indexes, such as HNSW, allow machines to search for meaning rather than exact words.

At the same time, declining memory costs have made in-memory databases increasingly practical. By keeping data structures directly in RAM, these systems avoid the overhead of repeatedly transforming memory structures into disk-oriented formats. Their performance advantage comes not only from eliminating disk access but also from eliminating unnecessary encoding and decoding.

These developments reveal that storage and retrieval are not solved by a single universal mechanism. Every storage engine embodies trade-offs between read speed, write speed, compression, scalability, and complexity. Transactional systems and analytical systems optimize for fundamentally different objectives, and specialized indexes emerge whenever new kinds of queries become important.

Ultimately, databases are not merely repositories of information. They are mechanisms for balancing competing demands so that information remains useful at scale. Understanding the principles behind logs, indexes, storage engines, and specialized retrieval methods allows engineers to choose technologies that align with the needs of their applications rather than treating all databases as interchangeable tools.

In the end, storage and retrieval are about much more than saving bytes. They are about transforming persistent data into accessible knowledge. The evolution from append-only logs to indexes, B-Trees, LSM-Trees, columnar stores, and vector databases reflects a deeper truth: as information grows, the ability to retrieve the right information efficiently becomes just as important as the ability to store it in the first place.

Checkout my new book here: https://ankit-rathi.github.io/store/