Chapter 16: Disk Storage, Basic File Structures, Hashing, and Modern Storage Architectures

Advanced Database Management Systems - Final Term Elite Preparation

1. Introduction & Storage Hierarchy

Databases are typically massive and must be stored efficiently on magnetic disks or SSDs. The DBMS accesses this data using underlying physical database file structures.

Data Types Based on Lifespan

The Storage Hierarchy

Storage is organized as a hierarchy with a consistent trade-off: moving down the hierarchy, capacity increases and cost per byte decreases, but access time increases sharply.

  1. Primary Storage: CPU cache memory (SRAM) and main memory (DRAM). Fastest, volatile, and the most expensive per byte.
  2. Secondary (Mass) Storage: Magnetic disks (HDDs), Flash memory, Solid-state drives (SSDs), CD-ROMs, DVDs. Non-volatile, affordable mass storage.
  3. Tertiary Storage: Removable media, Tape Jukeboxes. Slowest, used for massive offline archives.

2. Secondary Storage Devices

Secondary storage dictates how records are physically placed and accessed.

Hard Disk Drives (HDD)

Solid State Devices (SSD / Flash Storage)

Magnetic Tape

3. Buffering & Efficient Data Access

Buffering minimizes the time the CPU spends waiting for slow disk I/O.

Double Buffering

A technique for reading a continuous stream of blocks. While the CPU processes the data in Buffer A, the disk I/O controller fills Buffer B in parallel. When the CPU finishes with A, the two buffers swap roles, so block transfer overlaps with processing instead of alternating with it.
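The buffer swap described above can be sketched in Python. This is a minimal illustration, not a DBMS implementation: the function name `double_buffered_sum` is hypothetical, a background thread stands in for the disk I/O controller, and summing a block stands in for CPU processing.

```python
import threading

def double_buffered_sum(blocks):
    """Process a stream of blocks using exactly two buffers (slots 0 and 1)."""
    buffers = [None, None]
    empty = [threading.Semaphore(1), threading.Semaphore(1)]  # buffer may be filled
    full = [threading.Semaphore(0), threading.Semaphore(0)]   # buffer may be processed

    def controller():
        # Stand-in for the disk I/O controller: fills the buffers alternately.
        for i, block in enumerate(blocks):
            slot = i % 2
            empty[slot].acquire()          # wait until the CPU has released this buffer
            buffers[slot] = list(block)    # the "disk read" into the buffer
            full[slot].release()
        slot = len(blocks) % 2
        empty[slot].acquire()
        buffers[slot] = None               # end-of-stream marker
        full[slot].release()

    reader = threading.Thread(target=controller)
    reader.start()

    results = []
    i = 0
    while True:
        slot = i % 2
        full[slot].acquire()               # wait until this buffer has been filled
        data = buffers[slot]
        if data is None:
            break
        results.append(sum(data))          # stand-in for CPU work on the block
        empty[slot].release()              # hand the buffer back for refilling
        i += 1
    reader.join()
    return results

totals = double_buffered_sum([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

While the main thread works on one slot, the controller thread is free to fill the other, which is the overlap that double buffering relies on.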

Buffer Management Metadata

Buffer Replacement Strategies
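As one illustration of a replacement strategy, the sketch below implements LRU (Least Recently Used), which Scenario 3 of this chapter contrasts with FIFO. The class name `LRUBufferPool` and the placeholder block contents are assumptions made for the example; a real buffer manager also tracks pin counts and dirty bits.

```python
from collections import OrderedDict

class LRUBufferPool:
    """Toy buffer pool that evicts the least recently used block."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.frames = OrderedDict()   # block_id -> contents, least recently used first
        self.disk_reads = 0

    def get(self, block_id):
        if block_id in self.frames:
            self.frames.move_to_end(block_id)    # mark as most recently used
            return self.frames[block_id]
        self.disk_reads += 1                     # miss: the block must come from disk
        if len(self.frames) >= self.capacity:
            self.frames.popitem(last=False)      # evict the least recently used block
        self.frames[block_id] = f"contents of block {block_id}"
        return self.frames[block_id]

pool = LRUBufferPool(capacity=2)
for block_id in [1, 2, 1, 3, 1]:
    pool.get(block_id)
```

For this reference string LRU performs three disk reads, because block 1 is kept when block 3 arrives. A FIFO policy would evict block 1 at that point (it was loaded first) and would need a fourth read.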

🔍 Micro-Detail: Techniques for Efficient Disk Access
Beyond standard buffering, modern systems speed up access via:
  • Read-ahead: Predicting and fetching data blocks before the CPU explicitly requests them.
  • I/O Scheduling: Reordering disk requests to minimize the physical movement of the disk head.
  • Log Disks: Dedicating a disk to pending writes, which are appended sequentially so that almost no seek time is spent on them.
  • Flash for Recovery: Utilizing non-volatile SSDs to securely hold logs for rapid crash recovery.
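The I/O scheduling idea in the list above can be illustrated with an elevator-style ordering. This is a sketch under stated assumptions: requests are plain cylinder numbers, the head first sweeps toward higher cylinders and then reverses (the LOOK variant of the elevator algorithm), and the function names and request list are made up for the example.

```python
def elevator_order(requests, head):
    """Serve requests at or above the head in ascending order, then the rest descending."""
    upward = sorted(c for c in requests if c >= head)
    downward = sorted((c for c in requests if c < head), reverse=True)
    return upward + downward

def head_travel(order, head):
    """Total number of cylinders the head crosses when serving requests in this order."""
    travel = 0
    for cylinder in order:
        travel += abs(cylinder - head)
        head = cylinder
    return travel

pending = [98, 183, 37, 122, 14, 124, 65, 67]
start = 53
fifo_travel = head_travel(pending, start)        # serve in arrival order
scan_order = elevator_order(pending, start)
scan_travel = head_travel(scan_order, start)     # serve in elevator order
```

For this request list the head crosses 640 cylinders in arrival order and 299 in elevator order.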

4. Placing File Records on Disk

🔍 Micro-Detail: Supported Data Types
Records consist of fields. Common types include Numeric, String, Boolean, and Date/time. For unstructured objects (like images, audio, or large text documents), databases use BLOBs (Binary Large Objects).

Variable-Length Records

Records are often not uniform in size. Reasons for variable-length records include:

Record Blocking: Spanned vs. Unspanned

📈 Formula: Blocking Factor (bfr)

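The number of records that fit in one disk block.

bfr = ⌊ B / R ⌋

Where B = block size in bytes and R = record size in bytes (fixed-length records, B ≥ R). With unspanned records the result is rounded down, which leaves B − (bfr × R) bytes unused in each block.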
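A short worked example of the formula, with illustrative numbers that are not from the chapter (512-byte blocks, 100-byte records, 1,000 records):

```python
import math

def blocking_factor(block_size, record_size):
    """bfr = floor(B / R) for fixed-length, unspanned records."""
    return block_size // record_size

B, R, r = 512, 100, 1000              # block size, record size, number of records
bfr = blocking_factor(B, R)           # records per block
unused = B - bfr * R                  # bytes left over in each block
blocks_needed = math.ceil(r / bfr)    # blocks required to hold the whole file
```

Here bfr is 5, each block leaves 12 bytes unused, and the file occupies 200 blocks.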

Allocating File Blocks on Disk

5. Basic File Organizations

🔍 Micro-Detail: Exact File Operations
  • Retrieval (No data change): Open, Find, Read, FindNext, Scan, Close.
  • Update (Changes data): Insert, Delete, Modify.

1. Heap (Pile) Files - Unordered

2. Sorted (Sequential) Files - Ordered

⏱ Average Access Times (where 'b' is total number of blocks):
File Organization | Search Method | Average Access Time (block accesses)
Heap (Unordered)  | Linear Search | b / 2
Ordered           | Linear Scan   | b / 2
Ordered           | Binary Search | log₂ b

🔍 Micro-Detail: Other Primary File Organizations
Aside from Heap and Sorted files, modern databases might use:
  • Files of Mixed Records: Store records of different types in one file, so that relationships can be implemented through logical field references or by physically clustering related records.
  • B-Tree Data Structures: The industry-standard tree index for keeping data sorted and allowing fast searches/insertions.
  • Column-Based Storage: Stores data column-by-column rather than row-by-row, which speeds up read-heavy analytical queries that touch only a few columns.
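The access times given earlier in this section for heap and ordered files can be checked by counting block accesses directly. The sketch below is an illustration only: the file is modelled as a Python list of sorted blocks, and the function names and the 1,024-block layout are assumptions.

```python
def linear_search_blocks(blocks, key):
    """Scan blocks in order; return (found, number of blocks read)."""
    for accesses, block in enumerate(blocks, start=1):
        if key in block:
            return True, accesses
    return False, len(blocks)

def binary_search_blocks(blocks, key):
    """Binary search over an ordered file; return (found, number of blocks read)."""
    lo, hi, accesses = 0, len(blocks) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]            # one block access
        accesses += 1
        if key < block[0]:
            hi = mid - 1
        elif key > block[-1]:
            lo = mid + 1
        else:
            return key in block, accesses
    return False, accesses

# 1,024 blocks of 4 keys each: block i holds keys 4i .. 4i+3
blocks = [[4 * i + j for j in range(4)] for i in range(1024)]
found_lin, cost_lin = linear_search_blocks(blocks, 2050)
found_bin, cost_bin = binary_search_blocks(blocks, 2050)
```

The key sits in the middle of the file, so the linear search reads 513 blocks (about b / 2), while the binary search reads 10 (log₂ 1024).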

6. Hashing Techniques

Hashing is used when records are accessed by the value of a single field (the hash field, typically the primary key) and the search condition is an equality condition (=) on that field.

Dynamic File Expansion

Static hashing allocates a fixed number of buckets, so collisions and overflow chains multiply as the file grows. Schemes that let the file expand dynamically include Extendible Hashing and Linear Hashing.

7. Parallelizing Disk Access Using RAID

RAID (Redundant Array of Independent Disks) aims to improve disk performance and system reliability by spreading data, and usually redundant information, across several disks.

Core Mechanics

🚨 Exam Focus: Master the RAID Levels
RAID Level   | Architecture & Mechanism | Use Case / Benefit
Level 0      | Data striping across disks. No redundant data. | Maximum speed, zero fault tolerance.
Level 1      | Mirroring. Exact copies on two disks. | High reliability; rebuilding is easiest.
Level 2      | Memory-style redundancy using Hamming codes. | Error detection and correction.
Level 3      | Bit-interleaved striping with a single parity disk. | Relies on the disk controller to identify the failed disk.
Levels 4 & 5 | Block-level striping. RAID 4 keeps parity on one dedicated disk; RAID 5 distributes parity across all disks. | Widely used; preferred for large-volume storage.
Level 6      | P+Q dual redundancy scheme. | Protects against up to two simultaneous disk failures.
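The parity used by Levels 3 to 5 is a bitwise XOR across the data blocks of a stripe, which is why any one lost block can be rebuilt from the others. A minimal sketch, assuming three equally sized data blocks with made-up contents and a hypothetical helper `xor_blocks`:

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"ABCD", b"EFGH", b"IJKL"]    # one stripe spread over three data disks
parity = xor_blocks(data)             # stored on the parity disk

# The disk holding data[1] fails: rebuild its block from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
```

The rebuilt block equals the lost one because XOR-ing a value with itself cancels out, leaving only the missing block.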

8. Modern Storage Architectures

🔥 Core Theory Q&A Preparation

Ensure you can answer these clearly. They validate your foundational knowledge.

Concept: Buffer Management

Q: Explain the exact mechanism and purpose of Double Buffering.

A: Double buffering overlaps CPU processing with disk I/O. With two independent memory buffers (A and B), the CPU processes the data in Buffer A while the disk I/O controller, working in parallel, transfers the next block from disk into Buffer B. When the CPU finishes with A it switches to B, and A becomes the target of the next transfer. As long as processing a block takes at least as long as reading the next one, the CPU does not sit idle waiting for mechanical disk reads.

Concept: Record Organization

Q: When is 'Spanned' record organization mandatory, and what is its primary drawback?

A: Spanned organization is mandatory when a single record is larger than a disk block, because such a record cannot be stored within one block. The drawback is increased access time: reading one spanned record requires at least two block accesses, roughly doubling the I/O cost for that record compared with a record that fits in one block.

Concept: Dynamic Expansion

Q: How does Linear Hashing differ from Extendible Hashing in handling database growth?

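A: Extendible Hashing maintains a directory: an array of 2^d bucket pointers indexed by d bits of the hash value, where d is the global depth. An overflowing bucket is split, and the directory doubles only when that bucket's local depth already equals the global depth. Linear Hashing needs no directory. It splits buckets one at a time in a fixed linear order (bucket 0, then bucket 1, and so on) and uses the next hash function in its sequence for buckets that have already been split. A split is typically triggered when a bucket overflows or when the file's load factor exceeds a threshold.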
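The directory-free growth of Linear Hashing can be sketched as follows. This is a simplified model, assuming non-negative integer keys, hash functions of the form key mod (M × 2^i), and a load-factor trigger; Python lists stand in for buckets and their overflow chains, and the class name `LinearHashFile` is hypothetical.

```python
class LinearHashFile:
    def __init__(self, initial_buckets=2, bucket_capacity=2, max_load=0.75):
        self.m = initial_buckets          # M: number of buckets at level 0
        self.capacity = bucket_capacity   # records per bucket, used for the load factor
        self.max_load = max_load
        self.level = 0                    # i: completed doubling rounds
        self.next = 0                     # n: next bucket to be split
        self.buckets = [[] for _ in range(initial_buckets)]
        self.count = 0

    def _address(self, key):
        address = key % (self.m * 2 ** self.level)
        if address < self.next:           # bucket already split in this round
            address = key % (self.m * 2 ** (self.level + 1))
        return address

    def insert(self, key):
        self.buckets[self._address(key)].append(key)
        self.count += 1
        if self.count / (len(self.buckets) * self.capacity) > self.max_load:
            self._split()

    def _split(self):
        old = self.buckets[self.next]
        self.buckets[self.next] = []
        self.buckets.append([])           # new bucket at index n + M * 2^i
        new_mod = self.m * 2 ** (self.level + 1)
        for key in old:                   # redistribute with the next hash function
            self.buckets[key % new_mod].append(key)
        self.next += 1
        if self.next == self.m * 2 ** self.level:   # every bucket of this round is split
            self.level += 1
            self.next = 0

    def search(self, key):
        return key in self.buckets[self._address(key)]

hash_file = LinearHashFile()
for key in range(20):
    hash_file.insert(key)
```

The only state needed to locate a record is the pair (level, next), which is why no directory is required.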

Concept: Collision Resolution

Q: What is a Hash Collision, and what are the three primary ways to resolve it?

A: A collision occurs when the hash function maps a new record to an address that is already occupied by a different record (with multi-record buckets, this becomes a problem only once the bucket is full). It is resolved by:
1. Open Addressing: Probing subsequent positions in order, starting from the occupied address, until an unused position is found.
2. Chaining: Placing the record in an overflow location and linking it to its home address through a pointer chain.
3. Multiple Hashing: Applying a second, different hash function when the first one produces a collision.
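The first two techniques can be compared in a few lines. This is a sketch under simple assumptions: integer keys, the hash function key mod 7, one record per address for open addressing, and Python lists standing in for overflow chains. The function names are hypothetical.

```python
def insert_open_addressing(table, key):
    """Linear probing: try the home slot, then each following slot, wrapping around."""
    m = len(table)
    home = key % m
    for step in range(m):
        slot = (home + step) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash file is full")

def insert_chaining(table, key):
    """Chaining: every address keeps a list that stands in for its overflow chain."""
    slot = key % len(table)
    table[slot].append(key)
    return slot

# 10, 17 and 24 all hash to address 3 (mod 7), so the last two collide.
open_table = [None] * 7
slots = [insert_open_addressing(open_table, k) for k in (10, 17, 24)]

chained = [[] for _ in range(7)]
for k in (10, 17, 24):
    insert_chaining(chained, k)
```

With open addressing the colliding keys spill into addresses 4 and 5; with chaining they all stay reachable from address 3.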

🏆 10-Mark Scenario Questions

These complex scenarios require synthesizing multiple concepts. They mirror exactly what you will face in a high-weight university final exam.

Scenario 1: System Architecture

You are the DBA for a major hospital. The system requires storing millions of patient text records and massive MRI image files. The system must operate 24/7, cannot lose data, and must read data quickly. Hardware cost is a secondary concern. Design the physical storage layer specifying: (a) RAID Level, (b) Data Types, and (c) Storage Architecture. Justify each.

Elite Answer Formulation:

  • (a) RAID Level: I will implement RAID Level 1 or RAID Level 6. Since the hospital cannot afford data loss and cost is secondary, RAID 1 (Mirroring) provides the easiest and fastest rebuilding of data. Alternatively, RAID 6 provides dual P+Q parity, ensuring the system survives up to two simultaneous disk failures.
  • (b) Data Types: The standard patient text records will use standard Numeric, String, and Date types. However, the MRI images will explicitly be stored using BLOBs (Binary Large Objects), as they are massive unstructured data objects.
  • (c) Storage Architecture: I will use a combination of SAN (Storage Area Network) for high-speed, block-level transaction processing of patient vitals, coupled with Object-based Storage. Object-based storage is perfectly suited for highly scalable, unstructured data like MRI BLOBs, as it manages them via metadata and global identifiers rather than rigid file blocks.

Scenario 2: File Organization

An E-commerce platform has a massive 'Products' table. 95% of queries are exact-match searches based on the ProductID (e.g., SELECT * WHERE ProductID = 104). The product list grows by 10,000 items daily. Should the database store this table as a Heap file, a Sorted File, or using Hashing? Which specific hashing technique, if any? Justify your choices.

Elite Answer Formulation:

  • Rejection of Heap & Sorted: A Heap file is rejected because a linear search costs b/2 block accesses on average, which is far too slow for a table of this size. A Sorted file provides fast binary search (log₂ b), but inserting 10,000 new items daily would require constant, highly expensive block rewriting to maintain physical sort order.
  • Optimal Choice: Hashing. Hashing provides ultra-fast access time (usually 1 disk access) when the search condition is an equality condition on a key field (which matches the exact-match ProductID query requirement).
  • Specific Technique: Because the file expands rapidly (10,000 items daily), Static Hashing would degrade as collisions and overflow chains accumulate in its fixed set of buckets. I will implement Extendible Hashing or Linear Hashing. Linear Hashing is a good fit here, as it lets the hash file add buckets gradually without the overhead of maintaining a directory.

Scenario 3: Buffer & I/O Bottlenecks

A financial database is experiencing severe latency. The CPU is frequently idling, waiting for disk reads. Upon inspection, the system is using single buffering, a FIFO replacement strategy, and contiguous block allocation. Propose four specific, distinct architectural changes to optimize this I/O bottleneck.

Elite Answer Formulation:

  1. Upgrade to Double Buffering: Switch from single to double buffering so the disk controller can pre-load Buffer B while the CPU processes Buffer A, overlapping I/O with processing and sharply reducing CPU idle time.
  2. Change Buffer Replacement to LRU or Clock: FIFO is inefficient for databases because it evicts blocks strictly by how long they have been in the buffer, so a heavily used index block can be evicted simply because it was loaded early. LRU (Least Recently Used) evicts the block whose last access is oldest, so frequently accessed financial data tends to stay in RAM; Clock approximates LRU with less bookkeeping.
  3. Implement Read-Ahead (Pre-fetching): Configure the hardware controller to read data ahead of explicit requests, anticipating sequential financial reporting queries.
  4. Migrate to Indexed Block Allocation: Contiguous allocation makes file expansion difficult: as financial records are inserted, there may be no free space adjacent to the file, forcing costly reorganization. Indexed allocation uses one or more index blocks that point to the data blocks, so the file can grow into any free block, at the cost of extra index accesses and less sequential reads.

Scenario 4: Storage Tiering & Record Blocking

A social media app stores "Posts". A post can be a short 10-byte text, or a massive 15-Megabyte article with inline media. Currently, the system stores all posts on expensive SSDs using "Unspanned" record blocking. The CFO complains about storage costs, and the DBA complains about wasted disk space. How do you resolve both issues?

Elite Answer Formulation:

  • Resolving Wasted Disk Space (Blocking Strategy): The system must switch from Unspanned to Spanned Records. Post sizes are highly variable (10 bytes to 15 MB), and unspanned blocking forbids a record from crossing a block boundary. A record that does not fit in the space remaining in a block must start in the next block, leaving the remainder unused (internal fragmentation), and a 15 MB post cannot fit in one block at all unless blocks are made impractically large. Spanned records let a post continue across block boundaries, with a pointer at the end of each block to the block holding the remainder, so nearly all block space is used.
  • Resolving Storage Costs (Architecture Strategy): I will implement Automated Storage Tiering. It is a waste of money to store 5-year-old social media posts on expensive Flash/SSDs. Automated tiering will keep "Hot" (recent, frequently accessed) posts on the Flash-based SSDs, and seamlessly migrate "Cold" (old, rarely viewed) data down to cheaper, slower HDDs or Tape archives without manual intervention.
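The space argument for spanned records in Scenario 4 can be checked with a small calculation. This is a sketch under simplifying assumptions: record sizes and the 512-byte block are made up, records are placed in arrival order, and the pointer that each spanned block needs is ignored. The function names are hypothetical.

```python
import math

def blocks_unspanned(record_sizes, block_size):
    """Records never cross a block boundary; a record that does not fit starts a new block."""
    blocks, free = 0, 0
    for size in record_sizes:
        if size > block_size:
            raise ValueError("record larger than a block cannot be stored unspanned")
        if size > free:
            blocks += 1              # open a new block; leftover space in the old one is lost
            free = block_size
        free -= size
    return blocks

def blocks_spanned(record_sizes, block_size):
    """Records may continue in the next block, so only the total size matters."""
    return math.ceil(sum(record_sizes) / block_size)

sizes = [300, 300, 300, 300]
unspanned = blocks_unspanned(sizes, 512)
spanned = blocks_spanned(sizes, 512)
```

Four 300-byte records need four blocks unspanned (212 bytes are wasted in each) but only three blocks spanned.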