Database Indexing under the hood

Nader Medhat
Towards Dev
Published in
14 min readMar 13, 2023

--

Database indexing is a technique to improve the speed and efficiency of database queries. It creates a separate data structure that maps the values in one or more columns of a table to their corresponding physical storage locations, allowing the database to quickly locate the rows that match a particular query without having to scan the entire table. Different types of indexes can be used, but they can take up space and must be updated when data is modified. It is important to carefully consider the indexing strategy for a database and optimize it regularly.

How Databases create indexes

https://www.guru99.com/indexes.html

Database indexing is typically created by using a specific algorithm that determines how the index will be constructed and stored. The exact process for creating an index can vary depending on the type of database system being used, but the basic steps involved are generally as follows:

  1. Identify the column or columns in the database table that will be indexed. This is typically based on the columns that are most frequently used in queries or searches.
  2. Choose an indexing algorithm that is appropriate for the type of data being indexed. For example, B-tree indexes are commonly used for indexing string or numeric data, while full-text indexes are used for indexing text data.
  3. Apply the indexing algorithm to the selected columns, which creates a data structure that maps the column values to the locations of the corresponding records in the table.
  4. Store the index in a separate data structure, typically on a different part of the disk or in memory, so that it can be accessed more efficiently than the underlying table data.
  5. Update the index whenever new records are added, deleted, or modified in the table.

Creating an index can significantly improve the performance of database queries and searches, as it allows the database system to locate the relevant records more quickly and efficiently. However, indexing can also have some drawbacks, such as increased storage requirements and slower performance for insert and update operations, so it’s important to carefully consider the trade-offs before creating an index.

Indexing algorithms

as described above there are a lot of indexing algorithms used to optimize the speed of data retrieval operations by creating indexes on the columns of tables. Here are some of the commonly used database indexing algorithms:

  • B-Tree
  • Bitmap Index
  • Hash Index
  • GiST (Generalized Search Tree)
  • Full-Text Index

Each indexing algorithm has its strengths and weaknesses let’s describe each one of them.

B-Tree

https://www.javatpoint.com/b-tree

Defentaion

B-Tree is a self-balancing tree data structure that is commonly used as an indexing algorithm in databases. Each node in the tree comprises a set of keys and pointers to child nodes, storing data in a hierarchical structure. The B-nodes trees are set up in a way that makes it possible to search, insert, and delete data quickly.

The biggest benefit of the B-Tree algorithm is minimizing the number of disk I/O operations required to access data, and that’s because in a B-Tree, all leaf nodes are at the same level, and each node can hold multiple keys and pointers. The number of keys and pointers that can be stored in a node is determined by a parameter called the “order” of the tree.

How it works

The B-Tree algorithm works as follows:

  1. Initialization: An empty root node is created when a B-Tree is built. A parameter that sets the maximum number of keys “order” that can be stored in each node controls the B-order. Tree’s
  2. Insertion: When a new key is added to the B-Tree, the algorithm first looks for the proper leaf node to put the key into. The B-Tree divides a leaf node that is full into two new nodes and moves the median key to the parent node. Until the root node is reached, the splitting process might propagate up the tree. By following this procedure, the tree is kept balanced and its leaf nodes are all at the same height.
  3. Deletion: When a key is removed from a B-Tree, the algorithm looks for the node that originally held the key. If a leaf node contains the key, the key is taken out and the node may need to be rebalanced. The algorithm deletes the predecessor or successor from the leaf node after replacing the key with it if the key is discovered in a non-leaf node.
  4. Search: While looking for a key in a B-Tree, the algorithm starts at the root node and moves recursively up the tree until it finds the right leaf node. The search method compares the search key to the keys contained in each node and then uses the proper pointer to navigate to the child node where the key might be. This process continues until the search key is discovered or it is determined that the key is not present in the tree.

but B-trees have some drawbacks:

  • Overhead: B-trees have a significant amount of overhead in terms of space usage becue Each node in a B-tree contains a pointer to its parent and child node.
  • Complexity: The algorithms used for inserting, deleting, and searching data in a B-tree are relatively complex compared to other data structures, This complexity makes it difficult to implement and maintain B-trees.
  • Slow updates: Updating data in a B-tree can be relatively slow compared to other data structures. Each update operation requires a number of disk accesses, which can be slow for large B-trees.

Bitmap indexing

https://wiki.scn.sap.com/wiki/display/EIM/Bitmap+Indexes

Definition

Bitmap indexing is a data indexing technique that uses bitmaps to represent the existence or non-existence of a value in a table. It is a successful indexing approach for tables with low cardinality columns, where the number of distinct values in a column is quite modest in relation to the total number of rows.

Bitmap indexing can be very efficient for low cardinality columns because the bitmaps are very compact and can be quickly scanned to retrieve data. Bitmap indexes are very beneficial for data warehouse applications where it is necessary to swiftly scan enormous amounts of data. They are also helpful for databases that see a lot of read-only activity but not a lot of updates or insertions.

How it works

we will describe how the bitmap index works in more detail.

  • To create a bitmap index for a column, a separate bitmap is created for each distinct value in the column. Each bitmap has a length equal to the number of rows in the table.
  • If the value exists in a row, the corresponding bit in the bitmap is set to 1, and if it does not exist, the bit is set to 0 (Think of a table where the “Gender” column has the two separate values “Male” and “Female,” for instance. Two bitmaps would be produced if this column had a bitmap index, with each having a length equal to the number of rows in the table. When a “Male” or “Female” appears in a row, the appropriate bit in the “Male” or “Female” bitmap is set to 1, and vice versa. he corresponding bits in both bitmaps are set to 0 if a row is missing either the “Male” or “Female” value.)
  • To perform a query using a bitmap index, the bitmaps corresponding to the values in the query are combined using bitwise operators such as AND, OR, and NOT. (e.g if we want to find all rows where “Gender” is “Male” AND “Age” is greater than 30, we would first retrieve the “Male” bitmap and the “Age > 30” bitmap from their respective indexes. We would then combine the two bitmaps using the bitwise AND operator to get a final bitmap with 1s only in the positions where both conditions are true. The final bitmap is then used to retrieve the rows from the table that satisfy the query)

Bitmap indexes have several drawbacks, which include:

  • Large size: Bitmap indexes can be large, especially when dealing with large datasets. This can make them less efficient than other indexing methods.
  • High cardinality columns: Bitmap indexes are not efficient for high-cardinality columns, where the number of distinct values is very high. In such cases, the bitmap indexes can become very large and may not fit in memory.
  • Skewed data distribution: If the data is skewed, where a few values have a much higher frequency than others, bitmap indexes may not be efficient. This is because the bitmaps for the most frequent values can become very large and may dominate the index.

Hash Index

https://stackoverflow.com/questions/29436034/hash-index-vs-inverted-index

Definition

A hash index is a type of database indexing technique that uses a hash function to map the index keys to the locations of the corresponding data records. It is a fast indexing method for exact match queries on a single column.

the Hash index is a fast indexing technique for exact match queries on a single column. It uses a hash function to map the index keys to the locations of the corresponding data records and allows for fast lookups and inserts in constant time O(1). However, it does not work well with range queries or partial matches and may suffer from collisions, which can be handled using various collision resolution techniques.

How it works

To explain how a hash index works, let’s take an example. Suppose we have a database table that contains information about employees, including their user ID numbers. We want to create a hash index on the user ID column to allow for fast lookups of user data based on their ID number.

1 — we would create a hash function that takes a user ID as input and generates a unique hash code as output. The hash function should be designed to generate a uniformly distributed set of hash codes so that the records are evenly distributed across the buckets in the index file. In practice, the hash function may use different techniques such as modulo arithmetic or bitwise operations to generate the hash code.

2 — we would create a hash index file that consists of a set of buckets, each of which corresponds to a unique hash code generated by the hash function. Each bucket contains a pointer to the database file that contains the records with that hash code.

3 — When a query is performed, the hash function is applied to the query value to generate the hash code. The hash code is then used to locate the corresponding bucket in the hash index file. The records with the same hash code are stored in the same bucket, so we can simply scan the records in that bucket to find the matching record(s). If there are collisions (i.e., multiple records with the same hash code), we can use techniques such as chaining or open addressing to resolve them.

4 — To insert a new record into the hash index, we apply the hash function to the record’s key value to generate the hash code, and then insert the record into the appropriate bucket in the hash index file. If there are no collisions, the insertion can be performed in constant time O(1), as we only need to compute the hash code and insert the record into the bucket. If there are collisions, we may need to perform additional operations such as inserting the record into a linked list in the bucket, or probing other buckets until a free slot is found.

Hash indexes also have several drawbacks, which include:

  • Limited search capabilities: Hash indexes are designed to handle only equality searches (i.e., “find all records where column A equals a specific value”). They are not well-suited for range queries or sorting.
  • Collisions: Hash indexes can have collisions, where multiple keys map to the same hash value. This can result in degraded performance, as the database must perform additional operations to resolve the collisions.
  • Unpredictable storage requirements: The size of a hash index cannot be predicted in advance, as it depends on the number of unique values in the indexed column. This can make it difficult to plan for storage requirements.

GiST

https://awide.io/gist-index-and-siglen/

Definition

GiST (Generalized Search Tree) is a type of database indexing technique that can be used to index complex data types, such as geometric objects, text, or arrays. it’s is a balanced tree structure that consists of nodes with multiple children. Each node represents a range or a set of values and is associated with a predicate function that tests whether a given value belongs to the range or set. The predicate function is specific to the data type being indexed and can be customized for different types of data.

How it works

To illustrate how a GiST index works, let’s consider an example of indexing spatial data. Suppose we have a database table that contains information about cities, including their names and coordinates in the form of longitude and latitude.

1 — we would define a set of predicates and mapping functions that are specific to the spatial data type being indexed. In this case, we might define a predicate that tests whether a given point lies within a bounding box represented by a node in the index and a mapping function that transforms a point into a set of keys based on its position in the bounding box.

2 — we would create a GiST index file that consists of a set of nodes, each of which represents a bounding box that covers a range of coordinates.
The root node represents the entire range of coordinates in the database table, and each child node represents a subset of the range.
Each node is associated with a predicate function and a mapping function that are specific to the spatial data type being indexed.

3 — When a query is performed, the query value is transformed into a set of keys using the mapping function.
The keys are then compared with the predicates associated with each node in the index, starting from the root node.
The search proceeds down the tree by selecting the child node that contains the query value.
The process is repeated until a leaf node is reached, which contains the index entries corresponding to the query value.

4 — To insert a new city into the index, the coordinates of the city are first transformed into a set of keys using the mapping function.
The keys are then inserted into the appropriate nodes in the index, starting from the root node.
If the node is full, a split operation is performed to create two new nodes, and the keys are redistributed among the nodes.

GiST has a few drawbacks to consider:

  1. Slower insert and update performance: GiST indexing structures can be more complex than traditional indexing structures, which can lead to slower insert and update performance.
  2. More disk space: GiST indexing structures may require more disk space than other indexing techniques because they store additional information to support various types of searches.
  3. Not suitable for all data types: GiST is optimized for indexing complex data types such as spatial data, but may not be the best choice for indexing simpler data types such as integers or strings.
  4. Higher maintenance overhead: Due to the complex implementation, GiST indexes require more maintenance overhead than traditional indexes.

Full-Text Index

https://www.mssqltips.com/sqlservertutorial/9136/sql-server-full-text-indexes/

Definition

Full-text indexing is a database indexing technique that is used to improve the search performance of text-based queries. it’s a special type of index that is designed to handle text data. Unlike traditional indexes that store the values of individual columns, a full-text index stores the text content of one or more columns as a set of words or tokens. These words or tokens are then used to quickly locate relevant rows when a search query is performed.

Full-text indexing can greatly improve the performance of text-based search queries, especially when dealing with large amounts of text data. However, it also requires additional disk space and processing power, as well as careful tuning of the index parameters to ensure optimal performance.

How it works

The full-text indexing process involves several steps:

  1. Tokenization: The text content of the indexed column is broken down into individual words or tokens, which are then stored in the index. When a full-text index is created, the database system first analyzes the text content of the indexed columns and breaks it down into individual words or tokens. This process is called tokenization, and it may involve filtering out stop words (such as “the”, “and”, “or”) and performing stemming (reducing words to their base form).
  2. Indexing: The tokens are then indexed using a special data structure, such as a B-tree or inverted index. The index structure allows for efficient searching and retrieval of rows that contain specific tokens. Once the tokens have been generated, they are indexed using a special data structure, such as a B-tree or inverted index. The data structure allows for efficient searching and retrieval of rows that contain specific tokens.
  3. Querying: The database system then uses the full-text index to search for rows that contain the relevant tokens. The search process involves matching the query tokens against the indexed tokens and retrieving the rows that match the query. The search results can be ranked based on their relevance to the query, which is calculated using algorithms such as TF-IDF (term frequency-inverse document frequency).

Full-text indexing has some drawbacks to consider:

  1. Slower indexing and search performance: Full-text indexing can be more complex than other indexing techniques, which can result in slower indexing and search performance, especially for large databases with many text fields.
  2. Not suitable for all types of data: Full-text indexing is best suited for databases containing large amounts of textual data. It may not be the most effective indexing technique for databases that primarily contain numerical or other non-textual data.
  3. Language-dependent: Full-text indexing may not be as effective for multilingual databases, as it requires separate indexes for each language and may not be able to handle the nuances of different languages and writing systems.

Conclusion

In conclusion, database indexing is a critical technology that is used to improve the performance of database queries. It involves creating special data structures that allow for efficient searching and retrieval of data based on one or more columns in a table. Different types of indexing algorithms, such as B-trees, bitmap indexes, hash indexes, and GiST indexes, are used to optimize queries for different types of data and use cases.

here are some resources to learn more about database indexing:

  1. Use The Index, Luke!” by Markus Winand — This is a comprehensive guide to SQL database indexing, covering everything from the basics to advanced techniques.
  2. Database Indexing Explained” by DigitalOcean — This tutorial provides a beginner-friendly introduction to indexing concepts and techniques, with examples in PostgreSQL.
  3. Indexing Strategies for MySQL and MariaDB” by Severalnines — This blog post offers practical advice on designing and optimizing indexes in MySQL and MariaDB.

--

--