Index types supported in Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL (B-tree)

An index is a data structure used by database systems to speed up queries. In most traditional relational database systems, indexes play a vital role in data retrieval. There are different index types, and each type uses a different algorithm that is best suited to different types of queries. A suboptimal index strategy, for example an incorrect index type, too many indexes, or no indexing at all can lead to performance degradation in your database.

In this series of posts, we discuss index types supported in Amazon Aurora PostgreSQL-Compatible edition and Amazon Relational Database Service (Amazon RDS) for PostgreSQL and their use cases. In this post, we discuss the native B-tree index and its variations. In an upcoming Part 2, we discuss other native indexes, including GIN, GiST, HASH, and BRIN.

Before deciding on what index is needed in our database, we need to understand the access patterns and the data that needs to be retrieved using an index. Indexes come with maintenance and space overhead, and we should apply it reasonably. In the following sections, we discuss the strategies to apply when creating an index depending on your data retrieval needs and use case.

Prerequisites

To follow along, you should have an RDS for PostgreSQL or Amazon Aurora PostgreSQL-compatible database and a client machine with psql installed and a connection to the database. For more information, refer to Connecting to a DB instance running the PostgreSQL database engine.

Refer to the following script to generate the sample dataset. This script is generating a sample dataset for an employee table with various columns containing different data types like character strings, integers, dates, arrays, JSONB, etc. It’s using PostgreSQL functions like RANDOM(), REPEAT(), GENERATE_SERIES() to generate random sample data which will used throughout this post.

CREATE TABLE employee ( 
      emp_name          CHARACTER VARYING COLLATE "C"
    , emp_id            BIGINT
    , emp_salary        BIGINT
    , emp_dept           CHARACTER VARYING
    , emp_doj           DATE
    , emp_lwd           DATE
    , emp_age           INTEGER
    , emp_cities        TEXT
    , emp_address_hist  JSONB
    , emp_city_list     TEXT[]
    , emp_tenure        TSRANGE
    ); 

INSERT INTO employee
SELECT
substr(REPEAT('abexoptkhngsadl',3),((random()*(30-1)+1)::integer),5) AS emp_name
,I AS emp_id
,TRUNC(RANDOM() * 10000) AS emp_salary
,UPPER(substr(REPEAT(REVERSE('abexoptkhngsadl'),3),((random()*(30-1)+1)::integer),4)) AS emp_dept
, '1/1/2016'::DATE + ('1 YEAR'::INTERVAL * FLOOR(RANDOM()*5)) AS emp_doj
, '1/1/2020'::DATE + ('2 YEAR'::INTERVAL * FLOOR(RANDOM()*6)) AS emp_lwd
, floor(random() * (39-30+1) + 30)::int AS emp_age
,'CITY'||I||',CITYX,'||'CITY'||I%2 AS emp_cities
,TO_JSONB(ROW(I::TEXT, 'CITY'||I||',CITYX,'||'CITY'||I%2)) AS emp_address_hist
,ARRAY[CONCAT_WS('','CITY',I),'CITYX',CONCAT_WS('','CITY',I%2)] AS emp_city_list
,('[2015-'||I%12||'-'||I%30||', 2017-'||I%12||'-'||I%30||')')::TSRANGE AS emp_tenure
FROM GENERATE_SERIES(1, 5000000) AS I
WHERE i%12 BETWEEN 1 AND 13 AND i%30 BETWEEN 1 AND 28;

B-tree indexes

A balanced tree (B-tree) is a self-balancing tree data structure that maintains sorted data and allows searches. B-tree indexes can handle equality and range queries on data that can be sorted. You can use a B-tree index whenever an indexed column is involved in a filter condition with one of these operators: <, <=, =, >=, and >. The B-tree indexing method in PostgreSQL is the most commonly used index type, and is the default for many data types.

The following code shows a query explain plan without an index on the emp_id column with the equality operator:

EXPLAIN ANALYZE SELECT * FROM employee WHERE emp_id = 444;
Seq Scan on employee  (cost=0.00..1847027.15 rows=1 width=204) (actual time=16018.764..16018.765 rows=0 loops=1)
Filter: (emp_id = 444)
Rows Removed by Filter: 43333335
Planning Time: 0.059 ms
Execution Time: 16018.789 ms

(Note: Parallel Query execution is disabled (set max_parallel_workers_per_gather = 0) for this example.)

CREATE INDEX idx_emp_id ON employee(emp_id);

Now we run the same explain plan after creating an index on emp_id.

Index Scan using idx_emp_id on employee  (cost=0.56..8.58 rows=1 width=201) (actual time=1.013..1.016 rows=0 loops=1)
Index Cond: (emp_id = 444)
Planning Time: 1.001 ms
Execution Time: 3.446 ms

We can observe from the explain plan that optimizer is choosing index scan (on index over sequential scan after index is created on filter condition column (WHERE emp_id=444). Now we are getting faster execution time after index creation.

In the remainder of this post, we’ll review what other specialized B-tree indexes we can implement to get faster query execution time. We also look at scenarios where we can take leverage of b-tree index to get faster query response.

Index Only Scan

Index are pointers to actual data present on disk and are stored separately from the table’s main data area. (i.e. table heap). This means each row retrieval will require fetching data from index as well as table heap. While index entries for fetched record (EMP_ID = 444) will be placed closer on the disk, table row could be present anywhere on the table heap. The heap-access portion of an index scan thus involves a lot of random access into the heap, which can be slow. So, to make index scan further fast, if the columns used in the filter conditions are the only columns fetched in your select, then row can be fetched only using index heap area. This is referred as Index only scan.

EXPLAIN ANALYZE SELECT emp_id FROM employee WHERE emp_id = 444;

Index Only Scan using idx_emp_id on employee  (cost=0.56..4.58 rows=1 width=8) (actual time=0.032..0.032 rows=0 loops=1)
Index Cond: (emp_id = 444)
Heap Fetches: 0
Planning Time: 0.112 ms
Execution Time: 0.057 ms

Impact of indexing on range operations.

B-tree also works for queries with the WHERE clause with “BETWEEN AND” operator, which is same as >= and <=. For example, values greater than qual to 111 and less than equal to 222 can also be represented as values between 111 and 222.

The following code uses a B-tree index for BEWTEEN AND operator.

EXPLAIN ANALYZE SELECT * FROM employee WHERE emp_id between 111 and 222 ;

Index Scan using employee_emp_id_idx on employee  (cost=0.29..569.42 rows=193 width=192) (actual time=0.048..0.374 rows=194 loops=1)
Index Cond: ((emp_id >= 111) AND (emp_id <= 222))
Planning Time: 3.170 ms
Execution Time: 0.426 ms

Using B-Tree to speed up ordering

A PostgreSQL B-tree index can produce sorted output. You can also use a B-tree to return the “top N” ordered selections. You can also use an index to order by the top N rows type of SQLs. Because the B-tree returns the data in sorted order, there is no need to sort the entire dataset. This also benefits queries that use aggregates like min/max. See the following code:

EXPLAIN ANALYZE SELECT * FROM employee ORDER BY emp_id DESC LIMIT 10;

Limit(cost=0.56..1.13 rows=10 width=204) (actual time=1.956..1.959 rows=10 loops=1)
-> Index Scan Backward using idx_emp_id on employee (cost=0.56..2430589.44 rows=43331792 width=204) (actual time=1.955..1.957 rows=10 loops=1)
Planning Time: 0.371 ms
Execution Time: 2.005 ms

EXPLAIN ANALYZE select max(emp_id) from employee ;

Result  (cost=0.59..0.60 rows=1 width=8) (actual time=0.208..0.209 rows=1 loops=1)
InitPlan 1 (returns $0)
->  Limit  (cost=0.56..0.59 rows=1 width=8) (actual time=0.203..0.203 rows=1 loops=1)
->  Index Only Scan Backward using idx_emp_id on employee  (cost=0.56..1233558.56 rows=43330400 width=8) (actual time=0.201..0.201 rows=1 loops=1) Index Cond: (emp_id IS NOT NULL)
Heap Fetches: 0
Planning Time: 0.215 ms
Execution Time: 0.244 ms

Using B-tree indexes with similarity queries (LIKE)

You can use a B-tree index for queries using a pattern matching operator (LIKE OR ~ OR !~). This is only applicable when the database is created using the C collation or table column on which the index is needed should be created collated as C in create table statement. (emp_name CHARACTER VARYING COLLATE “C”). You can still use LIKE in pattern matching for non-C locale if you explicitly define it as part of the operator class: Refer to Operator Classes and Operator Families for details.

Pattern matching is for a constant and should be present at the beginning of the string—for example, queries with emp_name like ‘dub%’. The index will not be used for SQL with emp_name like ‘%dub’.

Let see example below with like clause without index.

EXPLAIN ANALYZE SELECT * FROM public.employee WHERE emp_name LIKE 'dub%';

Seq Scan on employee (cost=0.00..1846999.70 rows=1 width=201) (actual time=4247.172..4247.176 rows=0 loops=1)
Filter: ((emp_name)::text ~~ 'dub%'::text)
Rows Removed by Filter: 43333335
Planning Time: 0.108 ms
Execution Time: 4247.200 m

Let’s see an example where we have 2 employee table in different schema, one with emp_name with C collation and one without it.

SELECT table_schema, table_name, column_name, collation_name FROM information_schema.columns WHERE table_name LIKE 'emp%' AND column_name = 'emp_name';
table_schema | table_name | column_name | collation_name
--------------+------------+-------------+----------------
 test         | employee   | emp_name    |
 public       | employee   | emp_name    | C

CREATE INDEX idx_emp_name ON test.employee(emp_name);
CREATE INDEX idx_emp_name ON public.employee(emp_name);

Like query on public.employee with column (emp_name CHARACTER VARYING COLLATE “C”) will go through index scan.

EXPLAIN ANALYZE SELECT * FROM public.employee WHERE emp_name LIKE 'dub%';
Index Scan using idx_emp_name on employee  (cost=0.44..8.46 rows=1 width=201) (actual time=0.025..0.026 rows=0 loops=1)
Index Cond: (((emp_name)::text >= 'dub'::text) AND ((emp_name)::text < 'duc'::text))
Filter: ((emp_name)::text ~~ 'dub%'::text)
Planning Time: 0.139 ms
Execution Time: 0.047 ms

Like query on test.employee with column (emp_name CHARACTER VARYING) i.e. Without COLLATE “C” clause, will go through sequential scan.

EXPLAIN ANALYZE SELECT * FROM test.employee WHERE emp_name LIKE 'dub%';

Seq Scan on employee  (cost=0.00..1778.69 rows=1 width=192) (actual time=15.203..15.204 rows=0 loops=1)
Filter: ((emp_name)::text ~~ 'dub%'::text)
Rows Removed by Filter: 43335
Planning Time: 0.986 ms
Execution Time: 15.236 ms

Multi-column indexes

An index created on multiple columns is called a multi-column index. If you have a some queries in your use case that needs filtering on column ( emp_name ) or ( emp_dep ) and also have a few queries that involve both ( emp_name, emp_dep ) columns as well, you can create a multicolumn index. A single multi column index on both columns will generally use index scan for all queries involving the column combinations of these columns using equality operator.

We can see in the below example that if we have created multi-column index on ( emp_name, emp_dep ), then the index scan is being used in all such cases when either of column emp_name or emp_dep or emp_name and emp_dep is used in where clause.

CREATE INDEX idx_emp_name_dept ON employee (emp_name, emp_dep);
EXPLAIN ANALYZE SELECT * FROM employee WHERE emp_name='drnkcev' AND emp_dep='K03YU';

Index Scan using idx_emp_name_dept on employee  (cost=0.29..13.29 rows=1 width=192) (actual time=0.017..0.018 rows=0 loops=1)
Index Cond: (((emp_name)::text = 'drnkcev'::text) AND ((emp_dep)::text = 'K03YU'::text))
Planning Time: 0.183 ms
Execution Time: 0.042 ms

EXPLAIN ANALYZE SELECT * FROM employee WHERE emp_name='drnkcev' ;
Index Scan using idx_emp_name_dept on employee  (cost=0.29..13.29 rows=1 width=192) (actual time=0.018..0.019 rows=0 loops=1)
Index Cond: ((emp_name)::text = 'drnkcev'::text)
Planning Time: 0.074 ms
Execution Time: 0.044 ms

EXPLAIN ANALYZE SELECT * FROM employee WHERE emp_dep='K03YU' ;
Index Scan using idx_emp_name_dept on employee  (cost=0.29..591.79 rows=1 width=192) (actual time=0.202..0.202 rows=0 loops=1)
Index Cond: ((emp_dep)::text = 'K03YU'::text)
Planning Time: 0.073 ms
Execution Time: 0.228 ms

However, the order of the columns in the index matters. A multicolumn B-tree index is most efficient when there are constraints on the leading (leftmost) columns. For example, given an index on (emp_age, emp_salary, emp_id) and a query condition WHERE emp_age = 45 AND emp_salary >= 1000 AND emp_id < 200, the index would have to be scanned from the first entry with emp_age = 45 and emp_salary = 1000 up through the last entry with emp_age = 45 . Index entries with emp_id < 200 would be skipped, but they’d still have to be scanned through.

Partial indexes

A partial index is an index created on a subset of table data. The subset is defined on a conditional predicate. One major use case of this index type is to avoid indexing common values. For example, suppose we have the emp_age field, where a very small percentage of employees are not 35 years old, and we’re only interested in finding these employees and don’t care about common values (employees who are 35). We can create a partial index on this subset of table data (age <> 35) and use an index for SQL to filter non-common values (all ages other than 35). Another advantage of this index type is the index size, which will be smaller than an index on a column without conditions.

CREATE INDEX idx_emp_age_subset ON employee(emp_age) WHERE NOT (emp_age = 35)

Name        | Type  |  Owner   |  Table   | Persistence | Access method |  Size
-------------------+-------+----------+----------+-------------+---------------+-------
idx_emp_age        | index | postgres | employee | permanent   | btree         | 286 MB
idx_emp_age_subset | index | postgres | employee | permanent   | btree         | 80 kB

EXPLAIN ANALYZE select * from employee where emp_age = 45;
		 
Index Scan using idx_emp_age_subset on employee  (cost=0.29..1213.03 rows=24557 width=204) (actual time=0.010..1.321 rows=8665 loops=1)
Index Cond: (emp_age = 45)
Planning Time: 0.282 ms
Execution Time: 1.648 ms

Another use case is when you’re only fetching specific data according to your interest (eliminating rows from the index that you don’t want to retrieve, such as uninteresting values). This is the more common use case than the one above. You can also create unique partial indexes for given conditions.

There is an overhead with partial indexes because it requires that the common values be predetermined. Therefore, partial indexes are best used for data distributions that are well known and don’t change over a period of time. Note that you shouldn’t create too many partial indexes defined for too many conditions (ranges), which would result in partial indexes as a substitute for partitioning.

Functional indexes

A functional index is defined on the result of an immutable function applied to one or more columns of a single table. You can use a functional index to improve performance of the query with some expressions or functions. In the following example, we create a functional index on the column emp_name with the LOWER function:

EXPLAIN ANALYZE SELECT emp_id, emp_name FROM employee WHERE  LOWER(emp_name) = 'doomhes';

Seq Scan on employee  (cost=0.00..196160.58 rows=22928 width=14) (actual time=890.211..890.211 rows=0 loops=1)
Filter: (lower((emp_name)::text) = 'doomhes'::text)
Rows Removed by Filter: 4333335
Planning Time: 0.346 ms
Execution Time: 890.553 ms

CREATE INDEX idx_emp_name_fn ON employee(LOWER(emp_name));
		  
EXPLAIN ANALYZE SELECT emp_id, emp_name FROM employee WHERE  LOWER(emp_name) = 'doomhes';

Bitmap Heap Scan on employee  (cost=244.35..56725.60 rows=21667 width=14) (actual time=0.071..0.072 rows=0 loops=1)
Recheck Cond: (lower((emp_name)::text) = 'doomhes'::text)
->  Bitmap Index Scan on idx_emp_name_fn  (cost=0.00..238.94 rows=21667 width=0) (actual time=0.065..0.066 rows=0 loops=1)
Index Cond: (lower((emp_name)::text) = 'doomhes'::text)
Planning Time: 1.278 ms
Execution Time: 0.276 ms

You can also use a functional index to create an index with different data types than its column data type.

For example, if the application is querying using the BPCHAR type in a WHERE clause but the column (emp_name) is of type VARCHAR/TEXT, then we must convert the index type to BPCHAR, while creating it, which will give better performance. See the following code:

EXPLAIN ANALYZE SELECT emp_id, emp_name FROM employee WHERE emp_name =bpchar('doomhes');
Seq Scan on employee  (cost=0.00..181543.69 rows=1 width=14) (actual time=723.032..723.032 rows=0 loops=1)
Filter: ((emp_name)::bpchar = 'doomhes'::bpchar)
Rows Removed by Filter: 4333335
Planning Time: 0.498 ms
Execution Time: 723.066 ms
		  
CREATE INDEX idx_emp_name_varchar_to_bpchar ON employee (bpchar(emp_name));

EXPLAIN ANALYZE SELECT emp_id, emp_name FROM employee WHERE emp_name =bpchar('doomhes');
		  
Index Scan using idx_emp_name_varchar_to_bpchar on employee  (cost=0.43..8.45 rows=1 width=14) (actual time=1.221..1.222 rows=0 loops=1)
Index Cond: ((emp_name)::bpchar = 'doomhes'::bpchar)
Planning Time: 4.115 ms
Execution Time: 1.246 ms

Cluster the table on the index

When you need to retrieve data within a range for example, ( emp_id > 100 and emp_id < 200 ), the response of this SQL will be different on different systems depending on the physical placement of data on the disk. In our example, because we generated the emp_id data using generate_series(), the data is organized by emp_id and is physically lying sequentially on disk. This is called clustering the table data. You can cluster tables at the index level, and you can use this method for more than one index. This is useful when you want related data to be kept close to each other and hence the cost of retrieving the data should be less. If you have index on the column in the where condition and still cost of execution is high due to it is going for Bitmap heap scan then just cluster the index and you will get performance benefit as planner will go for index scan.

Let’s look at an example. If we load data into the employee table using the preceding script with an order by random() clause, we can simulate a real-life non-clustered table situation ( correlation <> 1 ). We see the query run plan change to bitmap Index Scan, and it takes longer to fetch this data because it’s not present in a sequential manner on disk ( correlation=1 ). Instead, it’s distributed ( correlation <> 1 ).

Note that correlation is a statistical correlation between physical row ordering and logical ordering of the column values. This ranges from -1 to +1. When the value is near -1 or +1, an index scan on the column will be estimated to be cheaper than when it is near zero, due to the reduction of random access to the disk.

SELECT tablename, attname, correlation FROM pg_stats WHERE tablename = 'employee' AND attname = 'emp_id' order by 1;
tablename | attname |  correlation
-----------+---------+---------------
employee1 | emp_id  | -0.0005668058

EXPLAIN ANALYZE SELECT * FROM employee WHERE emp_id &gt; 444 AND emp_id &lt; 555;

Bitmap Heap Scan on employee  (cost=5.40..375.15 rows=94 width=199) (actual time=0.563..42.591 rows=95 loops=1)
Recheck Cond: ((emp_id > 444) AND (emp_id < 555))
Heap Blocks: exact=95
->  Bitmap Index Scan on idx_emp_id  (cost=0.00..5.37 rows=94 width=0) (actual time=0.015..0.016 rows=95 loops=1)
Index Cond: ((emp_id > 444) AND (emp_id < 555))
Planning Time: 0.284 ms
Execution Time: 42.639 ms

If we cluster the table using the index on emp_id (correlation = 1), the planner will choose Index Scan over bitmap Index Scan because it is aware (through updated statistics). The data is now placed correlated on the disk and fetching from the disk is faster.

NOTE: When a table is being clustered, an ACCESS EXCLUSIVE lock is acquired on it. This prevents any other database operations (both reads and writes) from operating on the table until the CLUSTER is finished. Hence perform this operation during maintenance window.

CLUSTER employee USING idx_emp_id;
CLUSTER

ANALYZE employee;
ANALYZE

SELECT tablename, attname, correlation FROM pg_stats WHERE tablename = 'employee' AND attname = 'emp_id' ORDER BY 1;
tablename | attname | correlation
-----------+---------+-------------
employee  | emp_id  |           1

EXPLAIN ANALYZE SELECT * FROM employee WHERE emp_id &gt; 444 AND emp_id &lt; 555;

Index Scan using idx_emp_id on employee1  (cost=0.56..12.53 rows=98 width=204) (actual time=0.553..1.166 rows=95 loops=1)
Index Cond: ((emp_id > 444) AND (emp_id < 555))
Planning Time: 0.941 ms
Execution Time: 1.186 ms

When to choose Clustering

If you’re requesting a range ( emp_id > 444 and emp_id < 555 ) of indexed values from a table, or a single indexed value that has multiple rows ( emp_id > 444 ) that match, clustering the table will help because after the index identifies the table page for the first row that matches, all other rows that match are probably already on the same table page, so you save disk reads and speed up the query. The runtime will be further reduced to take advantage of the clustered data.

Clean up

Complete the following steps to clean up the resources and data you created during this post:

Delete your DB cluster or DB instance.
Delete the sample data with the following query:

DROP TABLE employee;

Conclusion

In this post, we showed you various use cases of the B-tree index type supported in Amazon Aurora PostgreSQL compatible and Amazon RDS for PostgreSQL. If you have similar use cases in your application or workload, you can refer to this post and implement similar solutions.

In upcoming Part 2 of this series, we discuss other native indexes, including GIN, GiST, HASH, and BRIN.

Leave any thoughts or questions in the comments section.

About the Authors

Rajkumar Raghuwanshi is a Database Consultant with AWS Professional Services based out of Pune, India. With a good knowledge of relational databases and hands-on experience in homogenous and heterogenous database migrations, he helps customers migrate to the AWS Cloud and optimize performance.

Sachin Khanna is a Lead Database Consultant on the AWS Professional Services team at AWS. He has extensive experience in relational databases with a passion for Amazon Aurora PostgreSQL. He is a consultant with expertise on performance and optimization. He carries rich experience in database migrations and cost optimization, and has helped customers in their cloud migration journey.

Sachin Kotwal is a Lead Database Consultant with AWS Professional Services based out of Pune, India. He has more than a decade of experience in relational database administration and architecture. He has worked on many homogenous database migrations and performance optimizations.

AWS Database Blog