AWS Database Blog
Manage case-insensitive data in PostgreSQL
It’s important to be aware of the case sensitivity of text data in PostgreSQL when performing queries or comparisons to ensure that the results match your intended expectations. Case also affects how the database performs sorting operations. By default, PostgreSQL is case sensitive when sorting or comparing string values. For example, PostgreSQL considers “amazon” and “Amazon” to be different values, and this impacts how they are sorted.
If you want to search for a specific value regardless of its case, you need to use a case-insensitive comparison function or normalize the data to a consistent case. Case insensitivity in PostgreSQL allows for greater flexibility and ease of use when working with data.
In this post, we discuss different options to deal with the case-insensitive data in PostgreSQL and their use cases, pros, and cons.
Solution overview
Case insensitivity in PostgreSQL can improve productivity, reduce errors, and make it easier to work with data and identifiers in a flexible and consistent way. However, there are some performance tradeoffs to consider. PostgreSQL provides different options to achieve case insensitivity in data. We discuss the following options in this post:
- Using case-insensitive comparison functions
- Using
ILIKE
or~~*
operators for comparison - Using the
CITEXT
data type - Using a custom
=
(equal to) operator - Using a case-insensitive ICU collation
Prerequisites
If you want to follow along this post’s examples, complete the following prerequisite steps:
- Create an Amazon Aurora PostgreSQL-Compatible Edition cluster or an Amazon Relational Database Service (Amazon RDS) for PostgreSQL instance if you don’t already have one. For instructions, refer to Create an Aurora PostgreSQL DB cluster or Create a PostgreSQL DB instance respectively.
- Create an Amazon Elastic Compute Cloud (Amazon EC2) instance to install the PostgreSQL client to access the Aurora PostgreSQL or RDS for PostgreSQL instance. For instructions, refer to Create your EC2 resources and launch your EC2 instance. Alternatively, you can set up connectivity between your RDS database and EC2 compute instance in 1-click
- Install the PostgreSQL client. On Amazon Linux, we can use the following commands to download the psql command line tool:
Option 1: Using case-insensitive comparison functions
In this option, we discuss the PostgreSQL in-built comparison functions LOWER
and UPPER
. These functions allow you to convert a string to lowercase or uppercase at runtime, respectively.
The lower
function takes a string as an argument and returns a new string with all the characters in lowercase. For example:
The upper
function, on the other hand, takes a string as an argument and returns a new string with all the characters in uppercase. For example:
This is a straightforward way of dealing with case insensitivity in PostgreSQL. When performing a search on a text column, you can use the lower or upper function to convert the search term to lowercase or uppercase and then search for that term in the lowercased or uppercased text column. This ensures that the search is case insensitive, so you can find matches regardless of the case of the text data. Also, when importing data from external sources, it’s common to use the lower
and upper
functions to standardize the case of text data. This helps ensure that all text data is in the same case format, making it easier to compare and manipulate.
Benefits
This option offers built-in functions and is simple to use.
Drawbacks
If you want to use an index for case-insensitive searches or comparisons, you’ll need to create a specialized index such as an expression index or a specialized data type like citext
instead of relying on a standard text index. Additionally, PostgreSQL can’t use an expression index if the expression is not present, so you would need to add another index for case-sensitive searches. Complete the following steps:
- Create a test table:
- Insert some sample data. The following query inserts 1 million records:
- Create a regular index on the email column:
- Run the following query, which doesn’t use the index created in the previous step because it’s a normal index. You can see the full table scan (seq scan) as highlighted:
- Create a functional index on the uppercased email column:
- Run the same query to see if the functional index is utilized:
Option 2: Using ILIKE
or ~~*
operators for comparison
In this option, we discuss PostgreSQL built-in string comparison operators ILIKE
and ~~*
. They are equivalent in functionality, but ILIKE
is the standard SQL-compliant operator whereas ~~*
is a PostgreSQL-specific operator.
The ILIKE
operator matches two strings, ignoring differences in letter case. It’s equivalent to the LIKE
operator but case insensitive. For example, the expression 'hello' ILIKE 'HelLO'
would evaluate to TRUE
:
The ~~*
operator is the PostgreSQL-specific version of ILIKE
, and performs a case-insensitive string comparison similar to ILIKE
. For example, the expression ‘hello' ~~* 'HelLO'
would evaluate to TRUE
:
When accepting user input in a search form or filter, you can use the ILIKE
or ~~*
operator to search for matching records in a case-insensitive manner. This ensures that users can find the records they’re looking for regardless of the case of the input. You can also use these operators when searching for specific text patterns in a large dataset. This is useful for scenarios such as finding all the product names that contain a certain keyword, regardless of the case of the keyword.
Benefits
This option offers built-in operators and is simple to use.
Drawbacks
You cannot use B-tree indexes, even expression indexes, with the ILIKE
or ~~*
operators. However, you can use PostgreSQL’s GIN or GiST index types to build indexes for these types of operators. More details on GIN and GiST indexes including the differences can be found in Preferred Index Types for Text Search documentation. Keep the following in mind:
- You must use the
pg_trgm
extension. This extension is available in Amazon Aurora and Amazon RDS, but it is not enabled by default. You can install it in your database using “CREATE EXTENSION pg_trgm;
“ - GIN indexes can be significantly larger than B-tree indexes, especially for data types with many distinct values or long text fields. This can impact the speed of index updates and maintenance. Creating a GIN index, as well as inserts and updates, can take longer than when using a B-tree index.
- GIN indexes are optimized for certain query patterns, such as searching for multiple values or ranges within the indexed data. However, they may not be as effective for other types of queries, such as equality or range queries on a single column.
Using the same table that we created earlier (case_test_opt1
), we can create a normal, functional, and GIN index to see the index creation time difference:
For this data set, the GIN index creation time took almost three times longer than the B-tree indexes from the previous example.
The following query shows that PostgreSQL scans the GIN index when using the ILIKE operator:
Clean up after you have completed the testing:
Option 3: Using the CITEXT
data type
citext
is a case-insensitive text data type in PostgreSQL. It’s an extension module that provides a case-insensitive version of the built-in text data type. When citext is used for a column, any string value inserted into that column is automatically converted to lowercase, and any queries on that column are case insensitive.
To use citext
, you must first enable the citext
extension in your database using the following command:
After that, you can create a table with a citext
column using following statement:
If you’re migrating data from a system that uses case-insensitive text
types, using citext
can simplify the migration process by allowing you to maintain the same data format. citext
can help prevent data entry errors or inconsistencies in your data. With citext
, you don’t need to use lower
and upper
functions in your queries to perform case-insensitive searches or comparisons. This can make your queries simpler and easier to read.
Benefits
This option offers a built-in extension with the citext
data type. Additionally, migration from other databases with case-insensitive data types is straightforward.
Drawbacks
A drawback to this option is that you can’t perform case-sensitive searches if needed. Also, if you have a citext
type on one side of an operator and a text
on the other for a query, then it might not work as expected. So, we need to be careful with such queries.
Option 4: Using a custom =
operator
PostgreSQL allows users to define custom operators to extend the built-in set of operators. Custom operators can be defined to perform specific operations that aren’t supported by the built-in operators, or to provide shorthand notations for frequently used operations. In this option, we show how to create a new = operator to deal with the case-insensitive data. Complete the following steps:
- Create a new schema to define the operator and grant necessary privileges on the schema to public:
- Create a function to define the functionality of the operator. This function basically changes the case of right and left operands to lowercase so that comparison is done without modifying the query to use any built-in functions like
lower
orupper
. - Grant the necessary privileges on the function to public:
- Create the operator using the new function:
- Test the new operator, for example:
- Query the data with last_name and review the results:
You can see only one row based on the case.
- Set the
search_path
to use theutil
schema beforepg_catalog
and check the results:
You can see all three rows regardless of case.
Alternatively, you can use the absolute path of the operator using schema instead of setting the search_path
:
You can use an expression index if required:
Benefits
One benefit of this option is that you might not need to change the application queries to add any built-in functions like lower
and upper
or add any operators like ~~*
or ILIKE
if you choose to add the schema (in which the operator is created) to the search_path
and set it to the database.
Additionally, you can set the search_path
as a switch to make the data case insensitive if you want only part of your application to behave like case insensitive without making any changes to the application.
Drawbacks
You need to add the search_path
(with the util
schema before pg_catalog
) to the application queries or alter the database to set the search_path
:
In general, custom operators can be a powerful tool for developers and DBAs to extend the functionality of PostgreSQL and make it simpler to work with some operations. However, it’s important to use them judiciously and follow best practices for naming, documentation, and testing to ensure that they are consistent and efficient.
Clean up after you have completed the testing:
Option 5: Using a case-insensitive ICU collation
“Custom collations” are a feature of the ICU collation provider. ICU allows extensive control over collation behavior by defining new collations with collation settings as a part of the language tag. These settings can modify the collation order to suit a variety of needs. You can go through Manage collation changes in PostgreSQL on Amazon Aurora and Amazon RDS for more information on collations.
The following is a test case with custom collation to access case-insensitive data:
- Create a collation using the following command:
- Create a table and insert the data:
- Select the data and check the results:
You can see all three rows regardless of case.
Benefits
Like Option 4, you don’t need to change the application queries to add any built-in functions like lower
or upper
or add any operators like ~~*
or ILIKE
. Also, you don’t need to set the search_path
.
Drawbacks
The major drawback is pattern matching does not work for nondeterministic collations that is case-insensitive collations. As of PostgreSQL 15, this also does not address the ability to index nondeterministic collations. Also, the sorting of the data might change as well based on the collation that you use. You have to deal with sorting if you are changing the collation. For example, the following has the order of last_name
as bar, BAR, BaR
:
if we modify the column’s collation and run a query we observe that the order of column changed to bar, BaR, BAR
:
You can see that changing the collation changed the order of the rows.
Although custom collations in PostgreSQL can provide many benefits, it’s important to carefully consider the potential drawbacks before implementing them in a production environment. Proper planning, testing, and ongoing maintenance can help ensure that custom collations work effectively and don’t negatively impact database performance or compatibility.
Clean up after you have completed the testing:
Conclusion
PostgreSQL provides several options for dealing with case insensitivity, each with its own advantages and disadvantages. The best approach depends on the specific requirements of your application and the data being stored in the database. In this post, we discussed a few options with their pros and cons. Based on your requirements, you can choose the right option for your application.
If you have any questions or suggestions about post, leave a comment.
About the authors
Baji Shaik is a Sr. Lead Consultant with AWS ProServe, Global Competency Centre. His background spans a wide depth and breadth of expertise and experience in SQL and NoSQL database technologies. He is a Database Migration Expert and has developed many successful database solutions addressing challenging business requirements for moving databases from on premises to Amazon RDS and Amazon Aurora PostgreSQL/MySQL. He is an eminent author, having written several books on PostgreSQL. A few of his recent works include “PostgreSQL Configuration,” “Beginning PostgreSQL on the Cloud,” and “PostgreSQL Development Essentials.” Furthermore, he has delivered several conference and workshop sessions.
Scott Mead is a Database Engineer at AWS.