AWS Machine Learning Blog
Take your intelligent search experience to the next level with Amazon Kendra hierarchical facets
Unstructured data continues to grow in many organizations, making it a challenge for users to get the information they need. Amazon Kendra is a highly accurate, intelligent search service powered by machine learning (ML). Amazon Kendra uses deep learning and reading comprehension to deliver precise answers, and returns a list of ranked documents that match the search query for you to choose from. To help users interactively narrow down the list of relevant documents, you can assign metadata at the time of document ingestion to provide filtering and faceting capabilities.
In a search solution with a growing number of documents, simple faceting or filtering isn’t always sufficient to let users pinpoint the documents with the information they’re looking for. Amazon Kendra now features hierarchical facets, which give a more granular view of the scope of the search results. Hierarchical facets offer filtering options with more details about the number of results expected for each option, and allow users to further narrow their search and reach their documents of interest quickly.
In this post, we demonstrate what hierarchical facets in Amazon Kendra can do. We first ingest a set of documents, along with their metadata, into an Amazon Kendra index. We then make search queries using both simple and hierarchical facets, and add filtering to get straight to the documents of interest.
Solution overview
Instead of presenting each facet individually as a list, hierarchical facets enable defining a parent-child relationship between facets to shape the scope of the search results. With this, you see the number of results that not only have a particular facet, but also have each of the sub-facets. Let’s take the example of a repository of AWS documents of types User_Guides, Reference_Guides, and Release_Notes, regarding compute, storage, and database technologies.
First let’s look at non-hierarchical facets from the response to a search query:
Here we know the number of search results in each of the technologies, as well as each of the document types. However, we don’t know, for example, how many results to expect from User_Guides related to Storage, except that it can be at most 22, the smaller of the two counts User_Guides:37 and Storage:22.
Now let’s look at hierarchical facets from the response to the same search query:
With hierarchical facets, we get more information: the number of results from each document type about each technology. With this additional information, we know that there are 16 results from User_Guides related to Storage.
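The difference between the two views can be sketched in a few lines of Python. This is only an illustration using the counts quoted above; the dictionary names and the upper_bound helper are assumptions made for this sketch, and counts the text doesn't mention are left out rather than guessed.

```python
# Counts quoted in the text above; everything else is omitted, not guessed.
flat_counts = {
    "Document_Type": {"User_Guides": 37},
    "Technology": {"Storage": 22},
}
# With hierarchical facets the count for the combination is reported directly.
pair_counts = {("User_Guides", "Storage"): 16}


def upper_bound(flat, document_type, technology):
    """Best estimate available from flat facets: the smaller of the two counts."""
    return min(flat["Document_Type"][document_type], flat["Technology"][technology])


bound = upper_bound(flat_counts, "User_Guides", "Storage")
exact = pair_counts[("User_Guides", "Storage")]
print(f"flat facets: at most {bound} results; hierarchical facets: exactly {exact}")
```

Flat facets only bound the answer from above, while hierarchical facets report the count for the combination itself.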
In the subsequent sections, we use this example to demonstrate the use of hierarchical facets to narrow down search results, along with step-by-step instructions you can follow to try this out in your own AWS account. If you just want to read about this feature without running it yourself, you can refer to the Python script facet-search-query.py used in this post and its output, output.txt, and then jump to the section Search and filtering with facets without hierarchy.
Prerequisites
To deploy and experiment with the solution in this post, make sure that you have the following:
- An AWS account with privileges to create AWS Identity and Access Management (IAM) roles and policies. For more information, see Overview of access management: Permissions and policies.
- Basic knowledge of AWS and the ability to work on the AWS Management Console.
- The ability to create an Amazon Simple Storage Service (Amazon S3) bucket.
- Access to AWS CloudShell.
- Basic knowledge of Python programming.
Set up the infrastructure and run the Python script to query the Amazon Kendra index
To set up the solution, complete the following steps:
- Use the AWS Management Console for Amazon S3 to create an S3 bucket to use as a data source to store the sample documents.
- On the AWS Management Console, start CloudShell by choosing the shell icon on the navigation bar.
Alternatively, you can run the Python script from any computer that has the AWS SDK for Python (Boto3) installed and an AWS account with access to the Amazon Kendra index. Make sure to update Boto3 on your computer. For simplicity, the step-by-step instructions in this post focus on CloudShell.
- After CloudShell starts, download facet-search-query.py to your local machine.
- Upload the script to your CloudShell by switching to the CloudShell tab, choosing the Actions menu, and choosing Upload file.
- Download hierarchical-facets-data.zip to your local machine, unzip it, and upload the entire directory structure to your S3 bucket.
- If you’re not using an existing Amazon Kendra index, create a new Amazon Kendra index.
- On the Amazon Kendra console, open your index.
- In the navigation pane, choose Facet definition.
- Choose Add field.
- Configure the field Document_Type and choose Add.
- Configure the field Technology and choose Add.
- Configure your S3 bucket as a data source to the Amazon Kendra index you just created.
- Sync the data source and wait for the sync to complete.
- Switch to the CloudShell tab.
- Update Boto3 by running pip3 install boto3==1.23.1 --upgrade.
This ensures that CloudShell has a version of Boto3 that supports hierarchical facets.
- Edit facet-search-query.py and replace REPLACE-WITH-YOUR-AMAZON-KENDRA-INDEX-ID with your Amazon Kendra index ID.
You can get the index ID by opening your index details on the Amazon Kendra console.
- In the CloudShell prompt, run facet-search-query.py using the command python3 facet-search-query.py | tee output.txt.
If this step fails with the error Unknown parameter in Facets[0]: "Facets", must be one of: DocumentAttributeKey, choose the Actions menu, and choose Delete AWS CloudShell home directory. Then repeat the steps to download facet-search-query.py, update Boto3, edit facet-search-query.py, and run it again. If you have any other data in the CloudShell home directory, back it up before deleting the directory.
For convenience, all the steps are included in one Python script. You can read facet-search-query.py and experiment by copying parts of this script into your own scripts. Open output.txt to observe the search results.
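If you want to confirm the Boto3 version before running the script, a small check like the following may help. It is a sketch that assumes 1.23.1 (the version named in the pip3 command above) or any later release supports nested facets; the helper names are hypothetical.

```python
import re

# Version named in the pip3 command above; treating it as the minimum
# supported version is an assumption of this sketch.
MIN_VERSION = (1, 23, 1)


def version_tuple(version):
    """Turn a version string such as '1.23.1' into (1, 23, 1)."""
    return tuple(int(number) for number in re.findall(r"\d+", version)[:3])


def supports_hierarchical_facets(version):
    """Return True if the given Boto3 version is at least MIN_VERSION."""
    return version_tuple(version) >= MIN_VERSION


# Typical use (requires Boto3 to be installed):
#   import boto3
#   print(supports_hierarchical_facets(boto3.__version__))
```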
Search and filtering with facets without hierarchy
Let’s start by querying with facets having no hierarchy. In this case, the facets parameter used in the query only provides the information that the results in the response should be faceted using two attributes: Technology and Document_Type. See the following code:
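A minimal sketch of such a facets parameter, using the two attribute names defined earlier, might look like the following. The variable name is illustrative and may differ from the one used in facet-search-query.py.

```python
# Each entry asks Amazon Kendra to count results per value of one attribute.
# No entry is nested inside another, so the facets have no hierarchy.
facets = [
    {"DocumentAttributeKey": "Technology"},
    {"DocumentAttributeKey": "Document_Type"},
]
```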
This is used as a parameter to the query API call:
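One way to wrap the call is sketched below. It assumes client is a Boto3 Amazon Kendra client, for example boto3.client("kendra"); the run_query helper and its argument names are hypothetical, and the query text is whatever you choose to search for.

```python
def run_query(client, index_id, query_text, facets, attribute_filter=None):
    """Call the Amazon Kendra Query API with facets and an optional filter.

    `client` is assumed to be a Boto3 Kendra client, e.g. boto3.client("kendra").
    """
    kwargs = {"IndexId": index_id, "QueryText": query_text, "Facets": facets}
    # AttributeFilter is only sent when a filter is actually supplied.
    if attribute_filter is not None:
        kwargs["AttributeFilter"] = attribute_filter
    return client.query(**kwargs)


# Typical use (requires Boto3 and AWS credentials):
#   import boto3
#   kendra = boto3.client("kendra")
#   response = run_query(kendra, "REPLACE-WITH-YOUR-AMAZON-KENDRA-INDEX-ID",
#                        "your search text",
#                        [{"DocumentAttributeKey": "Technology"}])
```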
The formatted version of the response is as follows:
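The exact output depends on the documents in your index. As a sketch of how the facet counts in a response could be formatted, the following helper walks the FacetResults part of a query response. The format_facets name is an assumption, and the sample response is hand-built from the two counts quoted earlier in this post, not real API output.

```python
def format_facets(response):
    """Render each facet key followed by its value counts, one per line."""
    lines = []
    for facet in response.get("FacetResults", []):
        lines.append(facet["DocumentAttributeKey"])
        for pair in facet.get("DocumentAttributeValueCountPairs", []):
            value = pair["DocumentAttributeValue"].get("StringValue")
            lines.append(f"  {value}: {pair['Count']}")
    return "\n".join(lines)


# Hand-built sample containing only the counts mentioned in the text.
sample_response = {
    "FacetResults": [
        {
            "DocumentAttributeKey": "Technology",
            "DocumentAttributeValueCountPairs": [
                {"DocumentAttributeValue": {"StringValue": "Storage"}, "Count": 22},
            ],
        },
        {
            "DocumentAttributeKey": "Document_Type",
            "DocumentAttributeValueCountPairs": [
                {"DocumentAttributeValue": {"StringValue": "User_Guides"}, "Count": 37},
            ],
        },
    ]
}
print(format_facets(sample_response))
```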
The first result from the response is from a User_Guide about Databases. The facets below the result show the number of results for Technology and Document_Type present in the response.
Let’s narrow down these results to be only from User_Guides and Storage by setting the filter as follows:
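A filter of this kind can be expressed with the AttributeFilter structure of the Query API. The following is a sketch with an illustrative variable name; it requires both conditions to hold.

```python
# Keep only results whose Document_Type is User_Guides AND whose
# Technology is Storage.
attribute_filter = {
    "AndAllFilters": [
        {
            "EqualsTo": {
                "Key": "Document_Type",
                "Value": {"StringValue": "User_Guides"},
            }
        },
        {
            "EqualsTo": {
                "Key": "Technology",
                "Value": {"StringValue": "Storage"},
            }
        },
    ]
}
```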
Now let’s make a query call using the facets without hierarchy and the preceding filter:
A formatted version of the response is as follows:
The response contains 16 results from User_Guides on Storage. Based on the non-hierarchical facets in the response without filters, we only knew to expect at most 22 results.
Search and filtering with hierarchical facets with Document_Type as a sub-facet of Technology
Now let’s run a query using hierarchical facets, with Document_Type as a sub-facet of Technology. This hierarchical relationship is important for a Technology-focused user such as an engineer. Note the nested facets in the following definition. The MaxResults parameter limits the response to the top MaxResults values of a facet. In our example, Technology and Document_Type have only three values each, so this parameter isn’t particularly useful. When a facet has many values, it makes sense to use this parameter.
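A nested definition along these lines could look like the following sketch. The variable name and the MaxResults value of 10 are illustrative assumptions, not values taken from facet-search-query.py.

```python
# Document_Type is nested inside Technology, so every Technology value in the
# response carries its own breakdown by Document_Type.
hierarchical_facets = [
    {
        "DocumentAttributeKey": "Technology",
        "MaxResults": 10,  # illustrative value
        "Facets": [
            {"DocumentAttributeKey": "Document_Type", "MaxResults": 10},
        ],
    }
]
```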
The query API call is made as follows:
The formatted version of the response is as follows:
The results are classified as per the Technology facet followed by Document_Type. In this case, looking at the facets, we know that 16 results are from User_Guides about Storage and 7 are from Reference_Guides related to Databases.
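To read such counts programmatically, the nested facet results can be flattened into a lookup keyed by the parent and child values. The sketch below assumes each parent value's entry carries its own FacetResults list; the nested_counts helper is hypothetical, and the sample is hand-built from the two counts mentioned above, with parent-level counts left out.

```python
def nested_counts(response):
    """Map (parent value, child value) -> count from nested FacetResults."""
    counts = {}
    for facet in response.get("FacetResults", []):
        for parent in facet.get("DocumentAttributeValueCountPairs", []):
            parent_value = parent["DocumentAttributeValue"]["StringValue"]
            for sub_facet in parent.get("FacetResults", []):
                for child in sub_facet.get("DocumentAttributeValueCountPairs", []):
                    child_value = child["DocumentAttributeValue"]["StringValue"]
                    counts[(parent_value, child_value)] = child["Count"]
    return counts


def _child(doc_type, count):
    """Build one hand-made Document_Type sub-facet entry for the sample."""
    return {
        "DocumentAttributeKey": "Document_Type",
        "DocumentAttributeValueCountPairs": [
            {"DocumentAttributeValue": {"StringValue": doc_type}, "Count": count},
        ],
    }


# Hand-built sample; only the counts quoted in the text are included.
sample_response = {
    "FacetResults": [
        {
            "DocumentAttributeKey": "Technology",
            "DocumentAttributeValueCountPairs": [
                {
                    "DocumentAttributeValue": {"StringValue": "Storage"},
                    "FacetResults": [_child("User_Guides", 16)],
                },
                {
                    "DocumentAttributeValue": {"StringValue": "Databases"},
                    "FacetResults": [_child("Reference_Guides", 7)],
                },
            ],
        }
    ]
}
counts = nested_counts(sample_response)
print(counts[("Storage", "User_Guides")], counts[("Databases", "Reference_Guides")])
```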
Let’s narrow down these results to be only from Reference_Guides related to Databases using the following filter:
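Because filters like this always combine one equality condition per selected facet value, they can be built from the user's selection. The selection_filter helper below is a hypothetical convenience, not part of the Amazon Kendra API; it produces the same AttributeFilter structure you would write by hand.

```python
def selection_filter(selection):
    """Build an AttributeFilter that requires every key to equal its value."""
    return {
        "AndAllFilters": [
            {"EqualsTo": {"Key": key, "Value": {"StringValue": value}}}
            for key, value in selection.items()
        ]
    }


# Reference_Guides related to Databases, as described above.
databases_reference_filter = selection_filter(
    {"Technology": "Databases", "Document_Type": "Reference_Guides"}
)
```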
Now let’s make a query API call using the hierarchical facets with this filter:
The formatted response to this is as follows:
From the facets of this response, there are seven results, all from Reference_Guides related to Databases, exactly as we knew before making the query.
Search and filtering with hierarchical facets with Technology as a sub-facet of Document_Type
You can choose the hierarchical relationship between different facets at the time of querying. Let’s define Technology as the sub-facet of Document_Type, as shown in the following code. This hierarchical relationship would be important for a Document_Type-focused user such as a technical writer.
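Since the hierarchy is chosen per query, the nested definition can be generated from an ordered list of attribute keys. The nest_facets helper and the default MaxResults value of 10 are assumptions made for this sketch; this post only uses two levels of nesting.

```python
def nest_facets(keys, max_results=10):
    """Nest facets so that each key is a sub-facet of the key before it."""
    nested = None
    # Build from the innermost facet outward.
    for key in reversed(keys):
        facet = {"DocumentAttributeKey": key, "MaxResults": max_results}
        if nested is not None:
            facet["Facets"] = [nested]
        nested = facet
    return [nested] if nested is not None else []


# Technology as a sub-facet of Document_Type.
facets_by_document_type = nest_facets(["Document_Type", "Technology"])
```

Swapping the order of the list, nest_facets(["Technology", "Document_Type"]), gives the hierarchy used in the previous section.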
The query API call is made as follows:
The formatted response to this is as follows:
The results are classified as per their Document_Type followed by Technology. In other words, reversing the hierarchical relationship results in transposing the matrix of scope of results as shown by the preceding facets. Six results are from Reference_Guides related to Compute. Let’s define the filter as follows:
We use this filter to make the query API call:
The formatted response to this is as follows:
The results contain six Reference_Guides related to Compute, exactly as we knew before running the query.
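The transposition mentioned in this section can be made concrete with a small sketch. It assumes the counts are held in a dictionary keyed by (parent value, child value); the transpose helper is hypothetical, and only counts quoted in this post are included.

```python
def transpose(counts):
    """Swap parent and child in every (parent, child) -> count entry."""
    return {(child, parent): count for (parent, child), count in counts.items()}


# Counts quoted in this post, keyed by (Technology, Document_Type).
by_technology = {
    ("Storage", "User_Guides"): 16,
    ("Databases", "Reference_Guides"): 7,
    ("Compute", "Reference_Guides"): 6,
}
# The same counts keyed by (Document_Type, Technology).
by_document_type = transpose(by_technology)
print(by_document_type[("Reference_Guides", "Compute")])
```

The counts themselves don't change; only the grouping does, which is why either hierarchy predicts the size of the filtered result set.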
Clean up
To avoid incurring future costs, clean up the resources you created as part of this solution. If you created a new Amazon Kendra index while testing this solution, delete it. If you only added a new data source using the Amazon Kendra connector for Amazon S3, delete that data source. If you created an Amazon S3 bucket to store the data used, delete that as well.
Conclusion
You can use Amazon Kendra hierarchical facets to define a hierarchical relationship between attributes to provide granular information about the scope of the results in the response to a query. This enables you to make an informed filtering choice to narrow down the search results and find the documents you’re looking for quickly.
To learn more about facets and filters in Amazon Kendra, refer to Filtering queries.
For more information on how you can automatically create, modify, or delete metadata, which you can use for faceting the search results, refer to Customizing document metadata during the ingestion process and Enrich your content and metadata to enhance your search experience with custom document enrichment in Amazon Kendra.
About the Authors
Abhinav Jawadekar is a Principal Solutions Architect focused on Amazon Kendra in the AI/ML language services team at AWS. Abhinav works with AWS customers and partners to help them build intelligent search solutions on AWS.
Ji Kim is a Software Development Engineer at Amazon Web Services and is a member of the Amazon Kendra team.