AWS Machine Learning Blog

Take your intelligent search experience to the next level with Amazon Kendra hierarchical facets

Unstructured data continues to grow in many organizations, making it a challenge for users to get the information they need. Amazon Kendra is a highly accurate, intelligent search service powered by machine learning (ML). Amazon Kendra uses deep learning and reading comprehension to deliver precise answers, and returns a list of ranked documents that match the search query for you to choose from. To help users interactively narrow down the list of relevant documents, you can assign metadata at the time of document ingestion to provide filtering and faceting capabilities.

In a search solution with a growing number of documents, simple faceting or filtering isn’t always sufficient to enable users to pinpoint documents with the information they’re looking for. Amazon Kendra now features hierarchical facets, with a more granular view of the scope of the search results. Hierarchical facets offer filtering options with more details about the number of results expected for each option, and allow users to further narrow their search, pinpointing their documents of interest quickly.

In this post, we demonstrate what hierarchical facets in Amazon Kendra can do. We first ingest a set of documents, along with their metadata, into an Amazon Kendra index. We then make search queries using both simple and hierarchical facets, and add filtering to get straight to the documents of interest.

Solution overview

Instead of presenting each facet individually as a list, hierarchical facets enable defining a parent-child relationship between facets to shape the scope of the search results. With this, you see the number of results that not only have a particular facet, but also have each of the sub-facets. Let’s take the example of a repository of AWS documents of types User_Guides, Reference_Guides and Release_Notes, regarding compute, storage, and database technologies.
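For a repository like this, each document needs Technology and Document_Type attributes at ingestion time. With an S3 data source, Amazon Kendra reads custom attributes from a JSON metadata file stored alongside each document. The following is a minimal sketch of such a file; the helper name build_metadata and the example title are assumptions for illustration, and the exact files shipped in the sample data are not shown in this post.

```python
import json


def build_metadata(title, technology, document_type):
    """Return the content of an S3 metadata file for one document.

    "Title" and "Attributes" are fields of the Amazon Kendra S3 metadata
    format; the two attribute names are the facets used in this post.
    """
    return {
        "Title": title,
        "Attributes": {
            "Technology": technology,
            "Document_Type": document_type,
        },
    }


# Hypothetical example document (values chosen for illustration only).
metadata = build_metadata("efs-ug", "Storage", "User_Guides")
metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)
```

Both attributes must also be defined as facetable fields in the index before they can be used as facets.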

First let’s look at non-hierarchical facets from the response to a search query:

Technology
  Databases:23
  Storage:22
  Compute:15
Document_Type
  User_Guides:37
  Reference_Guides:18
  Release_Notes:5

Here we know the number of search results in each of the technologies, as well as each of the document types. However, we don’t know, for example, how many results to expect from User_Guides related to Storage, except that it can be at most 22, the smaller of the counts for User_Guides (37) and Storage (22).

Now let’s look at hierarchical facets from the response to the same search query:

Technology
  Databases:23
    Document_Type
      User_Guides:12
      Reference_Guides:7
      Release_Notes:4
  Storage:22
    Document_Type
      User_Guides:16
      Reference_Guides:6
  Compute:15
    Document_Type
      User_Guides:9
      Reference_Guides:5
      Release_Notes:1

With hierarchical facets, we get more information in terms of the number of results from each document type about each technology. With this additional information, we know that there are 16 results from User_Guides related to Storage.
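The difference between the two views can be sketched in a few lines of Python. The dictionaries below are hand-copied from the two facet listings in this section; the names flat, hierarchical, upper_bound, and exact_count are assumptions made for this illustration and are not part of any Amazon Kendra API.

```python
# Flat facets: one independent count list per attribute.
flat = {
    "Technology": {"Databases": 23, "Storage": 22, "Compute": 15},
    "Document_Type": {"User_Guides": 37, "Reference_Guides": 18, "Release_Notes": 5},
}

# Hierarchical facets: Document_Type counts nested under each Technology.
hierarchical = {
    "Databases": {"User_Guides": 12, "Reference_Guides": 7, "Release_Notes": 4},
    "Storage": {"User_Guides": 16, "Reference_Guides": 6},
    "Compute": {"User_Guides": 9, "Reference_Guides": 5, "Release_Notes": 1},
}


def upper_bound(technology, document_type):
    """Best estimate available from flat facets: the smaller of the two counts."""
    return min(flat["Technology"][technology],
               flat["Document_Type"][document_type])


def exact_count(technology, document_type):
    """Exact count read from the nested facets (0 if the pair is absent)."""
    return hierarchical.get(technology, {}).get(document_type, 0)


print(upper_bound("Storage", "User_Guides"))  # bound from flat facets
print(exact_count("Storage", "User_Guides"))  # count from hierarchical facets
```

A pair that does not appear in the nested listing, such as Release_Notes under Storage, is treated here as zero results.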

In the subsequent sections, we use this example to demonstrate the use of hierarchical facets to narrow down search results, along with step-by-step instructions you can follow to try this out in your own AWS account. If you just want to read about this feature without running it yourself, you can refer to the Python script facet-search-query.py used in this post, and its output output.txt, and then jump to the section Search and filtering with facets without hierarchy.

Prerequisites

To deploy and experiment with the solution in this post, make sure that you have the following:

Set up the infrastructure and run the Python script to query the Amazon Kendra index

To set up the solution, complete the following steps:

  1. Use the AWS Management Console for Amazon S3 to create an S3 bucket to use as a data source to store the sample documents.
  2. On the AWS Management Console, start CloudShell by choosing the shell icon on the navigation bar.
    Alternatively, you can run the Python script from any computer that has the AWS SDK for Python (Boto3) installed and an AWS account with access to the Amazon Kendra index. Make sure to update Boto3 on your computer. For simplicity, the step-by-step instructions in this post focus on CloudShell.
  3. After CloudShell starts, download facet-search-query.py to your local machine.
  4. Upload the script to your CloudShell by switching to the CloudShell tab, choosing the Actions menu, and choosing Upload file.
  5. Download hierarchical-facets-data.zip to your local machine, unzip it, and upload the entire directory structure to your S3 bucket.
  6. If you’re not using an existing Amazon Kendra index, create a new Amazon Kendra index.
  7. On the Amazon Kendra console, open your index.
  8. In the navigation pane, choose Facet definition.
  9. Choose Add field.
  10. Configure the field Document_Type and choose Add.
  11. Configure the field Technology and choose Add.
  12. Configure your S3 bucket as a data source to the Amazon Kendra index you just created.
  13. Sync the data source and wait for the sync to complete.
  14. Switch to the CloudShell tab.
  15. Update Boto3 by running pip3 install boto3==1.23.1 --upgrade.
    This ensures that CloudShell has a version of Boto3 that supports hierarchical facets.
  16. Edit facet-search-query.py and replace REPLACE-WITH-YOUR-AMAZON-KENDRA-INDEX-ID with your Amazon Kendra index ID.
    You can get the index ID by opening your index details on the Amazon Kendra console.
  17. In the CloudShell prompt, run facet-search-query.py using the command python3 facet-search-query.py | tee output.txt.

If this step fails with the error Unknown parameter in Facets[0]: “Facets”, must be one of: DocumentAttributeKey, the installed version of Boto3 doesn’t support hierarchical facets. In that case, choose the Actions menu, and choose Delete AWS CloudShell home directory. Then repeat the steps to download facet-search-query.py, update Boto3, edit facet-search-query.py, and run it again. If you have any other data in the CloudShell home directory, back it up before deleting the home directory.
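Before running the script, it can help to confirm which Boto3 version Python actually picks up. The following is a minimal sketch; treating 1.23.1 (the version from the install command in this post) as the minimum is an assumption, and the helper names version_tuple and supports_hierarchical_facets are made up for this example.

```python
MIN_BOTO3 = "1.23.1"  # version used in this post; assumed here to be the minimum


def version_tuple(version):
    """Turn a string such as '1.23.1' into (1, 23, 1) for numeric comparison."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as 'rc1'
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def supports_hierarchical_facets(installed, minimum=MIN_BOTO3):
    """Return True if the installed version is at least the assumed minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    import boto3
    print(boto3.__version__, supports_hierarchical_facets(boto3.__version__))
except ImportError:
    print("Boto3 is not installed in this environment")
```

Comparing tuples of integers rather than raw strings avoids ranking 1.9.0 above 1.23.1.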

For convenience, all the steps are included in one Python script. You can read facet-search-query.py and experiment by copying parts of this script into your own scripts. Open output.txt to review the search results.

Search and filtering with facets without hierarchy

Let’s start by querying with facets having no hierarchy. In this case, the facets parameter used in the query only provides the information that the results in the response should be faceted using two attributes: Technology and Document_Type. See the following code:

fac0 = [
    { "DocumentAttributeKey":"Technology" },
    { "DocumentAttributeKey":"Document_Type" }
]

This is used as a parameter to the query API call:

kclient.query(IndexId=indexid, QueryText=kquery, Facets=fac0)

The formatted version of the response is as follows:

Query:  How to encrypt data?
Number of results: 62
Document Title:  developerguide
Document Attributes:
  Document_Type: User_Guides
  Technology: Databases
Document Excerpt:
  4. Choose the option that you want for encryption at rest. Whichever
  option you choose, you can't   change it after the cluster is
  created. • To encrypt data at rest in this cluster, choose Enable
  encryption. • If you don't want to encrypt data at rest in this
  cluster, choose Disable encryption.
----------------------------------------------------------------------
Facets:
  Technology
    Databases:23
    Storage:22
    Compute:16
  Document_Type
    User_Guides:37
    Reference_Guides:19
    Release_Notes:5
======================================================================

The first result from the response is from a User_Guide about Databases. The facets below the result show the number of results for Technology and Document_Type present in the response.
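The facet counts shown above come from the FacetResults part of the query response. As a rough sketch of how they can be read, the following function collects the counts into a plain dictionary. The sample response is hand-built from the numbers above and trimmed to the fields used here; the function name facet_counts is an assumption and is not taken from facet-search-query.py.

```python
def facet_counts(response):
    """Map each facet key to a {value: count} dictionary."""
    counts = {}
    for facet in response.get("FacetResults", []):
        key = facet["DocumentAttributeKey"]
        counts[key] = {
            pair["DocumentAttributeValue"]["StringValue"]: pair["Count"]
            for pair in facet.get("DocumentAttributeValueCountPairs", [])
        }
    return counts


def pair(value, count):
    """Build one value/count entry in the shape used by the query response."""
    return {"DocumentAttributeValue": {"StringValue": value}, "Count": count}


# Hand-built sample in the shape of a query response (illustration only).
sample_response = {
    "FacetResults": [
        {"DocumentAttributeKey": "Technology",
         "DocumentAttributeValueCountPairs": [
             pair("Databases", 23), pair("Storage", 22), pair("Compute", 16)]},
        {"DocumentAttributeKey": "Document_Type",
         "DocumentAttributeValueCountPairs": [
             pair("User_Guides", 37), pair("Reference_Guides", 19),
             pair("Release_Notes", 5)]},
    ]
}

counts = facet_counts(sample_response)
print(counts["Technology"])
print(counts["Document_Type"])
```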

Let’s narrow down these results to be only from User_Guides and Storage by setting the filter as follows:

att_filter0 = {
    "AndAllFilters": [
        {
            "EqualsTo":{
                "Key": "Technology",
                "Value": {
                    "StringValue": "Storage"
                }
            }
        },
        {
            "EqualsTo":{
                "Key": "Document_Type",
                "Value": {
                    "StringValue": "User_Guides"
                }
            }
        }
    ]
}

Now let’s make a query call using the facets without hierarchy and the preceding filter:

kclient.query(IndexId=indexid, QueryText=kquery, Facets=fac0, AttributeFilter=att_filter0)

A formatted version of the response is as follows:

Query:  How to encrypt data?
Query Filter: Technology: Storage AND Document_Type: User_Guides
Number of results: 16
Document Title:  efs-ug
Document Attributes:
  Document_Type: User_Guides
  Technology: Storage
Document Excerpt:
  ,             "Action": [                 "kms:Describe*",
  "kms:Get*",                 "kms:List*",
  "kms:RevokeGrant"             ],             "Resource": "*"
  }     ] }   Encrypting data in transit You can encrypt data in
  transit using an Amazon EFS file sys
----------------------------------------------------------------------
Facets:
  Technology
    Storage:16
  Document_Type
    User_Guides:16

The response contains 16 results from User_Guides on Storage. Based on the non-hierarchical facets in the response without filters, we only knew to expect at most 22 results.
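Filters of this shape are repetitive to write by hand. The following sketch builds the same AndAllFilters structure from a dictionary of attribute names and string values; the helper name build_and_filter is an assumption for this example and handles only string attributes compared with EqualsTo.

```python
def build_and_filter(attributes):
    """Build an AttributeFilter that requires every key to equal its value."""
    return {
        "AndAllFilters": [
            {"EqualsTo": {"Key": key, "Value": {"StringValue": value}}}
            for key, value in attributes.items()
        ]
    }


# Same filter as att_filter0: Storage documents that are User_Guides.
storage_user_guides = build_and_filter(
    {"Technology": "Storage", "Document_Type": "User_Guides"}
)
print(storage_user_guides)
```

The result can be passed as the AttributeFilter parameter of the query call in the same way as att_filter0.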

Search and filtering with hierarchical facets with Document_Type as a sub-facet of Technology

Now let’s run a query using hierarchical facets, with Document_Type as a sub-facet of Technology. This hierarchical relationship is important for a Technology-focused user such as an engineer. Note the nested facets in the following definition. The MaxResults parameter limits the number of facet values returned for the facet it is set on. In our example, Technology and Document_Type each have only three values, so this parameter isn’t particularly useful. It makes more sense when a facet has many values.

fac1 = [{
    "DocumentAttributeKey":"Technology",
    "Facets":[{
        "DocumentAttributeKey":"Document_Type",
        "MaxResults": max_results
    }],
}]

The query API call is made as follows:

kclient.query(IndexId=indexid, QueryText=kquery, Facets=fac1)
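Because the hierarchy is just nesting in the Facets parameter, it can be generated from an ordered list of attribute keys. The following is a minimal sketch; the helper name nested_facets is an assumption, and it mirrors fac1 by setting MaxResults only on the sub-facets. This post only demonstrates two levels, so deeper nesting is untested here.

```python
def nested_facets(keys, max_results=None):
    """Build a Facets parameter where each key is a sub-facet of the previous one."""
    facet = None
    # Work from the innermost key outward so each level can wrap the next.
    for depth in range(len(keys) - 1, -1, -1):
        current = {"DocumentAttributeKey": keys[depth]}
        if depth > 0 and max_results is not None:
            current["MaxResults"] = max_results
        if facet is not None:
            current["Facets"] = [facet]
        facet = current
    return [facet] if facet is not None else []


# Equivalent to fac1 with max_results set to 10 (value chosen for illustration).
print(nested_facets(["Technology", "Document_Type"], max_results=10))
```

Reversing the list of keys produces the opposite hierarchy without changing anything else.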

The formatted version of the response is as follows:

Query:  How to encrypt data?
Number of results: 62
Document Title:  developerguide
Document Attributes:
  Document_Type: User_Guides
  Technology: Databases
Document Excerpt:
  4. Choose the option that you want for encryption at rest. Whichever
  option you choose, you can't   change it after the cluster is
  created. • To encrypt data at rest in this cluster, choose Enable
  encryption. • If you don't want to encrypt data at rest in this
  cluster, choose Disable encryption.
----------------------------------------------------------------------
Facets:
  Technology
    Databases:23
      Document_Type
        User_Guides:12
        Reference_Guides:7
        Release_Notes:4
    Storage:22
      Document_Type
        User_Guides:16
        Reference_Guides:6
    Compute:16
      Document_Type
        User_Guides:9
        Reference_Guides:6
        Release_Notes:1
======================================================================

The results are classified as per the Technology facet followed by Document_Type. In this case, looking at the facets, we know that 16 results are from User_Guides about Storage and 7 are from Reference_Guides related to Databases.
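In the response, each value of the parent facet carries its own nested FacetResults list, so the indented listing above can be produced with a short recursive function. The following is a sketch under that assumption; format_facets is a made-up name, and the sample data is hand-built from the Storage branch above rather than taken from a live response.

```python
def format_facets(facet_results, indent=0):
    """Return indented lines for facets and any sub-facets nested under values."""
    lines = []
    pad = " " * indent
    for facet in facet_results:
        lines.append(f"{pad}{facet['DocumentAttributeKey']}")
        for pair in facet.get("DocumentAttributeValueCountPairs", []):
            value = pair["DocumentAttributeValue"].get("StringValue")
            lines.append(f"{pad}  {value}:{pair['Count']}")
            # Sub-facets, when present, are nested under each value.
            lines.extend(format_facets(pair.get("FacetResults", []), indent + 4))
    return lines


# Hand-built sample covering only the Storage branch (illustration only).
sample_facets = [{
    "DocumentAttributeKey": "Technology",
    "DocumentAttributeValueCountPairs": [{
        "DocumentAttributeValue": {"StringValue": "Storage"},
        "Count": 22,
        "FacetResults": [{
            "DocumentAttributeKey": "Document_Type",
            "DocumentAttributeValueCountPairs": [
                {"DocumentAttributeValue": {"StringValue": "User_Guides"}, "Count": 16},
                {"DocumentAttributeValue": {"StringValue": "Reference_Guides"}, "Count": 6},
            ],
        }],
    }],
}]

formatted = format_facets(sample_facets)
print("\n".join(formatted))
```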

Let’s narrow down these results to be only from Reference_Guides related to Databases using the following filter:

att_filter1 = {
    "AndAllFilters": [
        {
            "EqualsTo":{
                "Key": "Technology",
                "Value": {
                    "StringValue": "Databases"
                }
            }
        },
        {
            "EqualsTo":{
                "Key": "Document_Type",
                "Value": {
                    "StringValue": "Reference_Guides"
                }
            }
        }
    ]
}

Now let’s make a query API call using the hierarchical facets with this filter:

kclient.query(IndexId=indexid, QueryText=kquery, Facets=fac1, AttributeFilter=att_filter1)

The formatted response to this is as follows:

Query:  How to encrypt data?
Query Filter: Technology: Databases AND Document_Type: Reference_Guides
Number of results: 7
Document Title:  redshift-api
Document Attributes:
  Document_Type: Reference_Guides
  Technology: Databases
Document Excerpt:
  ...Constraints: Maximum length of 2147483647.   Required: No
  KmsKeyId   The AWS Key Management Service (KMS) key ID of the
  encryption key that you want to use to encrypt data in the cluster.
  Type: String   Length Constraints: Maximum length of 2147483647.
  Required: No LoadSampleData   A flag...
----------------------------------------------------------------------
Facets:
  Technology
    Databases:7
      Document_Type
        Reference_Guides:7
======================================================================

From the facets of this response, there are seven results, all from Reference_Guides related to Databases, exactly as we knew before making the query.

Search and filtering with hierarchical facets with Technology as a sub-facet of Document_Type

You can choose the hierarchical relationship between different facets at the time of querying. Let’s define Technology as the sub-facet of Document_Type, as shown in the following code. This hierarchical relationship would be important for a Document_Type-focused user such as a technical writer.

fac2 = [{
    "DocumentAttributeKey":"Document_Type",
    "Facets":[{
        "DocumentAttributeKey":"Technology",
        "MaxResults": max_results
    }]
}]

The query API call is made as follows:

kclient.query(IndexId=indexid, QueryText=kquery, Facets=fac2)

The formatted response to this is as follows:

Query:  How to encrypt data?
Number of results: 62
Document Title:  developerguide
Document Attributes:
  Document_Type: User_Guides
  Technology: Databases
Document Excerpt:
  4. Choose the option that you want for encryption at rest. Whichever
  option you choose, you can't   change it after the cluster is
  created. • To encrypt data at rest in this cluster, choose Enable
  encryption. • If you don't want to encrypt data at rest in this
  cluster, choose Disable encryption.
----------------------------------------------------------------------
Facets:
  Document_Type
    User_Guides:37
      Technology
        Storage:16
        Databases:12
        Compute:9
    Reference_Guides:19
      Technology
        Databases:7
        Compute:6
        Storage:6
    Release_Notes:5
      Technology
        Databases:4
        Compute:1
======================================================================

The results are classified as per their Document_Type followed by Technology. In other words, reversing the hierarchical relationship transposes the matrix of result counts shown by the preceding facets. Six results are from Reference_Guides related to Compute. Let’s narrow the results down to these by defining the filter as follows:

att_filter2 = {
    "AndAllFilters": [
        {
            "EqualsTo":{
                "Key": "Document_Type",
                "Value": {
                    "StringValue": "Reference_Guides"
                }
            }
        },
        {
            "EqualsTo":{
                "Key": "Technology",
                "Value": {
                    "StringValue": "Compute"
                }
            }
        }
    ]
}

We use this filter to make the query API call:

kclient.query(IndexId=indexid, QueryText=kquery, Facets=fac2, AttributeFilter=att_filter2)

The formatted response to this is as follows:

Query:  How to encrypt data?
Query Filter: Document_Type: Reference_Guides AND Technology: Compute
Number of results: 6
Document Title:  ecr-api
Document Attributes:
  Document_Type: Reference_Guides
  Technology: Compute
Document Excerpt:
  When you use AWS KMS to encrypt your data, you can either use the
  default AWS managed AWS KMS key for Amazon ECR, or specify your own
  AWS KMS key, which you already created. For more information, see
  Protecting data using server-side encryption with an AWS KMS key
  stored in AWS Key Management Service
----------------------------------------------------------------------
Facets:
  Document_Type
    Reference_Guides:6
      Technology
        Compute:6
======================================================================

The results contain six Reference_Guides related to Compute, exactly as we knew before running the query.
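The transposition described in this section can be checked with a small sketch. The two dictionaries are hand-copied from the hierarchical facet outputs in this post; the function name transpose is an assumption for this illustration.

```python
def transpose(counts):
    """Swap the outer and inner keys of a nested {outer: {inner: count}} mapping."""
    flipped = {}
    for outer, inner_counts in counts.items():
        for inner, count in inner_counts.items():
            flipped.setdefault(inner, {})[outer] = count
    return flipped


# Document_Type as a sub-facet of Technology.
by_technology = {
    "Databases": {"User_Guides": 12, "Reference_Guides": 7, "Release_Notes": 4},
    "Storage": {"User_Guides": 16, "Reference_Guides": 6},
    "Compute": {"User_Guides": 9, "Reference_Guides": 6, "Release_Notes": 1},
}

# Technology as a sub-facet of Document_Type.
by_document_type = {
    "User_Guides": {"Storage": 16, "Databases": 12, "Compute": 9},
    "Reference_Guides": {"Databases": 7, "Compute": 6, "Storage": 6},
    "Release_Notes": {"Databases": 4, "Compute": 1},
}

print(transpose(by_technology) == by_document_type)
```

Both hierarchies carry the same counts; the choice between them only changes which attribute the user sees first.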

Clean up

To avoid incurring future costs, clean up the resources you created as part of this solution. If you created a new Amazon Kendra index while testing this solution, delete it. If you only added a new data source using the Amazon Kendra connector for Amazon S3, delete that data source. If you created an Amazon S3 bucket to store the data used, delete that as well.
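The Amazon Kendra part of the cleanup can also be scripted. The following is a hedged sketch, not a complete teardown: the function name clean_up, its parameters, and the IDs are assumptions, and it does not touch the S3 bucket. It expects a Boto3 Amazon Kendra client, whose delete_index and delete_data_source calls it uses; the DryRunClient stand-in only records calls so the logic can be tried without deleting anything.

```python
def clean_up(kendra_client, index_id, data_source_id=None, index_is_new=False):
    """Delete the index if it was created for this post, else only the data source."""
    if index_is_new:
        kendra_client.delete_index(Id=index_id)
        return "index"
    if data_source_id is not None:
        kendra_client.delete_data_source(Id=data_source_id, IndexId=index_id)
        return "data_source"
    return "nothing"


class DryRunClient:
    """Stand-in that records calls instead of deleting real resources."""

    def __init__(self):
        self.calls = []

    def delete_index(self, **kwargs):
        self.calls.append(("delete_index", kwargs))

    def delete_data_source(self, **kwargs):
        self.calls.append(("delete_data_source", kwargs))


# Dry run with hypothetical IDs; pass a real Kendra client to delete for real.
dry_run = DryRunClient()
clean_up(dry_run, "hypothetical-index-id",
         data_source_id="hypothetical-data-source-id")
print(dry_run.calls)
```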

Conclusion

You can use Amazon Kendra hierarchical facets to define a hierarchical relationship between attributes to provide granular information about the scope of the results in the response to a query. This enables you to make an informed filtering choice to narrow down the search results and find the documents you’re looking for quickly.

To learn more about facets and filters in Amazon Kendra, refer to Filtering queries.

For more information on how you can automatically create, modify, or delete metadata, which you can use for faceting the search results, refer to Customizing document metadata during the ingestion process and Enrich your content and metadata to enhance your search experience with custom document enrichment in Amazon Kendra.


About the Authors

Abhinav Jawadekar is a Principal Solutions Architect focused on Amazon Kendra in the AI/ML language services team at AWS. Abhinav works with AWS customers and partners to help them build intelligent search solutions on AWS.

Ji Kim is a Software Development Engineer at Amazon Web Services and is a member of the Amazon Kendra team.