AWS Machine Learning Blog

Index website contents using the Amazon Q Web Crawler connector for Amazon Q Business

Amazon Q Business is a fully managed service that lets you build interactive chat applications using your enterprise data. These applications can generate answers based on your data or on the knowledge of a large language model (LLM). Your data is not used for training purposes, and the answers provided by Amazon Q Business are based solely on the data users have access to.

Enterprise data is often distributed across different sources, such as documents in Amazon Simple Storage Service (Amazon S3) buckets, database engines, websites, and more. In this post, we demonstrate how to create an Amazon Q Business application and index website contents using the Amazon Q Web Crawler connector for Amazon Q Business.

For this example, we use two data sources (websites). The first data source is an employee onboarding guide from a fictitious company, which requires basic authentication. We demonstrate how to set up authentication for the Web Crawler. The second data source is the official documentation for Amazon Q Business. For this data source, we demonstrate how to apply advanced settings, such as regular expressions, to instruct the Web Crawler to crawl only pages and links related to Amazon Q Business, ignoring pages related to other AWS services.

Overview of the Amazon Q Web Crawler connector

The Amazon Q Web Crawler connector makes it possible to crawl websites that use HTTPS and index their contents so you can build a generative artificial intelligence (AI) experience for your users based on the indexed data. This connector relies on the Selenium Web Crawler Package and a Chromium driver. The connector is fully managed and updates to these components are applied automatically without your intervention.

This connector crawls and indexes the contents of webpages and attachments. Amazon Q Business supports multiple connectors, and each connector has its own properties and entities that it considers documents. In the context of the Web Crawler connector, a document refers to the contents of a single page or a single attachment. Separately, an index is commonly referred to as a corpus of documents; think of it as the place where you add and sync your documents for Amazon Q Business to use for generating answers to user requests.

Each document has its own attributes, also known as metadata. Metadata can be mapped to fields in your Amazon Q Business index. By creating index fields, you can boost results based on document attributes. For example, there might be use cases where you want to give more relevance to results from a specific category, department, or creation date.

Amazon Q Business data source connectors are designed to crawl the default attributes in your data source automatically. You can also add custom document attributes and map them to custom fields in your index. To learn more, see Mapping document attributes in Amazon Q Business.

For a better understanding of what is indexed by the Web Crawler connector, we present a list of metadata indexed from webpages and attachments.

The following table lists webpage metadata indexed by the Amazon Q Web Crawler connector.

Field | Data Source Field | Amazon Q Business Index Field (reserved) | Field Type
Category | category | _category | String
URL | sourceUrl | _source_uri | String
Title | title | _document_title | String
Meta Tags | metaTags | wc_meta_tags | String List
File Size | htmlSize | wc_html_size | Long (numeric)

The following table lists attachments metadata indexed by the Amazon Q Web Crawler connector.

Field | Data Source Field | Amazon Q Business Index Field (reserved) | Field Type
Category | category | _category | String
URL | sourceUrl | _source_uri | String
File Name | fileName | wc_file_name | String
File Type | fileType | wc_file_type | String
File Size | fileSize | wc_file_size | Long (numeric)
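
To make the two tables concrete, the following Python sketch expresses the same mappings as dictionaries and renames the keys of a crawled item to their index field names. The helper to_index_fields and the sample page values are our own illustration and not part of the connector, which performs this mapping for you.

```python
# Data source field -> reserved Amazon Q Business index field,
# copied from the two metadata tables in this post.
WEBPAGE_FIELDS = {
    "category": "_category",
    "sourceUrl": "_source_uri",
    "title": "_document_title",
    "metaTags": "wc_meta_tags",
    "htmlSize": "wc_html_size",
}

ATTACHMENT_FIELDS = {
    "category": "_category",
    "sourceUrl": "_source_uri",
    "fileName": "wc_file_name",
    "fileType": "wc_file_type",
    "fileSize": "wc_file_size",
}


def to_index_fields(item, mapping):
    """Rename known data source fields to index fields; unknown keys are dropped."""
    return {mapping[key]: value for key, value in item.items() if key in mapping}


# Hypothetical crawled page, for illustration only.
page = {
    "title": "Welcome",
    "sourceUrl": "https://example.com/welcome.html",
    "htmlSize": 2048,
}
print(to_index_fields(page, WEBPAGE_FIELDS))
```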

When configuring the data source for your website, you can use URLs or sitemaps, which can be defined either manually or using a text file stored in Amazon S3.

To enforce secure access to protected websites, the Amazon Q Web Crawler supports the following authentication types and standards:

  • Basic authentication
  • NTLM/Kerberos authentication
  • Form-based authentication
  • SAML authentication

Unlike other data source connectors, the Amazon Q Web Crawler connector doesn’t support access control list (ACL) crawling or identity crawling.

Lastly, you have a range of options for configuring how and what data is synchronized. For example, you can choose to synchronize website domains only, website domains with subdomains only, or website domains with subdomains and the webpages included in links. Additionally, you can use regular expressions to filter which URLs to include or exclude in the crawling process.
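
One way to picture the three sync ranges is as a predicate applied to every link the crawler discovers. The following Python sketch is only our reading of those options: the function name, the mode labels, and the rule that a subdomain is any host ending in the starting host are assumptions, and the connector's actual matching logic may differ.

```python
from urllib.parse import urlparse


def in_sync_scope(seed_url, link_url, mode):
    """Decide whether a discovered link falls inside the chosen sync range.

    mode is one of our own labels (not console values):
      "domains_only"    - same host as the starting point URL
      "with_subdomains" - same host, or a subdomain of it
      "everything"      - any page that a crawled page links to
    """
    seed_host = urlparse(seed_url).hostname or ""
    link_host = urlparse(link_url).hostname or ""
    if mode == "domains_only":
        return link_host == seed_host
    if mode == "with_subdomains":
        return link_host == seed_host or link_host.endswith("." + seed_host)
    if mode == "everything":
        return True
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical URLs, for illustration only.
seed = "https://example.com/"
print(in_sync_scope(seed, "https://docs.example.com/guide.html", "domains_only"))
print(in_sync_scope(seed, "https://docs.example.com/guide.html", "with_subdomains"))
```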

Overview of solution

At a high level, this solution consists of an Amazon Q Business application that uses two data sources: a website hosting documents related to an employee onboarding guide, and the Amazon Q Business official documentation website. This solution demonstrates how to configure both websites as data sources for the Amazon Q Business application. The walkthrough consists of the following steps:

  1. Deploy an AWS CloudFormation template containing a static website secured with basic authentication.
  2. Create an Amazon Q Business application.
  3. Create a Web Crawler data source for the Amazon Q Business documentation.
  4. Create a Web Crawler data source for the employee onboarding guide.
  5. Add groups and users to the Amazon Q Business application.
  6. Run sample queries to test the solution.

You can follow along using one or both data sources provided in this post or try your own URLs.

Prerequisites

To follow along with this demo, you should have the following prerequisites:

  • An AWS account with privileges to create Amazon Q Business applications and AWS Identity and Access Management (IAM) roles and policies
  • An IAM Identity Center instance with at least one user (and optionally, one or more groups)
  • If you decide to use a public website, make sure you have permission to crawl the website
  • Optionally, privileges to deploy CloudFormation templates

Deploy a CloudFormation template for the employee onboarding website secured with basic authentication

Deploying this CloudFormation template is optional, but we recommend using it so you can learn more about how the Web Crawler connector works with websites that require authentication.

We start by deploying a CloudFormation template. This template will create a simple static website secured with basic authentication.

  1. On the AWS CloudFormation console, choose Create stack and choose With new resources (standard).
  2. Select Choose an existing template.
  3. For Specify template, select Amazon S3 URL.
  4. For Amazon S3 URL, enter the URL https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/ML-16532/template-website.yml
  5. Choose Next.
  6. For Stack name, enter a name. For example, onboarding-website-for-q-business-sample.
  7. Choose Next.
  8. Leave all options in Configure stack options as default and choose Next.
  9. On the Review and create page, select I acknowledge that AWS CloudFormation might create IAM resources, then choose Submit.

The deployment process will take a few minutes to complete. You can move to the next section of this post while it’s in process. Keep this tab open—you’ll need to refer to the Outputs tab later.

Create an Amazon Q Business application

Before you start creating Amazon Q Business applications, you are required to enable and configure an IAM Identity Center instance. This step is mandatory because Amazon Q Business integrates with IAM Identity Center to manage user access to your Amazon Q Business applications. If you don’t have an IAM Identity Center instance set up when trying to create your first application, you will see the option to create one, as shown in the following screenshot.

Create IAM Identity Center

If you already have an IAM Identity Center instance set up, you’re ready to start creating your first application by following these steps:

  1. On a new tab in your browser, open the Amazon Q Business console.
  2. Choose Get started or Create application (options will vary based on whether it’s your first time trying the service).
  3. For Application name, enter a name for your application, for example, my-q-business-app.
  4. For Service access, select Create and use a new service-linked role (SLR).
  5. Choose Create.
  6. For Retrievers, select Use native retriever.
  7. For Index provisioning, enter 1 for Number of units. One unit can index 20,000 documents (a document in this context is either a single page of content or a single attachment).
  8. Choose Next.
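
If you plan to index more than a handful of pages, the capacity figure in step 7 turns into simple arithmetic. The following Python sketch estimates the number of index units from an expected document count; the function name is our own, and it assumes the 20,000 documents per unit stated above is the only limit that matters.

```python
import math

DOCS_PER_UNIT = 20_000  # capacity of one index unit, as stated in step 7


def units_needed(document_count):
    """Estimate index units for a number of documents (pages plus attachments)."""
    if document_count < 0:
        raise ValueError("document_count must not be negative")
    # At least one unit is provisioned, even for an empty index.
    return max(1, math.ceil(document_count / DOCS_PER_UNIT))


print(units_needed(20_000))  # exactly fills one unit
print(units_needed(20_001))  # one document over, so a second unit is needed
```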

Create a Web Crawler data source for the Amazon Q Business documentation

After you complete the steps in the previous section, you should see the Connect data sources page, as shown in the following screenshot.

Connect data sources

If you closed the tab by accident, you can get to this page by navigating to the Amazon Q Business console, choosing your application name, and then choosing Add data source.

Let’s create the data source for the Amazon Q Business documentation website:

  1. On the Connect data sources page, choose Web crawler.
  2. For Data source name, enter a name, for example, q-business-documentation
  3. For Description, enter a description.
  4. For Source, you have the option to provide either URLs or sitemaps. For this example, select Source URLs and enter the URL of the official documentation of Amazon Q: https://docs.aws.amazon.com/amazonq/

Starting point URLs can be added directly in this UI (up to 10), or you could use a file hosted in Amazon S3 to list up to 100 starting point URLs. Likewise, sitemap URLs can be added in this UI (up to three), or you could add up to three sitemap XML files hosted in Amazon S3.

We refer to source URLs as starting point URLs; later in this post, you’ll have the opportunity to define what gets crawled, for example, domains and subdomains that the webpages might link to. It’s worth mentioning that the Web Crawler connector can only work with HTTPS.
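
Before pasting a long list of starting point URLs, it can help to check them against the limits just described. The following Python sketch does that under the assumption that the only rules are the ones stated in this post (HTTPS only, up to 10 URLs in the UI, up to 100 in a file in Amazon S3); the function and constant names are hypothetical.

```python
from urllib.parse import urlparse

MAX_URLS_IN_CONSOLE = 10   # starting point URLs entered directly in the UI
MAX_URLS_IN_S3_FILE = 100  # starting point URLs listed in a file in Amazon S3


def check_starting_urls(urls, from_s3_file=False):
    """Return a list of problems; an empty list means the URLs pass these checks."""
    limit = MAX_URLS_IN_S3_FILE if from_s3_file else MAX_URLS_IN_CONSOLE
    problems = []
    if len(urls) > limit:
        problems.append(f"{len(urls)} URLs given, but the limit is {limit}")
    for url in urls:
        parsed = urlparse(url)
        # The Web Crawler connector only works with HTTPS.
        if parsed.scheme != "https" or not parsed.netloc:
            problems.append(f"not an HTTPS URL: {url}")
    return problems


print(check_starting_urls(["https://docs.aws.amazon.com/amazonq/"]))
print(check_starting_urls(["http://example.com/"]))
```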

  1. Select No authentication in the Authentication section because this is a public website.
  2. The Web proxy section is optional, so we leave it empty.
  3. For Configure VPC and security group, select No VPC.
  4. In the IAM role section, choose Create a new service role.
  5. In the Sync scope section, for Sync domain range, select Sync domains with subdomains only.
  6. For Maximum file size, you can keep the default value of 50 MB.
  7. Under Additional configuration, expand Scope settings.
  8. Leave Crawl depth set to 2, Maximum links per page set to 999, and Maximum throttling set to 300.

If you open the Amazon Q official documentation, you’ll see that there are links to Amazon Q Developer documentation and other AWS services. Because we’re only interested in crawling Amazon Q Business, we need to instruct the crawler to focus only on relevant links and pages related to Amazon Q Business. To achieve this, we use regular expressions to define exactly what URLs the crawler should crawl.

  1. Under Crawl URL Patterns, enter the following expressions one by one, and choose Add:
    1. ^https:\/\/docs\.aws\.amazon\.com\/amazonq\/$
    2. ^https:\/\/docs\.aws\.amazon\.com\/amazonq\/latest\/qbusiness-ug\/.*\.html$
    3. ^https:\/\/docs\.aws\.amazon\.com\/amazonq\/latest\/business-use-dg\/.*\.html$

List of URLs to crawl
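
To see what these three patterns let through, you can try them locally before running a sync. The following Python sketch uses the standard re module, which may not behave identically to the regular expression engine used by the crawler, and the page names in the sample URLs are hypothetical.

```python
import re

# The three crawl URL patterns from this walkthrough, unchanged.
CRAWL_PATTERNS = [
    re.compile(r"^https:\/\/docs\.aws\.amazon\.com\/amazonq\/$"),
    re.compile(r"^https:\/\/docs\.aws\.amazon\.com\/amazonq\/latest\/qbusiness-ug\/.*\.html$"),
    re.compile(r"^https:\/\/docs\.aws\.amazon\.com\/amazonq\/latest\/business-use-dg\/.*\.html$"),
]


def should_crawl(url):
    """Return True if the URL matches at least one crawl pattern."""
    return any(pattern.search(url) for pattern in CRAWL_PATTERNS)


# Hypothetical page names, chosen only to exercise each pattern.
samples = [
    "https://docs.aws.amazon.com/amazonq/",
    "https://docs.aws.amazon.com/amazonq/latest/qbusiness-ug/example-page.html",
    "https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/example-page.html",
    "https://docs.aws.amazon.com/s3/",
]
for url in samples:
    print(should_crawl(url), url)
```

Note that the first pattern matches the starting point URL itself; the URL you enter under Source URLs has to be covered by one of the patterns.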

  1. In the Sync mode section, select Full sync. This option makes it possible to sync all contents regardless of their previous status.
  2. In the Sync run schedule section, you define how often Amazon Q Business should sync this data source. For Frequency, select Run on demand.

Choosing this option means you must manually run the sync operation; this option is suitable given the simplicity of this example. For production workloads, you’ll want to define a schedule tailored to your needs, for example, hourly, daily, or weekly, or you could define your own schedule using a cron expression.

  1. The Tags section is optional, so we leave it empty.

The default values in the Field mappings section can’t be changed at this point. They can only be modified after the application and retriever have been created.

  1. Choose Add data source and wait a couple of seconds while changes are applied.

After the data source is created, you will be shown the same interface you saw at the beginning of this section, with the note that one Web Crawler data source has been added. Keep this tab open, because you’ll create a second data source for the employee onboarding guide in the next section.

Web crawler added

Create a Web Crawler data source for the employee onboarding guide

Complete the following steps to create your second data source:

  1. On the Connect data sources page, choose Web crawler.
  2. Keep this tab open and navigate back to the AWS CloudFormation console tab and verify the stack’s status is CREATE_COMPLETE.
  3. If the status of the stack is CREATE_COMPLETE, choose the Outputs tab of the stack you deployed.
  4. Note the URL, user name, and password (the following screenshot shows sample values).

Website settings

  1. Choose the link for WebsiteURL.

Although unlikely, if the URL isn’t working, it might be because Amazon CloudFront hasn’t finished replicating the website. In that case, you should wait a couple of minutes and try again.

  1. Sign in with your user name and password.

Basic auth login form

You should now be able to browse the employee onboarding guide. Take a few minutes to get familiar with the contents of the website, because you’ll be asking your Amazon Q Business application questions about this content in a later step.

  1. Return to the browser tab where you’re creating the new data source.
  2. For Data source name, enter a name, for example, onboarding-guide.
  3. For Source, select Source URLs and enter the website URL you saved earlier.
  4. For Authentication, select Basic authentication.
  5. Under Authentication credentials, for AWS Secrets Manager secret, choose Create and add new secret.

Create and add secret

  1. For Secret name, enter a secret name of your preference.
  2. For User name and Password, use the values you saved earlier and make sure there are no extra whitespaces.
  3. Choose Save.

These credentials will be stored as a secret in AWS Secrets Manager.

Depending on the type of authentication you use, you’ll need certain fields present in your secret, as shown in the following table.

Authentication Type | Fields present in secret
Form based | username, password, userNameFieldXpath, passwordFieldXpath, passwordButtonXpath, loginPageUrl
NTLM | username, password
Basic auth | username, password
No Authentication | NA
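
If you prefer to prepare the secret contents outside the console, the table above translates into a small validation step. The following Python sketch builds the JSON string for a secret and rejects missing fields; the key names come from the table, while the function name and the short labels for the authentication types ("form", "ntlm", "basic", "none") are our own and not values used by the service.

```python
import json

# Required secret keys per authentication type, taken from the table above.
REQUIRED_SECRET_FIELDS = {
    "form": (
        "username",
        "password",
        "userNameFieldXpath",
        "passwordFieldXpath",
        "passwordButtonXpath",
        "loginPageUrl",
    ),
    "ntlm": ("username", "password"),
    "basic": ("username", "password"),
    "none": (),
}


def build_secret_string(auth_type, **fields):
    """Return the secret as a JSON string, or raise ValueError if a field is missing."""
    required = REQUIRED_SECRET_FIELDS[auth_type]
    # Trim surrounding whitespace, a common cause of failed logins.
    cleaned = {name: str(fields.get(name, "")).strip() for name in required}
    missing = [name for name, value in cleaned.items() if not value]
    if missing:
        raise ValueError(f"missing fields for {auth_type}: {', '.join(missing)}")
    return json.dumps(cleaned)


# Placeholder credentials, for illustration only.
print(build_secret_string("basic", username=" alice ", password="example-password"))
```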
  1. Leave the Web proxy section empty.
  2. In the Configure VPC and security group section, select No VPC.
  3. For IAM role, choose Create a new service role.
  4. In the Sync scope section, select Sync domains with subdomains only.
  5. In the Sync mode section, select Full sync.
  6. For Sync run schedule, choose Run on demand.
  7. Leave the sections Tags and Field mappings with their default values.
  8. Choose Add data source and wait a couple of seconds while changes are applied.

After changes are applied, the Connect data sources page shows two Web Crawler data sources have been added.

Two web crawlers have been added

  1. Scroll down to the end of the page and choose Next.

We have added our two data sources. In the next section, we add groups and users to our Amazon Q Business application.

Add groups and users to the Amazon Q Business application

Complete the following steps to add groups and users:

  1. On the Add groups and users page, choose Add groups and users.
  2. Select Assign existing users and groups and choose Next.

If you’ve completed the prerequisite of setting up IAM Identity Center, you’ve likely added at least one user. Although it’s not mandatory, we recommend creating multiple users and groups. This will enable you to fully explore and understand all the features of Amazon Q Business beyond what’s covered in this post.

If you haven’t added any users to your Identity Center directory, you can create them here by choosing Add new users. However, you’ll need to complete additional steps, such as setting up their passwords on the IAM Identity Center console. To fully benefit from this tutorial, we recommend having active users and groups by the time you reach this step.

  1. In the search bar, enter either the display name or group name you want to add to the application.

Start typing name

  1. Choose the user (or group) and choose Assign.

If you added a group, you’ll see it on the Groups tab. If you added a user, you’ll see it on the Users tab.

The next step is choosing a subscription for your groups or users.

  1. Select the user (or group) you just added, and on the Current subscription dropdown menu, choose your subscription tier. For this example, we choose Q Business Pro.

Assign Q Business license

This is a good time to get familiar with the Amazon Q Business subscription tiers and pricing. For this example, we use Q Business Pro, but you could also use a Q Business Lite subscription.

  1. In the Web experience service access section, select Create and use a new service role.

A web experience is the chat interface that your users will utilize to ask questions and perform tasks.

  1. Choose Create application.

After the application is created successfully, you’ll be redirected to the Amazon Q Business console, where you can see your new application. Your application is ready, but the data sources haven’t synced any data yet. We’ll do that in the next steps.

  1. Choose the name of your new application to open the Application Details.

Q Business Application

  1. In the Data sources section, select each data source and choose Sync now.

You will see the Current sync state for both data sources as Syncing. This process might take several minutes.

After the data sources are synced, you will see their Last sync status as Completed.

Sync completed
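
The same sync can also be started and monitored from a script. The following Python sketch assumes the boto3 qbusiness client operations start_data_source_sync_job and list_data_source_sync_jobs and the status strings shown in the code; verify both against the current boto3 documentation before relying on them. The function takes the client as a parameter, and the IDs in the usage comment are placeholders.

```python
import time

# Assumed terminal job states; the console shows a successful sync as Completed.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "INCOMPLETE", "ABORTED"}


def run_sync_and_wait(client, application_id, index_id, data_source_id,
                      poll_seconds=30, max_polls=120):
    """Start a sync job and poll until it reaches a terminal state."""
    started = client.start_data_source_sync_job(
        applicationId=application_id,
        indexId=index_id,
        dataSourceId=data_source_id,
    )
    execution_id = started["executionId"]
    for _ in range(max_polls):
        response = client.list_data_source_sync_jobs(
            applicationId=application_id,
            indexId=index_id,
            dataSourceId=data_source_id,
        )
        # Look only at the job we started (first page of results only).
        for job in response.get("history", []):
            if job.get("executionId") == execution_id and job.get("status") in TERMINAL_STATES:
                return job["status"]
        time.sleep(poll_seconds)
    raise TimeoutError("sync job did not finish within the polling window")


# Usage (requires boto3 and AWS credentials; IDs are placeholders):
#   import boto3
#   client = boto3.client("qbusiness")
#   print(run_sync_and_wait(client, "APPLICATION_ID", "INDEX_ID", "DATA_SOURCE_ID"))
```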

You’re now ready to test your application! Keep this page open because you’ll need it for next steps.

Run sample queries to test the solution

At this point, you have created an Amazon Q Business application, added two data sources using the Amazon Q Web Crawler connector, added users to the application, and synchronized all data sources.

The next step is going through the full user experience of logging in to the application and running a few queries to test it.

  1. On the Application Details page, navigate to the Web experience settings tab.
  2. Choose the link under Deployed URL.

Web experience settings tab

You’ll be redirected to the AWS access portal URL, which is set up by IAM Identity Center.

  1. Enter the user name of a user previously added to your Amazon Q Business application and choose Next.

You’re now on your Amazon Q Business app and ready to start asking questions!

  1. Enter your question (prompt) in the Enter a prompt text field and press Enter.

For this example, we start by asking questions related to the employee onboarding website.

Amazon Q Business Conversation

Amazon Q Business uses the onboarding guide data source you created earlier. If you choose Sources, you’ll see a list of in-text source citations in the form of a numbered list.

Now we ask questions related to the Amazon Q Business documentation.

Amazon Q Business conversation

Try it out with your own prompts!

Troubleshooting

In this section, we discuss several common issues and how to troubleshoot:

  • Amazon Q Business isn’t answering your questions – If Amazon Q Business isn’t answering your questions, it’s likely due to your data not being indexed correctly. To verify, open the Application Details page and check that the Last sync status of each data source is Completed.
  • The Web Crawler is unable to sync – If you used a starting point URL different from this post and the Web Crawler can’t sync, it might be due to permissions. If the website requires authentication, refer to the section Create a Web Crawler data source for the employee onboarding guide for more information. Another common scenario is when settings on the web server or firewalls prevent the Web Crawler from accessing the data. Lastly, we recommend checking whether a robots.txt file on your web server is explicitly denying access to the Web Crawler. For more details on how to configure a robots.txt file, refer to Configuring a robots.txt file for Amazon Q Business Web Crawler.
  • Amazon Q Business answers questions using old data – When you create a data source, you have the option to tell Amazon Q Business how often it should sync your data source with your index. During the creation of our data sources, we chose to sync the data sources manually (Run on demand), which means the sync process will occur only when we choose Sync now on our data source. For more information, refer to Sync run schedule.
  • Amazon Q Business provides an inaccurate answer or no answer at all – In situations where Amazon Q Business is providing an inaccurate answer, incomplete answers, or no answer at all, we recommend looking at the format of the data. Is the data part of an image? Is the data in a tabular format? Amazon Q Business works best with unstructured, plain text data.
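
For the robots.txt check mentioned above, Python's standard urllib.robotparser can tell you whether a given user agent is allowed to fetch a URL. In the following sketch, the user agent string amazon-QBusiness is an assumption that you should confirm in the robots.txt guide referenced above, and the sample robots.txt content and URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed user agent of the Web Crawler; confirm it in the robots.txt guide.
CRAWLER_USER_AGENT = "amazon-QBusiness"

# Hypothetical robots.txt: the crawler may read everything except /private/,
# and every other bot is denied.
SAMPLE_ROBOTS_TXT = """\
User-agent: amazon-QBusiness
Disallow: /private/

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, CRAWLER_USER_AGENT, "https://example.com/docs/index.html"))
print(is_allowed(SAMPLE_ROBOTS_TXT, CRAWLER_USER_AGENT, "https://example.com/private/notes.html"))
```

To test a live site instead of an inline string, you can download its robots.txt file and pass the text to the same function.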

Document enrichment

Although not covered in this post, we recommend exploring document enrichment. This functionality allows you to manipulate and enrich document attributes prior to being added to an index. The following are a couple of ideas for advanced applications of document enrichment:

  • Run an AWS Lambda function that sends your document to Amazon Textract. This service uses optical character recognition (OCR) to extract text from images containing handwriting, forms, tables, and more.
  • Use Amazon Transcribe to convert videos or audio files in your documents into text.
  • Use Amazon Comprehend to detect and redact personally identifiable information (PII).

Clean up

To avoid incurring extra costs after you finish testing the solution, clean up the resources you created as part of this walkthrough.

Let’s start by deleting the Amazon Q Business application.

  1. On the Amazon Q Business console, select your application from the application list and on the Actions menu, choose Delete.

Delete Q Business application

  1. Confirm its deletion by entering Delete, then choose Delete.

You might be asked to complete an optional survey on your reasons for application deletion. You can select multiple reasons (or none), then choose Submit.

The next step is to delete the CloudFormation stack responsible for deploying the employee onboarding website we used as a data source.

  1. On the CloudFormation console, select the stack you created at the beginning of this walkthrough and choose Delete.

Delete Cloudformation stack

  1. Choose Delete to confirm the stack deletion.

The stack deletion might take a few minutes. When the deletion is complete, you’ll see the stack has been removed from your list of stacks.

Optionally, if you enabled IAM Identity Center only for this tutorial and want to delete your IAM Identity Center instance, follow these steps:

  1. On the IAM Identity Center console, choose Settings in the navigation pane.

IAM identity center settings

  1. Choose the Management tab.

IAM IDC management

  1. Choose Delete.
  2. Select the acknowledgement check boxes, enter your instance, and choose Confirm.

Conclusion

The Amazon Q Business Web Crawler allows you to connect websites to your Amazon Q Business applications. This connector supports multiple forms of authentication (if required by your website) and can run sync jobs on a defined schedule.

To learn more about Amazon Q Business and its features, refer to the Amazon Q Business Developer Guide. For a comprehensive list of what can be done with this connector, refer to Connecting Web Crawler to Amazon Q Business.


About the Author

Guillermo Mansilla is a Senior Solutions Architect based in Orlando, Florida. He has had the opportunity to collaborate with startups and enterprise customers in the USA and Canada, assisting them in building and architecting their applications on AWS. Guillermo has developed a keen interest in serverless architectures and generative AI applications. Prior to his current role, he gained over a decade of experience working as a software developer. Away from work, Guillermo enjoys participating in chess tournaments at his local chess club, a pursuit that allows him to exercise his analytical skills in a different context.