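Enrich your AWS Glue Data Catalog with generative AI metadata using Amazon Bedrock

This post was originally published on 2024/11/15 as Enrich your AWS Glue Data Catalog with generative AI metadata using Amazon Bedrock. The Japanese translation was prepared by Solutions Architect Watanabe (渡邉).

Metadata plays a critical role when you use data assets to make data-driven decisions. Generating metadata for data assets is often manual and time-consuming. With generative AI, you can automate the generation of comprehensive metadata for your data assets based on their documentation, which strengthens data discovery, data understanding, and overall data governance within your AWS Cloud environment. This post shows how to enrich your AWS Glue Data Catalog with dynamic metadata using foundation models (FMs) on Amazon Bedrock and your data documentation.

AWS Glue is a serverless data integration service that makes it easy for analytics users to discover, prepare, move, and integrate data from multiple sources. Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API.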

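Solution overview

In this solution, we automatically generate metadata for table definitions in the Data Catalog by using large language models (LLMs) through Amazon Bedrock. First, we explore in-context learning, where the LLM generates the requested metadata without documentation. Then we improve the metadata generation by adding the data documentation to the LLM prompt using Retrieval Augmented Generation (RAG).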

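AWS Glue Data Catalog

This post uses the AWS Glue Data Catalog, a centralized metadata repository for your data assets across various data sources. The Data Catalog provides a unified interface to store and query information about data formats, schemas, and sources. It acts as an index to the location, schema, and runtime metrics of your data sources.

The most common way to populate the Data Catalog is to use an AWS Glue crawler, which automatically discovers and catalogs data sources. When you run the crawler, it creates metadata tables that are added to a database you specify or to the default database. Each table represents a single data store.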

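Generative AI models

LLMs are trained on vast volumes of data and use billions of parameters to generate outputs for common tasks such as answering questions, translating languages, and completing sentences. To use an LLM for a specific task such as metadata generation, you need an approach that guides the model to produce the output you expect.

This post shows how to generate descriptive metadata for your data with two different approaches:

In-context learning
Retrieval Augmented Generation (RAG)

The solution uses two generative AI models available in Amazon Bedrock: Anthropic's Claude 3 for text generation and Amazon Titan Text Embeddings V2 for generating embeddings.

The following sections describe the implementation details of each approach using Python. The accompanying code is available in the GitHub repository. You can implement it step by step in Amazon SageMaker Studio, JupyterLab, or your own environment. If you are new to SageMaker Studio, check out the Quick setup experience, which lets you launch it with default settings in minutes. You can also use the code in an AWS Lambda function or your own application.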

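Approach 1: In-context learning

In this approach, you use an LLM to generate the metadata descriptions. You use prompt engineering to instruct the LLM on the output you want it to generate. This approach is best suited to AWS Glue databases with a small number of tables, because you can send the table information from the Data Catalog as context in your prompt without exceeding the context window (the number of input tokens that most Amazon Bedrock models accept). The following diagram illustrates this architecture.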
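
As a rough way to decide whether in-context learning is viable for a given database, you can estimate the prompt size before sending it. The following is a minimal sketch that is not part of the original notebook. It assumes a heuristic of about four characters per token (the real ratio depends on the model's tokenizer) and a 200,000-token context window; the function name fits_in_context and all default values are illustrative assumptions.

```python
import json


def fits_in_context(tables, context_window_tokens=200_000,
                    reserved_output_tokens=4_096, chars_per_token=4):
    """Roughly estimate whether the serialized catalog fits in the prompt.

    chars_per_token is a heuristic assumption, not a tokenizer-accurate count.
    """
    catalog_json = json.dumps(tables, default=str)
    estimated_tokens = len(catalog_json) // chars_per_token + 1
    # Leave room for the model's answer inside the same window
    budget = context_window_tokens - reserved_output_tokens
    return estimated_tokens <= budget, estimated_tokens


# Example with a tiny, made-up table definition
sample_tables = [
    {"Name": "persons",
     "StorageDescriptor": {"Columns": [{"Name": "id", "Type": "string"}]}}
]
fits, estimate = fits_in_context(sample_tables)
print(f"fits={fits}, estimated_tokens={estimate}")
```

If the estimate approaches the budget, that is a signal to send only the relevant tables or to move to the RAG approach.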

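Approach 2: Retrieval Augmented Generation (RAG)

If you have hundreds of tables, adding all of the Data Catalog information as context can produce a prompt that exceeds the LLM's context window. In some cases, you also have additional content, such as business requirements documents or technical documentation, that you want the FM to reference before generating the output. Such documents can be several pages long and typically exceed the maximum number of input tokens most LLMs accept, so they can't be included in the prompt as they are.

The solution is to use a RAG approach. With RAG, you can optimize the output of an LLM so that it references an authoritative knowledge base outside of its training data sources before generating a response. RAG extends the LLM to a specific domain or an organization's internal knowledge base without the need to fine-tune the model. It is a cost-effective way to improve LLM output so that it remains relevant, accurate, and useful in various contexts.

With RAG, the LLM can reference technical documents and other information about your data before generating the metadata. As a result, the generated descriptions are expected to be richer and more accurate.

The example in this post ingests data from a public Amazon Simple Storage Service (Amazon S3) location: s3://awsglue-datasets/examples/us-legislators/all. The dataset contains data in JSON format about United States legislators and the seats they have held in the US House of Representatives and the US Senate. The data documentation was retrieved from Popolo (http://www.popoloproject.com/).

The following architecture diagram illustrates the RAG approach.

The steps are as follows:

Ingest the information from the data documentation. The documentation can come in a variety of formats. In this post, the documentation is a website.
Chunk the contents of the HTML pages of the data documentation. Generate and store vector embeddings for the data documentation.
Fetch information about the database tables from the Data Catalog.
Perform a similarity search in the vector store and retrieve the most relevant information.
Build the prompt. Provide instructions on how to create the metadata, and add the retrieved information and the Data Catalog table information as context. Because this is a rather small database containing six tables, all of the information about the database is included.
Send the prompt to the LLM, get the response, and update the Data Catalog.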
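
The prompt-building step above can be sketched in plain Python. This is an illustration only, not the notebook's implementation: the helper name build_rag_prompt, the max_context_chars budget, and the shortened instruction text are assumptions. It shows the idea of placing retrieved documentation chunks and the catalog JSON inside separate tags, and of dropping chunks once a size budget is reached.

```python
def build_rag_prompt(table, retrieved_chunks, catalog_json, max_context_chars=6000):
    """Assemble a prompt from retrieved documentation chunks and catalog JSON."""
    context_parts = []
    used = 0
    for chunk in retrieved_chunks:
        # Stop adding documentation once the (assumed) character budget is used up
        if used + len(chunk) > max_context_chars:
            break
        context_parts.append(chunk)
        used += len(chunk)
    documentation = "\n---\n".join(context_parts)
    return (
        f"<documentation>\n{documentation}\n</documentation>\n"
        f"Create metadata descriptions for the table called {table}.\n"
        f"<glue_data_catalog>\n{catalog_json}\n</glue_data_catalog>"
    )


prompt_text = build_rag_prompt(
    table="persons",
    retrieved_chunks=["A person is a real person, alive or dead.", "x" * 10_000],
    catalog_json='[{"Name": "persons"}]',
)
print(prompt_text)
```

In this example the second chunk is skipped because it would exceed the budget, so only the first chunk reaches the prompt.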

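Prerequisites

To follow the steps in this post and deploy the solution in your own AWS account, refer to the GitHub repository.

You need the following resources:

An AWS account
Python and boto3
An AWS Identity and Access Management (IAM) role for the AWS Glue crawler that includes the AWSGlueServiceRole policy or equivalent, plus an inline policy with access to the S3 bucket that stores the data used in this post

As part of the environment setup, this post creates a new S3 bucket named aws-gen-ai-glue-metadata-<random_sequence>. The following is an example inline policy: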

{
   "Version": "2012-10-17",
    "Statement": [
        {
          "Effect": "Allow",
          "Action": [
              "s3:GetObject",
              "s3:PutObject"
          ],
          "Resource": [
              "arn:aws:s3:::aws-gen-ai-glue-metadata-*/*"
          ]
        }
    ]
}

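An IAM role for your notebook environment. The IAM role needs appropriate permissions for AWS Glue, Amazon Bedrock, and Amazon S3. The following is an example policy. You can add further conditions to restrict it for your own environment.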

{
      "Version": "2012-10-17",
      "Statement": [
           {
                 "Sid": "GluePermissions",
                 "Effect": "Allow",
                 "Action": [
                      "glue:GetCrawler",
                      "glue:DeleteDatabase",
                      "glue:GetTables",
                      "glue:DeleteCrawler",
                      "glue:StartCrawler",
                      "glue:CreateDatabase",
                      "glue:UpdateTable",
                      "glue:DeleteTable",
                      "glue:UpdateCrawler",
                      "glue:GetTable",
                      "glue:CreateCrawler"
                 ],
                 "Resource": "*"
           },
           {
                 "Sid": "S3Permissions",
                 "Effect": "Allow",
                 "Action": [
                      "s3:PutObject",
                      "s3:GetObject",
                      "s3:CreateBucket",
                      "s3:ListBucket",
                      "s3:DeleteObject",
                      "s3:DeleteBucket"
                 ],
                 "Resource": [
                      "arn:aws:s3:::<bucket_name>",
                      "arn:aws:s3:::<bucket_name>/*"
                 ]
           },
           {
                 "Sid": "IAMPermissions",
                 "Effect": "Allow",
                 "Action": "iam:PassRole",
                 "Resource": "arn:aws:iam::<account_ID>:role/GlueCrawlerRoleBlog"

           },
           {
                 "Sid": "BedrockPermissions",
                 "Effect": "Allow",
                 "Action": "bedrock:InvokeModel",
                 "Resource": [
                      "arn:aws:bedrock:*::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
                      "arn:aws:bedrock:*::foundation-model/amazon.titan-embed-text-v2:0"
                 ]
           }
      ]
}

Model access for Anthropic's Claude 3 and Amazon Titan Text Embeddings V2 on Amazon Bedrock
The notebook how_to_generate_metadata_for_glue_data_catalog_w_bedrock.ipynb

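Set up the resources and environment

With the prerequisites in place, switch to the notebook environment to run the next steps. The notebook's setup steps first create the following resources that the notebook requires:

An S3 bucket
An AWS Glue database
An AWS Glue crawler, which runs automatically and generates the AWS Glue database tables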
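
For readers who want to see what this setup involves, the following is a hedged sketch of creating the database and crawler with the AWS Glue API. It is not the notebook's code. The function name setup_catalog and its parameters are illustrative, and the client is passed in as an argument so the function can be exercised without AWS access; the three client calls (create_database, create_crawler, start_crawler) are standard Boto3 AWS Glue operations.

```python
def setup_catalog(glue_client, database, crawler_name, role_arn, s3_path):
    """Create a Glue database and an S3 crawler, then start the crawler."""
    glue_client.create_database(DatabaseInput={"Name": database})
    glue_client.create_crawler(
        Name=crawler_name,
        Role=role_arn,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    # The crawler runs asynchronously; tables appear once it finishes
    glue_client.start_crawler(Name=crawler_name)


# Usage (requires AWS credentials and the crawler role from the prerequisites;
# the crawler name, role ARN, and bucket below are placeholders):
# import boto3
# setup_catalog(boto3.client("glue"), "legislators", "<crawler_name>",
#               "<crawler_role_arn>", "s3://<your-s3-bucket>/")
```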

After the setup completes, you will have an AWS Glue database called legislators.

The crawler creates the following metadata tables:

persons
memberships
organizations
events
areas
countries

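This is a semi-normalized collection of tables containing legislators and their histories.

Follow the rest of the steps in the notebook to complete the environment setup. It should take only a few minutes.

Inspect the Data Catalog

When the setup is complete, inspect the Data Catalog to review its contents and metadata. On the AWS Glue console, choose Databases in the navigation pane, then open the newly created legislators database. It should contain six tables, as shown in the following screenshot.

You can open any table to inspect the details. The table description and the comment for each column are empty because the AWS Glue crawler doesn't populate them automatically.

You can use the AWS Glue API to programmatically access the technical metadata for each table. The following code snippet, taken from the notebook for this post, uses the AWS Glue API through the AWS SDK for Python (Boto3) to retrieve the tables for a chosen database and print them for validation.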

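from datetime import date, datetime

import boto3

# Assumed setup: in the notebook these are created in earlier cells
glue_client = boto3.client("glue")
database = "legislators"

def get_alltables(database):
    """Return all table definitions in the database, following pagination."""
    tables = []
    get_tables_paginator = glue_client.get_paginator('get_tables')
    for page in get_tables_paginator.paginate(DatabaseName=database):
        tables.extend(page['TableList'])
    return tables

def json_serial(obj):
    """Serialize datetime values, which json.dumps can't handle by default."""
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    raise TypeError("Type %s not serializable" % type(obj))

database_tables = get_alltables(database)

# Print each table with its column names for a quick visual check
for table in database_tables:
    print(f"Table: {table['Name']}")
    print(f"Columns: {[col['Name'] for col in table['StorageDescriptor']['Columns']]}")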
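
Because the crawler leaves table descriptions and column comments empty, it can help to list exactly what is missing before generating anything. The following is a small sketch under that assumption. The helper name find_missing_metadata is hypothetical, and the sample input mimics the shape of the entries in the TableList returned by get_tables.

```python
def find_missing_metadata(tables):
    """Return {table_name: [items lacking a description or comment]}."""
    missing = {}
    for table in tables:
        gaps = []
        if not table.get("Description"):
            gaps.append("<table description>")
        for col in table.get("StorageDescriptor", {}).get("Columns", []):
            if not col.get("Comment"):
                gaps.append(col["Name"])
        if gaps:
            missing[table["Name"]] = gaps
    return missing


# Made-up sample in the same shape as a Data Catalog table definition
sample = [
    {"Name": "persons", "StorageDescriptor": {"Columns": [
        {"Name": "id", "Type": "string"},
        {"Name": "name", "Type": "string", "Comment": "Full name"},
    ]}},
]
print(find_missing_metadata(sample))
```

Running the same function again after the Data Catalog update gives a quick check that every column received a comment.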

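Now that you're familiar with the AWS Glue database and tables, the next step is to use generative AI to generate the table metadata descriptions.

Generate table metadata descriptions with Anthropic's Claude 3 using Amazon Bedrock and LangChain

In this step, you generate technical metadata for a selected table in the AWS Glue database. This post uses the persons table. First, you retrieve all the tables from the Data Catalog and include them as part of the prompt. Although the code aims to generate metadata for a single table, giving the LLM broader information is useful because you want it to detect foreign keys. Install LangChain v0.2.1 in the notebook environment. See the following code: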

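import json
import pprint

from botocore.config import Config
from langchain_aws import ChatBedrock
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

# glue_client, bedrock_client, database, and model_id are assumed to be defined
# during the notebook setup.
# Serialize the full catalog so the LLM can see every table, not only the target one.
glue_data_catalog = json.dumps(get_alltables(database), default=json_serial)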


model_kwargs ={
    "temperature": 0.5, # You can increase or decrease this value depending on the amount of randomness you want injected into the response. A value closer to 1 increases the amount of randomness.
    "top_p": 0.999
}

model = ChatBedrock(
    client = bedrock_client,
    model_id=model_id,
    model_kwargs=model_kwargs
)

table = "persons"
response_get_table = glue_client.get_table( DatabaseName = database, Name = table )
pprint.pp(response_get_table)

user_msg_template_table="""
I'd like you to create metadata descriptions for the table called {table} in your AWS Glue data catalog. Please follow these steps:
1. Review the data catalog carefully
2. Use all the data catalog information to generate the table description
3. If a column is a primary key or foreign key to another table mention it in the description.
4. In your response, reply with the entire JSON object for the table {table}
5. Remove the DatabaseName, CreatedBy, IsRegisteredWithLakeFormation, CatalogId,VersionId,IsMultiDialectView,CreateTime, UpdateTime.
6. Write the table description in the Description attribute
7. List all the table columns under the attribute "StorageDescriptor" and then the attribute Columns. Add Location, InputFormat, and SerdeInfo
8. For each column in the StorageDescriptor, add the attribute "Comment". If a table uses a composite primary key, then the order of a given column in a table’s primary key is listed in parentheses following the column name.
9. Your response must be a valid JSON object.
10. Ensure that the data is accurately represented and properly formatted within the JSON structure. The resulting JSON table should provide a clear, structured overview of the information presented in the original text.
11. If you cannot think of an accurate description of a column, say 'not available'
Here is the data catalog json in <glue_data_catalog></glue_data_catalog> tags.
<glue_data_catalog>
{data_catalog}
</glue_data_catalog>
Here is some additional information about the database in <notes></notes> tags.
<notes>
Typically foreign key columns consist of the name of the table plus the id suffix
</notes>
"""
messages = [
    ("system", "You are a helpful assistant"),
    ("user", user_msg_template_table),
]

prompt = ChatPromptTemplate.from_messages(messages)

chain = prompt | model | StrOutputParser()

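# Invoke the chain. Pass the catalog JSON string itself; wrapping it in braces
# would create a Python set and inject its repr into the prompt.
TableInputFromLLM = chain.invoke({"data_catalog": glue_data_catalog, "table": table})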
print(TableInputFromLLM)

The preceding code instructs the LLM to return a JSON response that fits the TableInput object expected by the Data Catalog update API. The following is an example response:

{
  "Name": "persons",
  "Description": "This table contains information about individual persons, including their names, identifiers, contact details, and other relevant personal data.",
  "StorageDescriptor": {
    "Columns": [
      {
        "Name": "family_name",
        "Type": "string",
        "Comment": "The family name or surname of the person."
      },
      {
        "Name": "name",
        "Type": "string",
        "Comment": "The full name of the person."
      },
      {
        "Name": "links",
        "Type": "array<struct<note:string,url:string>>",
        "Comment": "An array of links related to the person, containing a note and URL."
      },
      {
        "Name": "gender",
        "Type": "string",
        "Comment": "The gender of the person."
      },
      {
        "Name": "image",
        "Type": "string",
        "Comment": "A URL or path to an image of the person."
      },
      {
        "Name": "identifiers",
        "Type": "array<struct<scheme:string,identifier:string>>",
        "Comment": "An array of identifiers for the person, each with a scheme and identifier value."
      },
      {
        "Name": "other_names",
        "Type": "array<struct<lang:string,note:string,name:string>>",
        "Comment": "An array of other names the person may be known by, including the language, a note, and the name itself."
      },

      {
        "Name": "sort_name",
        "Type": "string",
        "Comment": "The name to be used for sorting or alphabetical ordering."
      },
      {
        "Name": "images",
        "Type": "array<struct<url:string>>",
        "Comment": "An array of URLs or paths to additional images of the person."
      },
      {
        "Name": "given_name",
        "Type": "string",
        "Comment": "The given name or first name of the person."
      },
      {
        "Name": "birth_date",
        "Type": "string",
        "Comment": "The date of birth of the person."
      },
      {
        "Name": "id",
        "Type": "string",
        "Comment": "The unique identifier for the person (likely a primary key)."
      },
      {
        "Name": "contact_details",
        "Type": "array<struct<type:string,value:string>>",
        "Comment": "An array of contact details for the person, including the type (e.g., email, phone) and the value."
      },
      {
        "Name": "death_date",
        "Type": "string",
        "Comment": "The date of death of the person, if applicable."
      }
    ],
    "Location": "s3://<your-s3-bucket>/persons/",
    "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
    "SerdeInfo": {
      "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe",
      "Parameters": {
        "paths": "birth_date,contact_details,death_date,family_name,gender,given_name,id,identifiers,image,images,links,name,other_names,sort_name"
      }
    }
  },
  "PartitionKeys": [],
  "TableType": "EXTERNAL_TABLE"
}

You can also validate that the generated JSON conforms to the format expected by the AWS Glue API:

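import json

from jsonschema import validate

# Minimal JSON Schema for the TableInput fields the LLM is asked to return.
# Nested fields must sit under "properties"; otherwise they are ignored.
schema_table_input = {
    "type": "object",
    "properties": {
        "Name": {"type": "string"},
        "Description": {"type": "string"},
        "StorageDescriptor": {
            "type": "object",
            "properties": {
                "Columns": {"type": "array"},
                "Location": {"type": "string"},
                "InputFormat": {"type": "string"},
                "SerdeInfo": {"type": "object"},
            },
        },
    },
}
# Raises jsonschema.ValidationError if the response doesn't match the schema
validate(instance=json.loads(TableInputFromLLM), schema=schema_table_input)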
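
Parsing with json.loads assumes the model returned nothing but JSON. In practice a model may add a sentence before or after the object, in which case parsing fails. The following is a defensive sketch, not part of the original notebook, that pulls the first JSON object out of a longer reply using only the standard library. The helper name extract_json_object and the sample reply are illustrative assumptions.

```python
import json


def extract_json_object(text):
    """Return the first top-level JSON object found in an LLM response."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores any trailing text
            obj, _ = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        start = text.find("{", start + 1)
    raise ValueError("No JSON object found in the response")


reply = (
    'Here is the table definition:\n'
    '{"Name": "persons", "Description": "People"}\n'
    'Let me know if you need changes.'
)
print(extract_json_object(reply))
```

The returned dictionary can then be passed to the schema validation in place of the raw json.loads result.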

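Now that the table and column descriptions have been generated, you can update the Data Catalog.

Update the Data Catalog with the metadata

In this step, you use the AWS Glue API to update the Data Catalog: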

response = glue_client.update_table(DatabaseName=database, TableInput= json.loads(TableInputFromLLM) )
print(f"Table {table} metadata updated!")
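
Passing the LLM output directly to update_table means the model also controls fields such as Location and SerdeInfo. A more conservative alternative, sketched here as an assumption rather than the post's method, is to keep the existing table definition and copy over only the generated description and column comments. The helper name merge_descriptions and the TABLE_INPUT_KEYS subset are illustrative; the key list covers only a subset of the TableInput fields.

```python
import copy

# Subset of TableInput fields carried over from the existing table definition
# (assumed sufficient for this dataset)
TABLE_INPUT_KEYS = ("Name", "Description", "Owner", "Retention",
                    "StorageDescriptor", "PartitionKeys", "TableType", "Parameters")


def merge_descriptions(existing_table, generated):
    """Copy only Description and column Comments from the LLM output
    onto the existing table definition, leaving storage settings untouched."""
    table_input = {k: copy.deepcopy(existing_table[k])
                   for k in TABLE_INPUT_KEYS if k in existing_table}
    if generated.get("Description"):
        table_input["Description"] = generated["Description"]
    comments = {c["Name"]: c.get("Comment")
                for c in generated.get("StorageDescriptor", {}).get("Columns", [])}
    for col in table_input.get("StorageDescriptor", {}).get("Columns", []):
        if comments.get(col["Name"]):
            col["Comment"] = comments[col["Name"]]
    return table_input


# Made-up inputs: `existing` mimics the "Table" entry of a get_table response
existing = {"Name": "persons", "DatabaseName": "legislators",
            "StorageDescriptor": {"Location": "s3://<your-s3-bucket>/persons/",
                                  "Columns": [{"Name": "id", "Type": "string"}]}}
generated = {"Name": "persons", "Description": "People",
             "StorageDescriptor": {"Location": "s3://wrong/",
                                   "Columns": [{"Name": "id", "Type": "string",
                                                "Comment": "Primary key"}]}}
merged = merge_descriptions(existing, generated)
print(merged)
```

The merged dictionary could then be passed as TableInput to glue_client.update_table, so a mistaken Location in the model output never reaches the Data Catalog.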

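The following screenshot shows the persons table metadata with its description.

The following screenshot shows the column descriptions as part of the table metadata.

Now that you have enriched the technical metadata stored in the Data Catalog, you can improve the descriptions further by adding external documentation.

Improve the metadata descriptions by adding external documentation with RAG

In this step, you add external documentation to generate more accurate metadata. The documentation for our dataset is available online as HTML. We use the LangChain HTML loader to load the HTML content: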

from langchain_community.document_loaders import AsyncHtmlLoader

# Use the community HTML loader to load the external documentation, which is published as HTML
urls = ["http://www.popoloproject.com/specs/person.html", "http://docs.everypolitician.org/data_structure.html",'http://www.popoloproject.com/specs/organization.html','http://www.popoloproject.com/specs/membership.html','http://www.popoloproject.com/specs/area.html']
loader = AsyncHtmlLoader(urls)
docs = loader.load()

After you download the documents, split them into chunks and create the embedding model:

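from langchain_aws import BedrockEmbeddings
from langchain_text_splitters import CharacterTextSplitter

# Split on newlines into chunks of about 1,000 characters with a 200-character overlap
text_splitter = CharacterTextSplitter(
    separator='\n',
    chunk_size=1000,
    chunk_overlap=200,
)
split_docs = text_splitter.split_documents(docs)

# Embedding model used to vectorize the chunks
# (bedrock_client and embeddings_model_id are assumed to be set during setup)
embedding_model = BedrockEmbeddings(
    client=bedrock_client,
    model_id=embeddings_model_id
)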
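
To make the chunk_size and chunk_overlap parameters concrete, the following simplified sketch cuts a string into fixed-size character windows where each chunk repeats the tail of the previous one. It is an illustration only: CharacterTextSplitter first splits on the separator and then merges the pieces, so its chunk boundaries will differ. The function name split_with_overlap is hypothetical.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Fixed-size character windows; each chunk repeats the previous chunk's tail."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous chunk
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


text = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(text)
print([len(c) for c in chunks])
# The last 200 characters of one chunk are the first 200 of the next
print(chunks[0][-200:] == chunks[1][:200])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.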

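Next, vectorize the documents, store them locally, and run a similarity search. For production workloads, you can use a managed vector store service such as Amazon OpenSearch Service, or a fully managed solution for implementing the RAG architecture such as Amazon Bedrock Knowledge Bases.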

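from langchain_community.vectorstores import FAISS

# Build an in-memory FAISS index from the embedded chunks
vs = FAISS.from_documents(split_docs, embedding_model)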
search_results = vs.similarity_search(
    'What standards are used in the dataset?', k=2
)
print(search_results[0].page_content)
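
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors are closest to it. The following toy sketch shows that idea with cosine similarity over hand-written three-dimensional vectors. It is an illustration only: real embeddings have many more dimensions, and the metric FAISS actually uses depends on how the index is configured. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors most similar to query_vec."""
    scores = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    ranked = sorted(scores, key=lambda s: s[0], reverse=True)
    return [i for _, i in ranked[:k]]


# Toy vectors standing in for real chunk embeddings
doc_vectors = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
print(top_k([1.0, 0.1, 0.0], doc_vectors, k=2))
```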

Next, include the catalog information along with the documentation to generate more accurate metadata:

from operator import itemgetter
from langchain_core.callbacks import BaseCallbackHandler
from typing import Dict, List, Any


class PromptHandler(BaseCallbackHandler):
    def on_llm_start( self, serialized: Dict[str, Any], prompts: List[str], **kwargs: Any) -> Any:
        output = "\n".join(prompts)
        print(output)

system = "You are a helpful assistant. You do not generate any harmful content."
# specify a user message
user_msg_rag = """
Here is the guidance document you should reference when answering the user:

<documentation>{context}</documentation>
I'd like you to create metadata descriptions for the table called {table} in your AWS Glue data catalog. Please follow these steps:

1. Review the data catalog carefully.
2. Use all the data catalog information and the documentation to generate the table description.
3. If a column is a primary key or foreign key to another table mention it in the description.
4. In your response, reply with the entire JSON object for the table {table}
5. Remove the DatabaseName, CreatedBy, IsRegisteredWithLakeFormation, CatalogId,VersionId,IsMultiDialectView,CreateTime, UpdateTime.
6. Write the table description in the Description attribute. Ensure you use any relevant information from the <documentation>
7. List all the table columns under the attribute "StorageDescriptor" and then the attribute Columns. Add Location, InputFormat, and SerdeInfo
8. For each column in the StorageDescriptor, add the attribute "Comment". If a table uses a composite primary key, then the order of a given column in a table’s primary key is listed in parentheses following the column name.
9. Your response must be a valid JSON object.
10. Ensure that the data is accurately represented and properly formatted within the JSON structure. The resulting JSON table should provide a clear, structured overview of the information presented in the original text.
11. If you cannot think of an accurate description of a column, say 'not available'
<glue_data_catalog>
{data_catalog}
</glue_data_catalog>
Here is some additional information about the database in <notes></notes> tags.
<notes>
Typically foreign key columns consist of the name of the table plus the id suffix
</notes>
"""
messages = [
    ("system", system),
    ("user", user_msg_rag),
]
prompt = ChatPromptTemplate.from_messages(messages)

# Retrieve and Generate
retriever = vs.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3},
)

chain = (  
    {"context": itemgetter("table")| retriever, "data_catalog": itemgetter("data_catalog"), "table": itemgetter("table")}
    | prompt
    | model
    | StrOutputParser()
)

TableInputFromLLM = chain.invoke({"data_catalog":glue_data_catalog, "table":table})
print(TableInputFromLLM)

The following is the response from the LLM:

{
  "Name": "persons",
  "Description": "This table contains information about individual persons, including their names, identifiers, contact details, and other personal information. It follows the Popolo data specification for representing persons involved in government and organizations. The 'person_id' column relates a person to an organization through the 'memberships' table.",
  "StorageDescriptor": {
    "Columns": [
      {
        "Name": "family_name",
        "Type": "string",
        "Comment": "The family or last name of the person."
      },
      {
        "Name": "name",
        "Type": "string",
        "Comment": "The full name of the person."
      },
      {
        "Name": "links",
        "Type": "array<struct<note:string,url:string>>",
        "Comment": "An array of links related to the person, with a note and URL for each link."
      },
      {
        "Name": "gender",
        "Type": "string",
        "Comment": "The gender of the person."
      },
      {
        "Name": "image",
        "Type": "string",
        "Comment": "A URL or path to an image representing the person."
      },
      {
        "Name": "identifiers",
        "Type": "array<struct<scheme:string,identifier:string>>",
        "Comment": "An array of identifiers for the person, with a scheme and identifier value for each."
      },
      {
        "Name": "other_names",
        "Type": "array<struct<lang:string,note:string,name:string>>",
        "Comment": "An array of other names the person may be known by, with language, note, and name for each."
      },
      {
        "Name": "sort_name",
        "Type": "string",
        "Comment": "The name to be used for sorting or alphabetical ordering of the person."
      },
      {
        "Name": "images",
        "Type": "array<struct<url:string>>",
        "Comment": "An array of URLs or paths to additional images representing the person."
      },
      {
        "Name": "given_name",
        "Type": "string",
        "Comment": "The given or first name of the person."
      },
      {
        "Name": "birth_date",
        "Type": "string",
        "Comment": "The date of birth of the person."
      },
      {
        "Name": "id",
        "Type": "string",
        "Comment": "The unique identifier for the person. This is likely a primary key."
      },
      {
        "Name": "contact_details",
        "Type": "array<struct<type:string,value:string>>",
        "Comment": "An array of contact details for the person, with a type and value for each."
      },
      {
        "Name": "death_date",
        "Type": "string",
        "Comment": "The date of death of the person, if applicable."
      }
    ],
    "Location": "s3://<your-s3-bucket>/persons/",
    "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
    "SerdeInfo": {
      "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
    }
  }
}

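As with the first approach, you can validate the output to confirm that it conforms to the AWS Glue API.

Update the Data Catalog with the new metadata

Now that the metadata has been generated, you can update the Data Catalog: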

response = glue_client.update_table(DatabaseName=database, TableInput= json.loads(TableInputFromLLM) )
print(f"Table {table} metadata updated!")

Review the generated technical metadata. You should see a new version of the persons table in the Data Catalog. You can access the schema versions on the AWS Glue console.

Check the description of the persons table again. It should differ slightly from the description generated earlier:

In-context learning table description: “This table contains information about persons, including their names, identifiers, contact details, birth and death dates, and associated images and links. The ‘id’ column is the primary key for this table.”
RAG table description: “This table contains information about individual persons, including their names, identifiers, contact details, and other personal information. It follows the Popolo data specification for representing persons involved in government and organizations. The ‘person_id’ column relates a person to an organization through the ‘memberships’ table.”

The LLM demonstrated knowledge of the Popolo specification, which was part of the documentation provided to the LLM.

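Clean up

When you have completed the steps in this post, remember to delete the resources by using the code provided in the Clean Up section of the notebook so that you don't incur unnecessary costs.

Conclusion

In this post, we explored how to use generative AI, specifically Amazon Bedrock FMs, to enrich the Data Catalog with dynamic metadata and improve the discoverability and understanding of your existing data assets. The two approaches we demonstrated, in-context learning and RAG, show the flexibility and versatility of this solution. In-context learning works well for AWS Glue databases with a small number of tables, whereas the RAG approach uses external documentation to generate more accurate and detailed metadata, which makes it suitable for larger and more complex data landscapes. By implementing this solution, you can unlock new levels of data intelligence, enabling your organization to make more informed decisions, drive data-driven innovation, and get the full value of its data. We encourage you to explore the resources and recommendations provided in this post to strengthen your data management practices.

About the authors

Manos Samatas is a Principal Solutions Architect in Data and AI at AWS. Manos works with government, non-profit, education, and healthcare customers in the UK on data and AI projects, helping them build solutions using AWS. Manos lives in London and spends spare time reading, watching sports, playing video games, and socializing with friends.

Anastasia Tzeveleka is a Senior GenAI/ML Specialist Solutions Architect at AWS. As part of her work, she helps customers across EMEA build foundation models (FMs) and create scalable generative AI and machine learning solutions using AWS services.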
