df['text'].iloc[2215]
'El 86% de las empresas españolas comprometidas con los Objetivos de Desarrollo
Sostenible comprometidas con los Objetivos de Desarrollo Sostenible comprometidas
con los Objetivos de Desarrollo Sostenible comprometidas con los Objetivos de
Desarrollo Sostenible'
import re

# Ensure there is no repeated text in the titles
def remove_duplicates(text):
    return re.sub(r'((\b\w+\b.{1,2}\w+\b)+).+\1', r'\1', text, flags=re.I)
df['text'] = df['text'].apply(remove_duplicates)
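To see what the pattern does in practice, here is a minimal sketch that applies the same regular expression to three made-up titles (the sample strings are illustrative and not taken from the dataset). The substitution keeps the first copy of a repeated phrase and drops both the second copy and whatever sits between them, so it suits back-to-back repetitions like the one shown above but can discard genuine text when the copies are far apart.

```python
import re

def remove_duplicates(text):
    # Same pattern as above: capture a run of words, then anything, then the same run again.
    return re.sub(r'((\b\w+\b.{1,2}\w+\b)+).+\1', r'\1', text, flags=re.I)

# Illustrative inputs (assumptions, not rows from the dataset).
repeated = "big news today big news today"
clean = "no repeats here"
interrupted = "alpha beta gamma alpha beta"

print(remove_duplicates(repeated))     # big news today
print(remove_duplicates(clean))        # no repeats here
print(remove_duplicates(interrupted))  # alpha beta  ("gamma" is dropped too)
```

The last case is worth checking against your own data before applying the function to a whole column.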
# Keep only the selected languages
languages = ['English', 'Spanish', 'Danish']
df = df.loc[df['lang'].isin(languages)].copy()  # copy so later column assignments do not hit a view
# Select the 80 longest articles
df['text_length'] = df['text'].str.len()
df.sort_values(by=['text_length'], ascending=False, inplace=True)
top_80_df = df[:80]
# Language distribution
top_80_df['lang'].value_counts()
Our list of documents is well distributed across the three languages:
lang
Spanish 33
English 29
Danish 18
Name: count, dtype: int64
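As a quick check on how even that split is, the counts can be turned into shares. This is a small sketch in plain Python; the dictionary simply restates the output above, and with pandas the same figures would come from `value_counts(normalize=True)`.

```python
# Counts copied from the value_counts() output above.
counts = {"Spanish": 33, "English": 29, "Danish": 18}

total = sum(counts.values())
shares = {lang: n / total for lang, n in counts.items()}

for lang, share in shares.items():
    print(f"{lang}: {share:.2%}")
```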
Here is the longest article title in our dataset:
top_80_df['text'].iloc[0]
"CFOdirect: Resultater fra PwC's Employee Engagement Landscape Survey, herunder hvordan
man skaber mere engagement blandt medarbejdere. Læs desuden om de regnskabsmæssige
konsekvenser for indkomstskat ifbm. Brexit"
queries = [
    "Are businesses meeting sustainability goals?",
    "Can data science help meet sustainability goals?"
]
for query in queries:
    retrieval(query)
Here are the results:
QUERY: ARE BUSINESSES MEETING SUSTAINABILITY GOALS?
ORIGINAL (English): Quality of business reporting on the Sustainable Development Goals
improves, but has a long way to go to meet and drive targets.
----
ORIGINAL (English): Only 10 years to achieve Sustainable Development Goals but
businesses remain on starting blocks for integration and progress
----
ORIGINAL (Spanish): Integrar los criterios ESG y el propósito en la estrategia
principal reto de los Consejos de las empresas españolas en el mundo post-COVID
TRANSLATION: Integrate ESG criteria and purpose into the main challenge strategy
of the Boards of Spanish companies in the post-COVID world
----
END OF RESULTS
QUERY: CAN DATA SCIENCE HELP MEET SUSTAINABILITY GOALS?
ORIGINAL (English): Using AI to better manage the environment could reduce greenhouse
gas emissions, boost global GDP by up to 38m jobs by 2030
----
ORIGINAL (English): Quality of business reporting on the Sustainable Development Goals
improves, but has a long way to go to meet and drive targets.
----
ORIGINAL (English): Only 10 years to achieve Sustainable Development Goals but
businesses remain on starting blocks for integration and progress
----
END OF RESULTS
query = "Hvor kan jeg finde den seneste danske boligplan?" # "Where can I find the latest Danish property plan?"
retrieved_docs, translated_retrieved_docs = retrieval(query)
QUERY: HVOR KAN JEG FINDE DEN SENESTE DANSKE BOLIGPLAN?
ORIGINAL (Danish): Nyt fra CFOdirect: Ny PP&E-guide, FAQs om den nye leasingstandard,
podcast om udfordringerne ved implementering af leasingstandarden og meget mere
TRANSLATION: New from CFOdirect: New PP&E guide, FAQs on the new leasing standard,
podcast on the challenges of implementing the leasing standard and much more
----
ORIGINAL (Danish): Lovforslag fremlagt om rentefri lån, udskudt frist for
lønsumsafgift, førtidig udbetaling af skattekredit og loft på indestående på
skattekontoen
TRANSLATION: Legislative proposal presented on interest-free loans, deferred payroll
tax deadline, early payment of tax credit and ceiling on deposits in the tax account
----
ORIGINAL (Danish): Nyt fra CFOdirect: Shareholder-spørgsmål til ledelsen, SEC
cybersikkerhedsguide, den amerikanske skattereform og meget mere
TRANSLATION: New from CFOdirect: Shareholder questions for management, the SEC
cybersecurity guide, US tax reform and more
----
END OF RESULTS
In the example above, the English acronym "PP&E" stands for "property, plant and equipment", and our model was able to connect it to our query.
query = "Are companies ready for the next down market?"
retrieved_docs, translated_retrieved_docs = retrieval(query)
QUERY: ARE COMPANIES READY FOR THE NEXT DOWN MARKET?
ORIGINAL (Spanish): El valor en bolsa de las 100 mayores empresas cotizadas cae un 15%
entre enero y marzo pero aguanta el embate del COVID-19
TRANSLATION: The stock market value of the 100 largest listed companies falls 15%
between January and March but withstands the onslaught of COVID-19
----
ORIGINAL (English): 69% of business leaders have experienced a corporate crisis in the
last five years yet 29% of companies have no staff dedicated to crisis preparedness
----
ORIGINAL (English): As work sites slowly start to reopen, CFOs are concerned about the
global economy and a potential new COVID-19 wave - PwC survey
----
END OF RESULTS
# Map model package ARNs
import boto3
import cohere_aws
cohere_package = "cohere-rerank-multilingual-v2--8b26a507962f3adb98ea9ac44cb70be1" # Replace with your own information
model_package_map = {
"us-east-1": f"arn:aws:sagemaker:us-east-1:865070037744:model-package/{cohere_package}",
"us-east-2": f"arn:aws:sagemaker:us-east-2:057799348421:model-package/{cohere_package}",
"us-west-1": f"arn:aws:sagemaker:us-west-1:382657785993:model-package/{cohere_package}",
"us-west-2": f"arn:aws:sagemaker:us-west-2:594846645681:model-package/{cohere_package}",
"ca-central-1": f"arn:aws:sagemaker:ca-central-1:470592106596:model-package/{cohere_package}",
"eu-central-1": f"arn:aws:sagemaker:eu-central-1:446921602837:model-package/{cohere_package}",
"eu-west-1": f"arn:aws:sagemaker:eu-west-1:985815980388:model-package/{cohere_package}",
"eu-west-2": f"arn:aws:sagemaker:eu-west-2:856760150666:model-package/{cohere_package}",
"eu-west-3": f"arn:aws:sagemaker:eu-west-3:843114510376:model-package/{cohere_package}",
"eu-north-1": f"arn:aws:sagemaker:eu-north-1:136758871317:model-package/{cohere_package}",
"ap-southeast-1": f"arn:aws:sagemaker:ap-southeast-1:192199979996:model-package/{cohere_package}",
"ap-southeast-2": f"arn:aws:sagemaker:ap-southeast-2:666831318237:model-package/{cohere_package}",
"ap-northeast-2": f"arn:aws:sagemaker:ap-northeast-2:745090734665:model-package/{cohere_package}",
"ap-northeast-1": f"arn:aws:sagemaker:ap-northeast-1:977537786026:model-package/{cohere_package}",
"ap-south-1": f"arn:aws:sagemaker:ap-south-1:077584701553:model-package/{cohere_package}",
"sa-east-1": f"arn:aws:sagemaker:sa-east-1:270155090741:model-package/{cohere_package}",
}
region = boto3.Session().region_name
if region not in model_package_map:
    raise Exception(f"Current boto3 session region {region} is not supported.")
model_package_arn = model_package_map[region]
co = cohere_aws.Client(region_name=region)
co.create_endpoint(
    arn=model_package_arn,
    endpoint_name="cohere-rerank-multilingual",
    instance_type="ml.g4dn.xlarge",
    n_instances=1,
)
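Because every entry in the map follows the same ARN template, the dictionary could also be generated from a region-to-account lookup. The sketch below is one possible way to do that, not the approach used above; `ACCOUNT_IDS` and `build_package_arn` are hypothetical names, and only three of the regions listed above are included to keep it short.

```python
# Hypothetical lookup table: region -> AWS account ID hosting the model package.
# The three IDs are copied from the map above; extend the table as needed.
ACCOUNT_IDS = {
    "us-east-1": "865070037744",
    "eu-west-1": "985815980388",
    "ap-northeast-1": "977537786026",
}

def build_package_arn(region, package):
    """Return the model package ARN for `region`, or raise if it is not in the table."""
    if region not in ACCOUNT_IDS:
        raise ValueError(f"Region {region} is not supported.")
    return f"arn:aws:sagemaker:{region}:{ACCOUNT_IDS[region]}:model-package/{package}"

cohere_package = "cohere-rerank-multilingual-v2--8b26a507962f3adb98ea9ac44cb70be1"
print(build_package_arn("eu-west-1", cohere_package))
```

Keeping the template in one place means a change to the package name or ARN layout only has to be made once.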
When we pass the documents to Rerank, the model is able to accurately pick the most relevant one:
results = co.rerank(query=query, documents=retrieved_docs, top_n=1)
for hit in results:
    print(hit.document['text'])
69% of business leaders have experienced a corporate crisis in the last five years yet
29% of companies have no staff dedicated to crisis preparedness