AWS Big Data Blog
Category: Learning Levels
Monitor Apache HBase on Amazon EMR using Amazon Managed Service for Prometheus and Amazon Managed Grafana
Amazon EMR provides a managed Apache Hadoop framework that makes it straightforward, fast, and cost-effective to run Apache HBase. Apache HBase is a massively scalable, distributed big data store in the Apache Hadoop ecosystem. It is an open-source, non-relational, versioned database that runs on top of the Apache Hadoop Distributed File System (HDFS). It’s built […]
How OLX Group migrated to Amazon Redshift RA3 for simpler, faster, and more cost-effective analytics
This is a guest post by Miguel Chin, Data Engineering Manager at OLX Group and David Greenshtein, Specialist Solutions Architect for Analytics, AWS. OLX Group is one of the world’s fastest-growing networks of online marketplaces, operating in over 30 countries around the world. We help people buy and sell cars, find housing, get jobs, buy […]
Synchronize your Salesforce and Snowflake data to speed up your time to insight with Amazon AppFlow
This post was co-written with Amit Shah, Principal Consultant at Atos. Customers across industries seek meaningful insights from the data captured in their Customer Relationship Management (CRM) systems. To achieve this, they combine their CRM data with a wealth of information already available in their data warehouse, enterprise systems, or other software as a service […]
Use fuzzy string matching to approximate duplicate records in Amazon Redshift
It’s common to ingest multiple data sources into Amazon Redshift to perform analytics. Often, each data source has its own processes for creating and maintaining data, which can lead to data quality challenges within and across sources. One challenge you may face when performing analytics is the presence of imperfect duplicate records within the source data. This post presents one possible approach to addressing this challenge in an Amazon Redshift data warehouse using fuzzy matching.
Automate schema evolution at scale with Apache Hudi in AWS Glue
In the data analytics space, organizations often deal with many tables in different databases and file formats to hold data for different business functions. Business needs often drive table structure, such as schema evolution (the addition of new columns, removal of existing columns, update of column names, and so on) for some of these tables […]
Improve observability across Amazon MWAA tasks
Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that makes it simple to set up and operate end-to-end data pipelines in the cloud at scale. A data pipeline is a set of tasks and processes used to automate the movement and transformation of data between different systems. […]
Analyze Amazon S3 storage costs using AWS Cost and Usage Reports, Amazon S3 Inventory, and Amazon Athena
Since its launch in 2006, Amazon Simple Storage Service (Amazon S3) has experienced major growth, supporting multiple use cases such as hosting websites, creating data lakes, serving as object storage for consumer applications, storing logs, and archiving data. As the application portfolio grows, customers tend to store data from multiple applications and different business functions […]
How Amazon Devices scaled and optimized real-time demand and supply forecasts using serverless analytics
Every day, Amazon devices process and analyze billions of transactions from global shipping, inventory, capacity, supply, sales, marketing, producers, and customer service teams. This data is used in procuring devices’ inventory to meet Amazon customers’ demands. With data volumes exhibiting a double-digit percentage growth rate year on year and the COVID pandemic disrupting global logistics […]
Amazon EMR launches support for Amazon EC2 C7g (Graviton3) instances to improve cost performance for Spark workloads by 7–13%
Amazon EMR provides a managed service to easily run analytics applications using open-source frameworks such as Apache Spark, Hive, Presto, Trino, HBase, and Flink. The Amazon EMR runtime for Spark and Presto includes optimizations that deliver more than twice the performance of open-source Apache Spark and Presto. With Amazon EMR release 6.7, you can […]
AWS Lake Formation 2022 year in review
Data governance is the collection of policies, processes, and systems that organizations use to ensure the quality and appropriate handling of their data throughout its lifecycle for the purpose of generating business value. Data governance is increasingly top-of-mind for customers as they recognize data as one of their most important assets. Effective data governance enables […]