AWS Big Data Blog

Category: Learning Levels

New Amazon CloudWatch log class to cost-effectively scale your AWS Glue workloads

AWS Glue is a serverless data integration service that makes it easier to discover, prepare, and combine data for analytics, machine learning (ML), and application development. You can use AWS Glue to create, run, and monitor data integration and ETL (extract, transform, and load) pipelines and catalog your assets across multiple data stores. One of […]

Build efficient ETL pipelines with AWS Step Functions distributed map and redrive feature

AWS Step Functions is a fully managed visual workflow service that enables you to build complex data processing pipelines involving a diverse set of extract, transform, and load (ETL) technologies such as AWS Glue, Amazon EMR, and Amazon Redshift. You can visually build the workflow by wiring individual data pipeline tasks and configuring payloads, retries, […]

Automatically detect Personally Identifiable Information in Amazon Redshift using AWS Glue

With the exponential growth of data, companies are handling huge volumes and a wide variety of data including personally identifiable information (PII). PII is a legal term pertaining to information that can identify, contact, or locate a single person. Identifying and protecting sensitive data at scale has become increasingly complex, expensive, and time-consuming. Organizations have […]

Architecture Diagram

Federate IAM-based single sign-on to Amazon Redshift role-based access control with Okta

Amazon Redshift accelerates your time to insights with fast, easy, and secure cloud data warehousing at scale. Tens of thousands of customers rely on Amazon Redshift to analyze exabytes of data and run complex analytical queries. You can use your preferred SQL clients to analyze your data in an Amazon Redshift data warehouse. Connect seamlessly by […]

AWS Clean Rooms proof of concept scoping part 1: media measurement

In this post, we outline planning a POC to measure media effectiveness in a paid advertising campaign. The collaborators are a media owner (“CTV.Co,” a connected TV provider) and brand advertiser (“Coffee.Co,” a quick service restaurant company), that are analyzing their collective data to understand the impact on sales as a result of an advertising campaign.

Build and manage your modern data stack using dbt and AWS Glue through dbt-glue, the new “trusted” dbt adapter

dbt is an open source, SQL-first templating engine that allows you to write repeatable and extensible data transforms in Python and SQL. dbt focuses on the transform layer of extract, load, transform (ELT) or extract, transform, load (ETL) processes across data warehouses and databases through specific engine adapters to achieve extract and load functionality. It […]

Improve performance of workloads containing repetitive scan filters with multidimensional data layout sort keys in Amazon Redshift

Amazon Redshift, a widely used cloud data warehouse, has evolved significantly to meet the performance requirements of the most demanding workloads. This post covers one such new feature—the multidimensional data layout sort key. Amazon Redshift now improves your query performance by supporting multidimensional data layout sort keys, which is a new type of sort key […]

Amazon MSK now provides up to 29% more throughput and up to 24% lower costs with AWS Graviton3 support

Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed service that enables you to build and run applications that use Apache Kafka to process streaming data. Today, we’re excited to bring the benefits of Graviton3 to Kafka workloads, with Amazon MSK now offering M7g instances for new MSK provisioned clusters. AWS Graviton […]

Enhance query performance using AWS Glue Data Catalog column-level statistics

Today, we’re making available a new capability of AWS Glue Data Catalog that allows generating column-level statistics for AWS Glue tables. These statistics are now integrated with the cost-based optimizers (CBO) of Amazon Athena and Amazon Redshift Spectrum, resulting in improved query performance and potential cost savings. Data lakes are designed for storing vast amounts […]

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics

For any modern data-driven company, having smooth data integration pipelines is crucial. These pipelines pull data from various sources, transform it, and load it into destination systems for analytics and reporting. When running properly, it provides timely and trustworthy information. However, without vigilance, the varying data volumes, characteristics, and application behavior can cause data pipelines […]