AWS Big Data Blog
Category: Learning Levels
Simplify Amazon Redshift monitoring using the new unified SYS views
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud, providing up to five times better price-performance than any other cloud data warehouse, with performance innovation out of the box at no additional cost to you. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to […]
Build multi-layer maps in Amazon OpenSearch Service
With the release of Amazon OpenSearch Service 2.5, you can create maps with multiple layers to visualize your geographical data. You can build each layer from a different index pattern to separate data sources. Organizing the map in layers makes it more straightforward to visualize, view, and analyze geographical data. The layering also helps fetch […]
Run Spark SQL on Amazon Athena Spark
At AWS re:Invent 2022, Amazon Athena launched support for Apache Spark. With this launch, Amazon Athena supports two open-source query engines: Apache Spark and Trino. Athena Spark allows you to build Apache Spark applications using a simplified notebook experience on the Athena console or through Athena APIs. Athena Spark notebooks support PySpark and notebook magics […]
Resolve private DNS hostnames for Amazon MSK Connect
Amazon MSK Connect is a feature of Amazon Managed Streaming for Apache Kafka (Amazon MSK) that offers a fully managed Apache Kafka Connect environment on AWS. With MSK Connect, you can deploy fully managed connectors built for Kafka Connect that move data into or pull data from popular data stores like Amazon S3 and Amazon […]
SmugMug’s durable search pipelines for Amazon OpenSearch Service
SmugMug operates two very large online photo platforms, SmugMug and Flickr, enabling more than 100 million customers to safely store, search, share, and sell tens of billions of photos. Customers uploading and searching through decades of photos helped turn search into critical infrastructure, growing steadily since SmugMug first used Amazon CloudSearch in 2012, followed by […]
Load data incrementally from transactional data lakes to data warehouses
Data lakes and data warehouses are two of the most important data storage and management technologies in a modern data architecture. Data lakes store all of an organization’s data, regardless of its format or structure. An open table format such as Apache Hudi, Delta Lake, or Apache Iceberg is widely used to build data lakes […]
Enhance your security posture by storing Amazon Redshift admin credentials without human intervention using AWS Secrets Manager integration
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. Today, tens of thousands of AWS customers—from Fortune 500 companies, startups, and everything in between—use Amazon Redshift to run mission-critical business intelligence (BI) dashboards, […]
Run Apache Hive workloads using Spark SQL with Amazon EMR on EKS
Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale. Using Spark SQL to run Hive workloads provides not only the simplicity of SQL-like queries but also taps into the exceptional speed and performance provided by Spark. Spark SQL is an Apache Spark module for structured data processing. One […]
Unleash the power of Snapshot Management to take automated snapshots using Amazon OpenSearch Service
Snapshot Management helps you create point-in-time backups of your domain using OpenSearch Dashboards, including both data and configuration settings (for visualizations and dashboards). You can use these snapshots to restore your cluster to a specific state, recover from potential failures, and even clone environments for testing or development purposes. In this post, we share how to use Snapshot Management to take automated snapshots using OpenSearch Service.
Processing large records with Amazon Kinesis Data Streams
In this post, we show you some different options for handling large records within Kinesis Data Streams and the benefits and disadvantages of each approach. We provide some sample code for each option to help you get started with any of these approaches with your own workloads.