AWS Big Data Blog
Category: Amazon Simple Storage Service (S3)
Turning Amazon EMR into a Massive Amazon S3 Processing Engine with Campanile
Michael Wallman is a senior consultant with AWS ProServ. Have you ever had to copy a huge Amazon S3 bucket to another account or region? Or create a list based on object name or size? How about mapping a function over millions of objects? Amazon EMR to the rescue! EMR allows you to deploy large […]
Integrating Amazon Kinesis, Amazon S3 and Amazon Redshift with Cascading on Amazon EMR
This is a guest post by Ryan Desmond, Solutions Architect at Concurrent. Concurrent is an AWS Advanced Technology Partner. With Amazon Kinesis, developers can quickly store, collate, and access large, distributed data streams such as access logs, click streams, and IoT data in real time. The question then becomes, how can we access and leverage this […]
Building and Maintaining an Amazon S3 Metadata Index without Servers
Mike Deck is a Solutions Architect with AWS. Amazon S3 is a simple key-based object store whose scalability and low cost make it ideal for storing large datasets. Its design enables S3 to provide excellent performance for storing and retrieving objects based on a known key. Finding objects based on other attributes, however, requires doing […]
Building Scalable and Responsive Big Data Interfaces with AWS Lambda
This is a guest post by Martin Holste, a co-founder of the Threat Analytics Platform at FireEye, where he is a senior researcher specializing in prototypes. Overview: At FireEye, Inc., we process billions of security events every day with our Threat Analytics Platform, running on AWS. In building our platform, one of the problems we […]
How Expedia Implemented Near Real-time Analysis of Interdependent Datasets
This is a guest post by Stephen Verstraete, a manager at Pariveda Solutions. Pariveda Solutions is an AWS Premier Consulting Partner. Common patterns exist for batch processing and real-time processing of Big Data. However, we haven’t seen patterns that allow us to process batches of dependent data in real time. Expedia’s marketing group needed to analyze […]
Nasdaq’s Architecture using Amazon EMR and Amazon S3 for Ad Hoc Access to a Massive Data Set
This is a guest post by Nate Sammons, a Principal Architect for Nasdaq. The Nasdaq group of companies operates financial exchanges around the world and processes large volumes of data every day. We run a wide variety of analytic and surveillance systems, all of which require access to essentially the same data sets. The Nasdaq […]
Using AWS for Multi-instance, Multi-part Uploads
James Saull is a Principal Solutions Architect with AWS. There are many advantages to using multi-part, multi-instance uploads for large files. First, the throughput is improved because you can upload parts in parallel. Amazon Simple Storage Service (Amazon S3) can store files up to 5 TB, yet a single machine with a 1 Gbps interface would take […]
Moving Big Data Into The Cloud with ExpeDat Gateway for Amazon S3
Matt Yanchyshyn is a Principal Solutions Architect with Amazon Web Services. Introduction: A previous blog post (Moving Big Data Into the Cloud with Tsunami UDP) discussed how Tsunami UDP is a fast and easy way to move large amounts of data to and from AWS. Specifically, we showed how you can use it to move […]
Moving Big Data into the Cloud with Tsunami UDP
Matt Yanchyshyn is a Principal Solutions Architect with Amazon Web Services. AWS Solutions Architect Leo Zhadanovsky also contributed to this post. Introduction: One of the biggest challenges facing companies that want to leverage the scale and elasticity of AWS for analytics is how to move their data into the cloud. It’s increasingly common to have […]