AWS Big Data Blog
Sharpen your Skill Set with Apache Spark on the AWS Big Data Blog
The AWS Big Data Blog has a large community of authors who are passionate about Apache Spark and who regularly publish content that helps customers use Spark to build real-world solutions. You’ll see content on a variety of topics, including deep dives on Spark’s internals, building Spark Streaming applications, creating machine learning pipelines using MLlib, and ways to apply Spark to various real-world use cases. You can learn hands-on by building distributed applications from the blog’s code samples and running them directly against data in Amazon S3, and you can run Spark on Amazon EMR to enable fast experimentation and quick production deployments.
The latest releases of Spark are supported within a few weeks of Apache general availability (Spark 1.6.1 was included in EMR 4.5 last week). By default, Spark on EMR uses dynamic allocation of executors to make efficient use of available resources. It can use EMRFS to efficiently query data in Amazon S3, and it works with interactive notebooks when you also install Apache Zeppelin on your cluster.
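EMR turns dynamic allocation on for you, but it can help to see which settings are involved, for example if you want to bound the number of executors for a particular job. The following is a minimal sketch, not an official EMR recipe: it only assembles a `spark-submit` command line using standard Spark configuration properties. The helper name `build_spark_submit`, the script name `wordcount.py`, and the executor limits are hypothetical values chosen for illustration.

```python
import shlex


def build_spark_submit(app_path, min_executors=1, max_executors=10):
    """Return a spark-submit argument list with dynamic allocation set explicitly.

    Hypothetical helper for illustration; the property names are standard
    Spark configuration keys, the limits are example values.
    """
    conf = {
        "spark.dynamicAllocation.enabled": "true",
        # Dynamic allocation on YARN relies on the external shuffle service.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }
    args = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app_path)  # hypothetical application script
    return args


if __name__ == "__main__":
    # Print a shell-safe command line that could be run on the master node.
    print(shlex.join(build_spark_submit("wordcount.py", max_executors=8)))
```

Because the function returns a plain list of arguments, you can print it for review, as above, or pass it to `subprocess.run` on a node where `spark-submit` is available.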
Below are recent posts that focus on Spark:
- Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning
- Real-time Stream Processing Using Apache Spark Streaming and Apache Kafka on AWS
- Use Spark 2.0, Hive 2.1 on Tez, and the latest from the Hadoop ecosystem on Amazon EMR
- Installing and Running JobServer for Apache Spark on Amazon EMR
- Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
- Analyze Realtime Data from Amazon Kinesis Streams Using Zeppelin and Spark Streaming
- Use Apache Oozie Workflows to Automate Apache Spark Jobs (and more!) on Amazon EMR
- Will Spark Power the Data behind Precision Medicine?
- Using Spark SQL for ETL
- Using Python 3.4 on EMR Spark Applications
- Crunching Statistics at Scale with SparkR on Amazon EMR
- Anomaly Detection Using PySpark, Hive, and Hue on Amazon EMR
- Submitting User Applications with spark-submit
- Optimize Spark-Streaming to Efficiently Process Amazon Kinesis Streams
- Analyze Your Data on Amazon DynamoDB with Apache Spark
- Running an External Zeppelin Instance using S3 Backed Notebooks with Spark on Amazon EMR
- Large-Scale Machine Learning with Spark on Amazon EMR
- Querying Amazon Kinesis Streams Directly with SQL and Spark Streaming
We hope these posts help you learn more about the Spark ecosystem and show you ways to use these technologies on AWS to derive value from your data. With new posts coming out every week, stay tuned for new Spark use cases and examples!
Please let us know in the comments below if you’d like us to cover specific Spark-related topics. If you have questions about Spark on EMR, please email us at emr-help@amazon.com and we’ll get back to you right away.