Master Reading Spark Query Plans

Spark Performance Tuning

Dive deep into Apache Spark query plans to better understand how Spark operates under the hood. We'll cover how Spark creates logical and physical plans, and the role of the Catalyst Optimizer in applying optimization techniques such as filter (predicate) pushdown and projection pushdown.

The video covers intermediate Apache Spark concepts in depth, with detailed explanations of how to read the Spark UI and how to understand Spark's query plans through code snippets of various narrow and wide transformations: reading files, select, filter, join, group by, repartition, coalesce, hash partitioning, HashAggregate, round-robin partitioning, range partitioning, and sort-merge join. Understanding these will give you a grasp of Spark's step-by-step reasoning and help you identify performance issues and possible optimizations.

📄 Complete code on GitHub: https://github.com/afaqueahmad7117/spark-experiments/blob/main/spark/2_reading_query_plans.ipynb
🎥 Full Spark Performance Tuning playlist: https://www.youtube.com/playlist?list=PLWAuYt0wgRcLCtWzUxNg4BjnYlCZNEVth
🔗 LinkedIn: https://www.linkedin.com/in/afaque-ahmad-5a5847129

Chapters:
00:00 Introduction
01:30 How Spark generates logical and physical plans
04:46 Narrow transformations (filter, select, add or update columns) query plan explanation
09:02 Repartition query plan explanation
12:57 Coalesce query plan explanation
17:32 Joins query plan explanation
23:23 Group by count query plan explanation
27:04 Group by sum query plan explanation
28:05 Group by count distinct query plan explanation
33:59 Interesting observations on Spark's query plans
36:56 When will predicate pushdown not work?
39:07 Thank you

#ApacheSpark #SparkPerformanceTuning #DataEngineering #SparkDAG #SparkOptimization #dataengineering #interviewquestions #azuredataengineer