Broadcast Joins & AQE (Adaptive Query Execution)

Spark Performance Tuning

Welcome back to another Apache Spark tutorial! In this hands-on Apache Spark performance optimization tutorial, we dive deep into techniques for fixing data skew, focusing on Adaptive Query Execution (AQE) and broadcast joins.

AQE, a feature introduced in Spark 3.0, uses runtime statistics to select the most efficient query plan, optimizing shuffle partitions, joins, and skewed joins. We will discuss how Spark coalesces partitions, converts sort-merge joins into broadcast joins, and splits larger partitions into smaller ones to optimize skewed joins.

We will walk through the Spark documentation to understand the properties that need to be set to true for Spark to dynamically handle skew in a sort-merge join. Then we will look at an example joining two datasets, transaction and customer, to analyze how the join looks with and without AQE.

By the end of this video, you will have a solid understanding of AQE, how to optimize skewed joins, and how to set up a Spark session to handle data skew.

Key Takeaways:
Understanding Adaptive Query Execution (AQE) and its benefits.
How to optimize shuffle partitions and joins using AQE.
Setting up a Spark session and properties to handle data skew dynamically.
Analyzing the distribution of data and identifying skewed partitions.
Comparing the performance of sort-merge join with and without AQE.

📄 Complete Code on GitHub: https://github.com/afaqueahmad7117/spark-experiments/blob/main/spark/1_data_skew/3_solving_data_skew_aqe_broadcast.ipynb
🎥 Full Spark Performance Tuning Playlist: https://www.youtube.com/playlist?list=PLWAuYt0wgRcLCtWzUxNg4BjnYlCZNEVth
🔗 LinkedIn: https://www.linkedin.com/in/afaque-ahmad-5a5847129/

Chapters:
00:00 Introduction
00:35 What is AQE?
04:25 Sort-Merge Join of Customer & Transaction Dataset
06:00 Spark UI showing Data Skew
06:41 Join of Customer & Transaction Dataset (AQE enabled)
07:04 Code + Spark UI - Comparing Join Performance (with & without AQE)
10:52 Broadcast Join
11:18 Internal Working of Sort-Merge Join
13:12 Concept of Hash Partitioning
14:47 Sort-Merge Join example
17:12 Broadcast Join example
19:44 Code for Broadcast Join fixing Data Skew

#DataEngineering #AdaptiveQueryExecution #DataSkew #BroadcastJoin #spark #apachespark #sparkperformancetuning #interviewquestions #dataengineerinterviewquestions #azuredataengineer #dataanalystinterview
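
For readers who want a starting point before opening the notebook, here is a minimal PySpark sketch of the two ideas in the description: a Spark session with the AQE skew-handling properties switched on, and an explicit broadcast join. It is not taken from the video or the linked notebook. The property keys are standard Spark SQL settings and the values shown are the documented defaults, but the helper names (`AQE_SKEW_CONF`, `build_session`, `join_with_broadcast`), the app name, and the `cust_id` join column are assumptions made for illustration.

```python
# Illustrative sketch only: helper names and the join column are hypothetical.
# The property keys below are standard Spark SQL configuration settings.
AQE_SKEW_CONF = {
    # Master switch for Adaptive Query Execution.
    "spark.sql.adaptive.enabled": "true",
    # Let AQE merge small shuffle partitions after a shuffle.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Let AQE split skewed partitions in a sort-merge join.
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # A partition is treated as skewed when it is larger than
    # factor x median partition size AND larger than the byte threshold.
    "spark.sql.adaptive.skewJoin.skewedPartitionFactor": "5",
    "spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes": "256MB",
}


def build_session(app_name="aqe-skew-demo", conf=AQE_SKEW_CONF):
    """Create a SparkSession with the given settings (requires pyspark)."""
    # Imported lazily so the settings dict can be inspected without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def join_with_broadcast(transactions_df, customers_df, key="cust_id"):
    """Join the large table to the small one, hinting Spark to broadcast the small side."""
    from pyspark.sql.functions import broadcast

    # The broadcast hint asks Spark to ship customers_df to every executor,
    # so the large transactions_df does not need to be shuffled by key.
    return transactions_df.join(broadcast(customers_df), on=key, how="inner")
```

With a session from `build_session()`, calling `.explain()` on a joined DataFrame shows which join strategy Spark picked. The broadcast hint only makes sense when the smaller dataset fits comfortably in executor memory.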