On Improving Broadcast Joins in Apache Spark SQL

On Improving Broadcast Joins in Apache Spark SQL

10.551 Lượt nghe
On Improving Broadcast Joins in Apache Spark SQL
Broadcast join is an important part of Spark SQL’s execution engine. When used, it performs a join on two relations by first broadcasting the smaller one to all Spark executors, then evaluating the join criteria with each executor’s partitions of the other relation. When the broadcasted relation is small enough, broadcast joins are fast, as they require minimal data shuffling. Above a certain threshold however, broadcast joins tend to be less reliable or performant than shuffle-based join algorithms, due to bottlenecks in network and memory usage. This talk shares the improvements Workday has made to increase the threshold of relation size under which broadcast joins in Spark are practical. We cover topics such as executor-side broadcast, where data is broadcasted directly between executors instead of being first collected to the driver, as well as the changes we made in Spark’s whole-stage code generator to accommodate the increased threshold for broadcasting. Specifically, we highlight how we limited the memory usage of broadcast joins in executors, so that we can increase the threshold without increasing executor memory. Furthermore, we present our findings from running these improvements in production, on large scale jobs compiled from complex ETL pipelines. From this session, the audience will be able to peek into internals of Spark’s join infrastructure, and take advantage of our learnings to better tune their own workloads. About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business. Read more here: https://databricks.com/product/unifie... Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc/ Databricks is proud to announce that Gartner has named us a Leader in both the 2021 Magic Quadrant for Cloud Database Management Systems and the 2021 Magic Quadrant for Data Science and Machine Learning Platforms. Download the reports here. https://databricks.com/databricks-named-leader-by-gartner