Dive into the world of big data processing with our PySpark Practice playlist. This series is designed for both beginners and seasoned data professionals looking to sharpen their Apache Spark skills through scenario-based questions and challenges.
Each video provides step-by-step solutions to real-world problems, helping you master PySpark techniques and improve your data-handling capabilities. Whether you're preparing for a job interview or simply learning more about Spark, this playlist is your go-to resource for practical, hands-on learning. Join us to become a PySpark expert!
In this video, we used Databricks to build multiple ETL pipelines using PySpark, the Python API of Apache Spark.
We used sources such as CSV, Parquet, and Delta tables, then applied the Factory Pattern to create the reader class. The Factory Pattern is one of the most widely used low-level designs in data engineering pipelines that involve multiple sources.
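A minimal sketch of what such a reader factory can look like. The class and method names here are illustrative, not the exact ones used in the video's notebook:

```python
# Illustrative Factory Pattern for source readers.
# Class and method names are hypothetical, not taken from the video's code.

class Reader:
    """Base reader: each subclass knows how to load one source format."""
    def read(self, spark, path):
        raise NotImplementedError

class CSVReader(Reader):
    def read(self, spark, path):
        return spark.read.option("header", "true").csv(path)

class ParquetReader(Reader):
    def read(self, spark, path):
        return spark.read.parquet(path)

class DeltaReader(Reader):
    def read(self, spark, path):
        return spark.read.format("delta").load(path)

def get_reader(source_type):
    """Factory: map a source-type string to the matching reader object."""
    readers = {"csv": CSVReader, "parquet": ParquetReader, "delta": DeltaReader}
    if source_type not in readers:
        raise ValueError(f"Unsupported source type: {source_type}")
    return readers[source_type]()

# Usage: df = get_reader("csv").read(spark, "/path/to/file.csv")
```

The benefit is that adding a new source (say, JSON) only means adding one reader class and one dictionary entry, without touching the pipeline code that calls the factory.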
Then we used the PySpark DataFrame API and Spark SQL to write the business transformation logic. In the loader part, we loaded the data in two ways: one into a data lake and the other into a data lakehouse.
While solving the problems, we also demonstrate some of the most frequently asked PySpark #interview problems. We discussed and demonstrated many concepts, such as broadcast joins, partitioning and bucketing, SparkSession, window functions like LAG and LEAD, Delta tables, and more.
After watching, please let us know your thoughts, and stay tuned to this playlist for all upcoming videos.
𝗝𝗼𝗶𝗻 𝗺𝗲 𝗼𝗻 𝗦𝗼𝗰𝗶𝗮𝗹 𝗠𝗲𝗱𝗶𝗮:
🔅 Topmate (For collaboration and Scheduling calls) - https://topmate.io/ankur_ranjan
🔅 LinkedIn - https://www.linkedin.com/in/thebigdatashow
🔅 Instagram - https://www.instagram.com/ranjan_anku/
Databricks notebook links: download the zip folder, extract it, and then open the HTML files as notebooks in the Databricks Community Edition.
🔅 Recommended link for Databricks Community Edition login (after signing up):
https://community.cloud.databricks.com/
🔅 Ankur's Notebook source files
https://drive.google.com/file/d/15FBgxq705uAOYDgY61urRf3m_ma3hJec/view?usp=sharing
🔅 Input table files
https://drive.google.com/drive/folders/1G46IBQCCi5-ukNDwF4KkX4qHtDNgrdn6
For practising different Data Engineering interview questions, go to the community section of our YouTube page.
https://www.youtube.com/@TheBigDataShow/community
Narrow vs Wide Transformation
Short Article link:
https://www.youtube.com/post/UgkxORdDnlDnjXQZJZTX4fXFTArZuMTax5Xt
Question 1:
https://www.youtube.com/post/UgkxD7nX9pxdFwrm2L7qDu7bg6V4zlEivAki
Question 2:
https://www.youtube.com/post/UgkxOrZ3zClcLy__L4zI1sA5axv2NoK7K-W4
Question 3:
https://www.youtube.com/post/UgkxQgVAp4XwG8epqIAozk9JcPflhJVk-Hlm
Question 4:
https://www.youtube.com/post/UgkxIaBfwpw4maJ2fCH3BJl-7Y9260e_irJ4
Question 5:
https://www.youtube.com/post/Ugkxz6eBqKD1AzvV1qX6OutenFGmjkyyT0hF
Question 6:
https://www.youtube.com/post/UgkxOiSXVx4cVmxL56ZBpCs5Z1AVwsZurA2C
Question 7:
https://www.youtube.com/post/UgkxiebQB6LxzhufaYR46DG1UbvRQ_4jSeHu
Question 8:
https://www.youtube.com/post/UgkxzUpBB6PLeC7v0u-qMvoAICE9go27Q-g_
Question 9:
https://www.youtube.com/post/UgkxZiWzepo7WhXVT1OwOnK6wdVVCVw5ys2t
Question 10:
https://www.youtube.com/post/UgkxwZ_iL0RUUANGPXGJTIbK7f_qv02YsirB
Broadcast Join in #apachespark
Small article link:
https://www.youtube.com/post/Ugkx9Cjyr88rszIfXLop1YebK5Uus0MfZnRj
MCQs list
1. https://www.youtube.com/channel/UCnVhEl576fIHgfneb1KdugA/community?lb=Ugkxiuj7Q9wcn9rrYYmBsHpEkGxeBzjFzydo
2. https://www.youtube.com/channel/UCnVhEl576fIHgfneb1KdugA/community?lb=UgkxFljj2l_4FF-GgFs36s655m2Vf_A-69U7
3. https://www.youtube.com/channel/UCnVhEl576fIHgfneb1KdugA/community?lb=Ugkxef8jGrl0HuSe0OkgG715rqyVSq2pmn_Y
4. https://www.youtube.com/channel/UCnVhEl576fIHgfneb1KdugA/community?lb=Ugkx4DLiWcq8cs0GUq-GpKbMTUFvXMAmB7wH
5. https://www.youtube.com/channel/UCnVhEl576fIHgfneb1KdugA/community?lb=Ugkxv4sNY3FhjaqSiGUALSu_Y_iwqduIxAS-
Check the COMMUNITY section for a full list of questions.
Chapters
00:00 - Project Introduction
12:04 - How to use Databricks for any PySpark/Spark project?
25:09 - Low-Level Design Code
40:39 - Jobs, Stages, and Actions in Spark
45:22 - Designing a code base for the Spark project
51:40 - Applying the first business logic in the transformer class
57:34 - Difference between the LAG & LEAD window functions
01:28:42 - Broadcast Join in Apache Spark/PySpark
01:47:50 - Difference between Partitioning and Bucketing in Apache Spark/PySpark
02:07:00 - Detailed summary of the first pipeline
02:14:00 - Second pipeline goal
02:24:57 - collect_set() and collect_list() in Spark/PySpark
02:48:53 - Detailed summary of the second pipeline
02:51:03 - Why Delta Lake when we already have a data lake?
02:54:51 - Summary
#databricks #delta #pyspark #practice #dataengineering #apachespark #problemsolving
#spark #bigdata #interviewquestions #sql #datascience #dataanalytics