In this Databricks #tutorial, I demonstrate key #pyspark transformations and the most common data #engineering techniques, then apply them to common data engineering challenges.
Chapters:
00:00:00- Learning objectives
00:05:53- Overview of Apache Spark: Key features, concepts, languages
00:10:59- Course setup
00:19:04- File system commands: fs, dbutils
00:20:15- Reading CSV files into a DataFrame
00:22:39- Inferring file schema
00:23:45- Specifying source file schema
00:24:11- Read/write to Delta Table
00:26:18- Inspecting table properties
00:26:59- Using SQL magic
00:27:30- Ingesting semi-structured (JSON) data
00:28:40- Using StructType and StructField constructs
00:30:05- Column transformations: Select method
00:31:20- SelectExpr method
00:31:53- Useful SQL functions: cast, round, metadata
00:33:19- Using withColumn method
00:34:58- Generating sequential row ids with monotonically_increasing_id function
00:36:28- Conditional expressions: when..otherwise function
00:37:08- Parsing complex data
00:38:27- Table transformations: Filters
00:39:27- Ordering data
00:41:26- Table joins
00:42:02- Aggregates
00:44:53- Window functions: Row ranking
00:46:50- Comparative analysis functions: Lag and Lead
00:51:23- User Defined Functions: Python functions
00:55:34- Pandas UDFs: Series to Series
00:57:38- Pandas UDFs: Iterator of Series to Iterator of Series
01:00:37- Pandas UDFs: Group Mapping with applyInPandas method
01:02:40- Pandas UDFs: mapInPandas
01:04:10- Schema evolution
01:07:03- Auto-generated columns in Spark SQL
01:09:14- Using Delta API for greater ingestion control (delete, update, upsert)
01:15:45- Table optimization
01:17:05- Common Data Engineering Challenges and Solutions: Deduplication
01:20:32- Handling missing values: Skip the rows or replace values
01:24:35- Generating Date/Time dimensions with Auto-generate feature
01:26:27- Schema normalization
01:29:55- Time travel
01:33:07- Table clones: Deep and shallow clones
01:35:05- Table partitions
Please subscribe: https://www.youtube.com/channel/UC8d958MxE2t1dr27QNqoOhA
Download demo/exercise notebooks from here:
https://github.com/fazizov/youtube/blob/main/Data%20engineering%20with%20Databricks/MasteringPyspark/Study%20PySpark.dbc
Source data files:
https://github.com/fazizov/youtube/blob/main/Data%20engineering%20with%20Databricks/MasteringPyspark/product_data.json
https://github.com/fazizov/youtube/blob/main/Data%20engineering%20with%20Databricks/MasteringPyspark/sales_data.csv
https://github.com/fazizov/youtube/blob/main/Data%20engineering%20with%20Databricks/MasteringPyspark/sales_data_corrupted.csv
To sign up for the Databricks Community Edition, see:
https://docs.databricks.com/en/getting-started/community-edition.html