Apache Spark For Data Engineering

1,217 plays
In this Databricks #tutorial, I demonstrate key #pyspark transformations and the most common data #engineering techniques, then apply them to common data engineering challenges.

Chapters:
00:00:00 - Learning objectives
00:05:53 - Overview of Apache Spark: Key features, concepts, languages
00:10:59 - Course setup
00:19:04 - File system commands: fs, dbutils
00:20:15 - Reading CSV files into a DataFrame
00:22:39 - Inferring file schema
00:23:45 - Specifying source file schema
00:24:11 - Read/write to Delta table
00:26:18 - Inspecting table properties
00:26:59 - Using SQL magic
00:27:30 - Ingesting semi-structured (JSON) data
00:28:40 - Using StructType and StructField constructs
00:30:05 - Column transformations: select method
00:31:20 - selectExpr method
00:31:53 - Useful SQL functions: cast, round, metadata
00:33:19 - Using withColumn method
00:34:58 - Generating sequential row IDs with the monotonically_increasing_id function
00:36:28 - Conditional expressions: when...otherwise function
00:37:08 - Parsing complex data
00:38:27 - Table transformations: Filters
00:39:27 - Ordering data
00:41:26 - Table joins
00:42:02 - Aggregates
00:44:53 - Window functions: Row ranking
00:46:50 - Comparative analysis functions: lag and lead
00:51:23 - User Defined Functions: Python functions
00:55:34 - Pandas UDFs: Series to Series
00:57:38 - Pandas UDFs: Iterator of Series to Iterator of Series
01:00:37 - Pandas UDFs: Group mapping with the applyInPandas method
01:02:40 - Pandas UDFs: mapInPandas
01:04:10 - Schema evolution
01:07:03 - Auto-generated columns in Spark SQL
01:09:14 - Using Delta API for greater ingestion control (delete, update, upsert)
01:15:45 - Table optimization
01:17:05 - Common Data Engineering Challenges and Solutions: Deduplication
01:20:32 - Handling missing values: Skip the rows or replace values
01:24:35 - Generating Date/Time dimensions with the auto-generate feature
01:26:27 - Schema normalization
01:29:55 - Time travel
01:33:07 - Table clones: Deep and shallow clones
01:35:05 - Table partitions

Please subscribe: https://www.youtube.com/channel/UC8d958MxE2t1dr27QNqoOhA

Download demo/exercise notebooks from here: https://github.com/fazizov/youtube/blob/main/Data%20engineering%20with%20Databricks/MasteringPyspark/Study%20PySpark.dbc

Source data files:
https://github.com/fazizov/youtube/blob/main/Data%20engineering%20with%20Databricks/MasteringPyspark/product_data.json
https://github.com/fazizov/youtube/blob/main/Data%20engineering%20with%20Databricks/MasteringPyspark/sales_data.csv
https://github.com/fazizov/youtube/blob/main/Data%20engineering%20with%20Databricks/MasteringPyspark/sales_data_corrupted.csv

To sign up for the Databricks Community Edition, see: https://docs.databricks.com/en/getting-started/community-edition.html