Apache Spark For Data Engineering

In this Databricks #tutorial, I demonstrate key #pyspark transformations and the most common data #engineering techniques, then apply them to common data engineering challenges.
Chapters:
00:00 - Learning objectives
05:53 - Overview of Apache Spark: Key features, concepts, languages
10:59 - Course setup
19:04 - File system commands: fs, dbutils
20:15 - Reading CSV files into a data frame
22:39 - Inferring file schema
23:45 - Specifying source file schema
24:11 - Read/write to Delta table
26:18 - Inspecting table properties
26:59 - Using SQL magic
27:30 - Ingesting semi-structured (JSON) data
28:40 - Using StructType and StructField constructs
30:05 - Column transformations: select method
31:20 - selectExpr method
31:53 - Useful SQL functions: cast, round, metadata
33:19 - Using withColumn method
34:58 - Generating sequential row ids with monotonically_increasing_id function
36:28 - Conditional expressions: when...otherwise function
37:08 - Parsing complex data
38:27 - Table transformations: Filters
39:27 - Ordering data
41:26 - Table joins
42:02 - Aggregates
44:53 - Window functions: Row ranking
46:50 - Comparative analysis functions: lag and lead
51:23 - User Defined Functions: Python functions
55:34 - Pandas UDFs: Series to Series
57:38 - Pandas UDFs: Iterator of Series to Iterator of Series
1:00:37 - Pandas UDFs: Group mapping with applyInPandas method
1:02:40 - Pandas UDFs: mapInPandas
1:04:10 - Schema evolution
1:07:03 - Auto-generated columns in Spark SQL
1:09:14 - Using Delta API for greater ingestion control (delete, update, upsert)
1:15:45 - Table optimization
1:17:05 - Common data engineering challenges and solutions: Deduplication
1:20:32 - Handling missing values: Skip the rows or replace values
1:24:35 - Generating date/time dimensions with the auto-generate feature
1:26:27 - Schema normalization
1:29:55 - Time travel
1:33:07 - Table clones: Deep and shallow clones
1:35:05 - Table partitions

Please subscribe: https://www.youtube.com/channel/UC8d958MxE2t1dr27QNqoOhA
Download the demo/exercise notebooks here:
https://github.com/fazizov/youtube/blob/main/Data%20engineering%20with%20Databricks/MasteringPyspark/Study%20PySpark.dbc
Source data files:
https://github.com/fazizov/youtube/blob/main/Data%20engineering%20with%20Databricks/MasteringPyspark/product_data.json
https://github.com/fazizov/youtube/blob/main/Data%20engineering%20with%20Databricks/MasteringPyspark/sales_data.csv
https://github.com/fazizov/youtube/blob/main/Data%20engineering%20with%20Databricks/MasteringPyspark/sales_data_corrupted.csv
To sign up for the Databricks Community Edition, see:
https://docs.databricks.com/en/getting-started/community-edition.html
