Data quality is a critical concern in any complex data environment, particularly when large volumes of data are spread across multiple locations. If you need to systematically identify and visualise potential issues, schedule periodic scans, and notify the relevant teams across an organisation at scale, where do you begin?
This is where the automated data profiling and data quality scanning capabilities of Dataplex on Google Cloud can prove invaluable. They require no infrastructure setup and offer a straightforward way to define and run rules for data profiling and quality checks, so they could serve as a solid foundation for your large-scale data quality framework.
01:16 - Data Profiling vs Data Quality Scan
02:37 - Dataplex auto profiling
08:15 - Dataplex auto data quality scan
10:47 - Profiling hinted quality rules & YAML via CLI
18:36 - Other options to create scans
21:08 - Sensitive data considerations
22:02 - Summary
Slides: https://drive.google.com/file/d/13khs0dM-TTsqcoyqWanTSDJHcAHpptqV/view
Repo: https://github.com/rocketechgroup/dataplex_auto_dq
---
Need to modernise your data stack? I specialise in Google Cloud solutions, including migrating your analytics workloads into BigQuery, optimising performance, and tailoring solutions to fit your business needs. With deep expertise in the Google Cloud ecosystem, I’ll help you unlock the full potential of your data. Curious about my work? Check out https://www.fundamenta.co/my-work to see the impact I’ve made. Let’s chat! Book a call at https://calendly.com/richard-he-fundamenta or email
[email protected]. 🚀📊