Sputnik: Airbnb’s Apache Spark Framework for Data Engineering
Apache Spark is a general-purpose big data execution engine. You can work with different data sources through the same set of APIs in both batch and streaming mode. That flexibility is great if you are an experienced Spark developer solving a complicated data engineering problem, which might include ML or streaming. At Airbnb, 95% of all data pipelines are daily batch jobs that read from Hive tables and write to Hive tables. For such jobs, you would like to trade some flexibility for more extensive functionality around writing to Hive and orchestrating processing across multiple days. Another advantage of reducing flexibility is that it establishes ‘best practices’ that less experienced data engineers can follow.
At Airbnb we’ve created a framework called ‘Sputnik’, which tries to address these issues. Data engineers extend the Sputnik base class and write the code for the data transformation, without worrying about filtering the dates for which the job runs. End users do not read from or write to Hive directly; they use Sputnik wrappers for Hive. The read wrapper filters input data based on parameters passed from the console, including the time frame. The write wrapper gets information about the result table from case class annotations, writes meta-information about the table, runs checks on the data and much more. The core idea of the framework is that all functionality of a job consists of job-specific logic and run-specific logic. Job-specific logic is the transformation defined by the data engineer, plus meta-information about the tables. Run-specific logic is filtering the input data based on the current date and writing the data to Hive. The data engineer specifies the job-specific logic, and Sputnik handles all run-specific logic based on assumptions about the right way of operating daily Hive batch jobs. https://github.com/airbnb/sputnik
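The split between job-specific and run-specific logic can be illustrated with a small, framework-free sketch. This is not Sputnik's actual API: every name below (`RunConfig`, `TableReader`, `TableWriter`, `DailyJob`, `VisitsPerUser`, the `ds` date column) is hypothetical, and in-memory lists of dicts stand in for Hive tables. It only shows the shape of the idea: the job author writes a transformation and declares the output table, while the base class and wrappers handle date filtering, checks and writing.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List

Row = Dict[str, object]


@dataclass(frozen=True)
class RunConfig:
    """Run-specific parameters, e.g. parsed from the console (hypothetical)."""
    start: date
    end: date  # inclusive


class TableReader:
    """Hypothetical read wrapper: only returns rows inside the run's date range."""

    def __init__(self, tables: Dict[str, List[Row]], run: RunConfig):
        self._tables = tables
        self._run = run

    def read(self, name: str) -> List[Row]:
        return [
            row for row in self._tables[name]
            if self._run.start <= row["ds"] <= self._run.end
        ]


class TableWriter:
    """Hypothetical write wrapper: checks the rows, then stores the result."""

    def __init__(self, tables: Dict[str, List[Row]]):
        self._tables = tables

    def write(self, name: str, rows: List[Row], required_columns: List[str]) -> None:
        for row in rows:
            missing = [col for col in required_columns if col not in row]
            if missing:
                raise ValueError(f"row {row} is missing columns {missing}")
        # Simplification: replaces the whole table instead of only the
        # partitions covered by this run.
        self._tables[name] = list(rows)


class DailyJob:
    """Base class: subclasses supply only the job-specific logic."""
    output_table: str
    output_columns: List[str]

    def transform(self, reader: TableReader) -> List[Row]:
        raise NotImplementedError

    def run(self, tables: Dict[str, List[Row]], run: RunConfig) -> None:
        # Run-specific logic lives here, not in the job.
        rows = self.transform(TableReader(tables, run))
        TableWriter(tables).write(self.output_table, rows, self.output_columns)


class VisitsPerUser(DailyJob):
    """Job-specific logic: count visits per user per day."""
    output_table = "visits_per_user"
    output_columns = ["ds", "user_id", "visits"]

    def transform(self, reader: TableReader) -> List[Row]:
        counts: Dict[tuple, int] = {}
        for row in reader.read("visits"):
            key = (row["ds"], row["user_id"])
            counts[key] = counts.get(key, 0) + 1
        return [
            {"ds": ds, "user_id": user, "visits": n}
            for (ds, user), n in sorted(counts.items())
        ]
```

Note that `VisitsPerUser.transform` contains no date handling at all. Calling `VisitsPerUser().run(tables, RunConfig(start, end))` with a different date range reprocesses a different time frame without any change to the job code, which is the property the abstract describes.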