Distributed Data Engineering
Learn how to perform data transformations on big data in Python by building and deploying distributed data pipelines optimised for processing massive data volumes.
This course provides a hands-on, in-depth exploration of the industry-standard Apache Spark unified analytics engine, specifically its Spark SQL, DataFrame and Dataset APIs, which are used to build distributed data pipelines capable of processing massive data volumes ranging from gigabytes (GB) to petabytes (PB) in size. This course follows on from our Python for Data Analysis course and enables experienced data engineers to load, model, transform, merge and analyse huge volumes of structured and unstructured data.
- The ability to load large structured, semi-structured and unstructured data files (including Parquet, ORC, JSON and Avro files) into distributed and efficient in-memory data structures.
- The ability to design, build and optimise end-to-end distributed data pipelines capable of loading, merging and transforming large disparate datasets, and saving the transformed and modelled data into SQL and NoSQL distributed databases.
- The ability to analyse and derive actionable insights from large disparate datasets in order to solve real-world business problems (e.g. descriptive statistics, trend analysis and forecasting).
- Knowledge of the industry-standard Apache Spark unified analytics engine for distributed transformations of big data.