Machine Learning with Big Data
Learn how to apply statistical learning techniques to big data in Python by building, interpreting, visualising and evaluating distributed machine learning models optimised for massive data volumes.
This course provides a hands-on, in-depth exploration of Apache Spark, the industry-standard unified analytics engine, and in particular MLlib, its distributed machine learning library. MLlib is used throughout to build, visualise and evaluate distributed machine learning models for real-world business problems that require learning from massive data volumes, ranging from gigabytes (GB) to petabytes (PB) in size. The course follows on from our Statistical Learning course and enables senior data scientists to apply the mathematical techniques introduced there to real-world use-cases, making predictions and deriving actionable insights from big data. It covers how to build and evaluate linear models for regression and classification, tree-based models and clustering models, together with techniques for model selection and fine-tuning at big data scale.