Model Reusability in the SparkML Library

This project demonstrates the development and integration of custom estimators and transformers within the SparkML pipeline framework.
Using a loan prediction dataset, I perform exploratory data analysis to understand the raw data. The preprocessing steps include:
- Converting feature data types using custom estimators and transformers
- Imputing missing values with custom estimators and transformers (a minimal sketch of both custom components follows this list)
- Indexing categorical feature levels using StringIndexer
- Encoding categorical features using OneHotEncoder
- Assembling input features into a vector using VectorAssembler (the standard stages are also sketched below)
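
The project's own code is not shown here, so the following is a minimal PySpark sketch of what such custom components could look like; it assumes Python, and the class names `ColumnCaster`, `MeanImputer`, and `MeanImputerModel` are hypothetical. The `DefaultParamsReadable`/`DefaultParamsWritable` mixins are included so the custom stages can be persisted inside a saved PipelineModel.

```python
from pyspark import keyword_only
from pyspark.ml import Estimator, Model, Transformer
from pyspark.ml.param import Param, Params, TypeConverters
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import DataFrame, functions as F


class ColumnCaster(Transformer, DefaultParamsReadable, DefaultParamsWritable):
    """Custom transformer: casts the listed columns to a target Spark SQL type."""

    inputCols = Param(Params._dummy(), "inputCols", "columns to cast",
                      typeConverter=TypeConverters.toListString)
    dtype = Param(Params._dummy(), "dtype", "target type, e.g. 'double'",
                  typeConverter=TypeConverters.toString)

    @keyword_only
    def __init__(self, inputCols=None, dtype="double"):
        super().__init__()
        self._setDefault(inputCols=[], dtype="double")
        kwargs = self._input_kwargs
        self._set(**{k: v for k, v in kwargs.items() if v is not None})

    def _transform(self, df: DataFrame) -> DataFrame:
        target = self.getOrDefault(self.dtype)
        for c in self.getOrDefault(self.inputCols):
            df = df.withColumn(c, F.col(c).cast(target))
        return df


class MeanImputerModel(Model, DefaultParamsReadable, DefaultParamsWritable):
    """Fitted side of the custom imputer: fills nulls with the means learned at fit time."""

    # The learned means are kept in a Param so DefaultParamsWritable can persist them
    # when the enclosing PipelineModel is saved.
    means = Param(Params._dummy(), "means", "per-column means learned during fit")

    def __init__(self, means=None):
        super().__init__()
        if means is not None:
            self._set(means=means)

    def _transform(self, df: DataFrame) -> DataFrame:
        return df.fillna(self.getOrDefault(self.means))


class MeanImputer(Estimator, DefaultParamsReadable, DefaultParamsWritable):
    """Custom estimator: computes per-column means in fit() and returns a MeanImputerModel."""

    inputCols = Param(Params._dummy(), "inputCols", "numeric columns to impute",
                      typeConverter=TypeConverters.toListString)

    @keyword_only
    def __init__(self, inputCols=None):
        super().__init__()
        self._setDefault(inputCols=[])
        kwargs = self._input_kwargs
        self._set(**{k: v for k, v in kwargs.items() if v is not None})

    def _fit(self, df: DataFrame) -> MeanImputerModel:
        cols = self.getOrDefault(self.inputCols)
        means = df.select([F.mean(c).alias(c) for c in cols]).first().asDict()
        return MeanImputerModel(means=means)
```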
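The standard preprocessing stages might be configured along these lines; the column names (e.g. `Gender`, `ApplicantIncome`, `LoanAmount`) are assumptions for a typical loan dataset and may differ from the actual data.

```python
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

# Hypothetical column names for a loan prediction dataset
categorical_cols = ["Gender", "Married", "Property_Area"]
numeric_cols = ["ApplicantIncome", "LoanAmount"]

# One StringIndexer per categorical column, keeping unseen levels at scoring time
indexers = [
    StringIndexer(inputCol=c, outputCol=f"{c}_idx", handleInvalid="keep")
    for c in categorical_cols
]

# One-hot encode the indexed categories, then assemble everything into a feature vector
encoder = OneHotEncoder(
    inputCols=[f"{c}_idx" for c in categorical_cols],
    outputCols=[f"{c}_vec" for c in categorical_cols],
)
assembler = VectorAssembler(
    inputCols=[f"{c}_vec" for c in categorical_cols] + numeric_cols,
    outputCol="features",
)
```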
I then build a logistic regression model and interpret the results. All preprocessing stages and the model step are combined into a single Pipeline; the fitted PipelineModel is saved as the model artifact and later reloaded to score new, unseen data without retraining.
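
A sketch of this end-to-end flow, reusing the stage objects from the sketches above; the file paths, CSV format, and the assumption that the target has already been indexed into a numeric `label` column are all hypothetical.

```python
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical training data; the target is assumed to be a numeric "label" column
train_df = spark.read.csv("data/loan_train.csv", header=True, inferSchema=True)

lr = LogisticRegression(featuresCol="features", labelCol="label")

# Combine the custom stages and the standard stages from the sketches above
caster = ColumnCaster(inputCols=numeric_cols, dtype="double")
imputer = MeanImputer(inputCols=numeric_cols)
pipeline = Pipeline(stages=[caster, imputer] + indexers + [encoder, assembler, lr])

model = pipeline.fit(train_df)

# Persist the fitted PipelineModel, then reload it and score unseen data without retraining
model.write().overwrite().save("models/loan_pipeline")
reloaded = PipelineModel.load("models/loan_pipeline")

new_df = spark.read.csv("data/loan_new.csv", header=True, inferSchema=True)
scored = reloaded.transform(new_df).select("prediction", "probability")
scored.show(5)
```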
This project showcases the flexibility of SparkML, the power of custom components, and the efficiency of model reusability in machine learning workflows.