A quick guide to MachineLearningMastery’s Applied Machine Learning process

Define the problem

  • What is the problem?
    • Informal description
    • Formalism
      • Task (T)
      • Experience (E)
      • Performance (P)
    • Create a list of assumptions
      • Rule of thumb or domain specific knowledge
    • Think about similar problems that you have seen before.
  • Why does the problem need to be solved?
    • Motivation
    • What need is fulfilled when this problem is solved?
  • How would I solve the problem?
    • List out what data you will collect.
    • How to prepare that data?
    • How to design the program to solve the problem?
    • Try prototypes and experiments.

Prepare data

  • Data Selection: Consider what data is available, what data is missing, and what data can be removed.
  • Data pre-processing: Organize the data by formatting, cleaning, and sampling from it.
  • Data Transformation: Transform the pre-processed data so that it is ready for machine learning.
    • Scaling
    • Decomposition
    • Aggregation
  • Identify outliers in the data (see the sketch below)
    • Outlier modelling
      • Extreme value analysis
      • Probabilistic and Statistical models
      • Linear models
      • Proximity based models
      • Information theoretic models
      • High dimensional outlier detection
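
A minimal sketch of two of the outlier-modelling ideas above: extreme value analysis with z-scores and a proximity-based model (scikit-learn's LocalOutlierFactor). The data, the two planted outliers, and the 3-standard-deviation threshold are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(0, 1, 500), [8.0, -9.5]])  # two planted outliers

# Extreme value analysis: flag points more than 3 standard deviations from the mean.
z_scores = (x - x.mean()) / x.std()
print("z-score outliers:", x[np.abs(z_scores) > 3])

# Proximity-based model: LocalOutlierFactor marks outliers with the label -1.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(x.reshape(-1, 1))
print("LOF outliers:", x[labels == -1])
```
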
  • Improve model accuracy with data pre-processing (see the sketch below)
    • Add attributes to your data
      • Dummy attributes: convert a categorical attribute into n binary attributes.
      • Transformed attribute: a transformed variation of an attribute can be added to the dataset so that a linear method can exploit possible linear and non-linear relationships between attributes.
      • Missing data
    • Remove data attributes
      • Projection: dimensionality reduction
      • Spatial sign: project the data onto the surface of a multidimensional sphere.
      • Correlated attributes: some algorithms degrade in performance when highly correlated attributes are present; in that case the most correlated attributes can be removed.
    • Transform data attributes
      • Centering: transform the data to have a mean of zero and a standard deviation of one.
      • Scaling: normalize the data so that the values lie between 0 and 1.
      • Remove skew: some methods assume the data is normally distributed, so removing skew improves their results.
      • Box-Cox: a power transform used to remove skew.
      • Binning: Data discretization.
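
A minimal sketch of several of the pre-processing steps above (dummy attributes, centering, scaling, and Box-Cox) using pandas and scikit-learn; the tiny two-column DataFrame is invented purely for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, StandardScaler

df = pd.DataFrame({
    "colour": ["red", "green", "red", "blue"],        # categorical attribute
    "income": [32000.0, 45000.0, 150000.0, 38000.0],  # skewed numeric attribute
})

# Dummy attributes: convert the categorical column into n binary columns.
df = pd.get_dummies(df, columns=["colour"])

# Centering: mean of zero and a standard deviation of one.
df["income_std"] = StandardScaler().fit_transform(df[["income"]]).ravel()

# Scaling: normalize values into the range [0, 1].
df["income_01"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()

# Box-Cox: a power transform that removes skew (needs strictly positive values).
df["income_boxcox"] = PowerTransformer(method="box-cox").fit_transform(df[["income"]]).ravel()

print(df.round(2))
```
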
  • Feature Engineering: feature engineering is getting the most out of your data for predictive modelling. General examples of feature engineering (see the sketch below):
    • Decompose the categorical data.
    • Decompose the Date-Time.
    • Re-frame the numerical quantities.
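
A minimal sketch of the feature-engineering examples above: decomposing a date-time and re-framing a numeric quantity. The column names and the 1 kg threshold are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-01-15 08:30", "2023-06-02 22:10"]),
    "weight_grams": [1250, 380],
})

# Decompose the date-time into parts a model can use directly.
df["hour"] = df["timestamp"].dt.hour
df["dayofweek"] = df["timestamp"].dt.dayofweek
df["month"] = df["timestamp"].dt.month

# Re-frame the numerical quantity: a coarser unit plus a binary flag.
df["weight_kg"] = df["weight_grams"] / 1000.0
df["is_heavy"] = (df["weight_grams"] > 1000).astype(int)

print(df)
```
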
  • Feature Selection: feature selection is different from dimensionality reduction; in feature selection we include or exclude attributes without changing them. Feature selection algorithms (see the sketch below):
    • Filter methods: choose or remove features based on a score, e.g. the chi-squared test, information gain, or the correlation coefficient.
    • Wrapper methods: different combinations of features are prepared, evaluated, and compared; each combination is scored by the accuracy of a model built on it. Example: the recursive feature elimination algorithm.
    • Embedded methods: learn which features contribute to the accuracy of the model while the model is being created. Examples: LASSO, Elastic Net, and Ridge Regression.
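
A minimal sketch of a filter method (chi-squared scoring via SelectKBest) and a wrapper method (recursive feature elimination), assuming scikit-learn and its bundled iris dataset; keeping two features is an arbitrary choice for the example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter method: keep the two features with the highest chi-squared score.
X_filtered = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)

# Wrapper method: recursively eliminate features using a model's coefficients.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
X_wrapped = rfe.fit_transform(X, y)

print(X_filtered.shape, X_wrapped.shape, rfe.support_)
```
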
  • Handling imbalanced data: a classification problem where the classes are not represented equally. High accuracy can be reported, but it is misleading when the classes are imbalanced. Ways to tackle it (see the sketch below):
    • Try to collect more data.
    • Try to change your performance metric.
      • Confusion matrix
      • Precision
      • Recall
      • F1 Score
      • Kappa
      • ROC curve
    • Try re-sampling your data
      • Over-sampling
      • Under-sampling
    • Try generating synthetic samples
      • SMOTE
    • Try different algorithms
      • Although decision trees are often said to perform well on imbalanced data, always try different algorithms. Try a few popular decision tree algorithms: C4.5, C5.0, CART, and Random Forest.
    • Try penalized models: add an additional cost on the model for making classification mistakes on the minority class during training. Examples: penalized SVM and penalized LDA.
    • Try a different perspective.
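
A minimal sketch of two of the tactics above: changing the performance metric and generating synthetic samples with SMOTE. It assumes the third-party imbalanced-learn package is installed, and the 95/5 class split is synthetic.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic two-class problem with a 95/5 class imbalance.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=7)

# Generate synthetic minority-class samples on the training data only.
X_res, y_res = SMOTE(random_state=7).fit_resample(X_train, y_train)
print("class counts after SMOTE:", Counter(y_res))

# Report precision, recall, and F1 per class instead of plain accuracy.
model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
print(classification_report(y_test, model.predict(X_test)))
```
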
  • Data leakage in Machine Learning: data leakage is when information from outside the training dataset is used to create the model. An easy way to spot a leak is performance that seems a little too good to be true. Techniques to minimize data leakage (see the sketch below):
    • Perform data preparation within cross-validation folds: do not normalize or standardize the entire dataset up front. A non-leaky evaluation of a machine learning algorithm calculates the parameters for rescaling the data within each fold of the cross validation and uses those parameters to prepare the held-out test fold on each cycle. It is also a good idea to do the rest of your data preparation within the cross-validation folds, including tasks like feature selection, outlier removal, encoding, feature scaling, and projection methods for dimensionality reduction.
    • Hold back a validation dataset: set aside a validation set and keep it untouched until the final model is ready for a sanity check.
    • Five tips to combat data leakage
      • Temporal Cut-off
      • Add noise
      • Remove the leaky variables
      • Use pipelines
      • Use a holdout dataset
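
A minimal sketch of leak-free data preparation with a scikit-learn Pipeline: the scaler is fitted inside each cross-validation fold rather than on the whole dataset up front. The dataset is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=1)

pipeline = Pipeline([
    ("scale", StandardScaler()),                  # fitted on each training fold only
    ("model", LogisticRegression(max_iter=1000)),
])

# cross_val_score re-fits the whole pipeline on every training fold.
scores = cross_val_score(pipeline, X, y, cv=10, scoring="accuracy")
print("mean accuracy: %.3f" % scores.mean())
```
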

Spot check algorithms

  • Evaluating machine learning algorithms:
    • Test harness: the data on which you train and test an algorithm.
    • Performance Measure.
    • Train and test datasets.
    • Cross validation.
    • Testing algorithms (see the sketch below):
      • Start with a random algorithm as a baseline.
      • Select 5-10 standard algorithms that are appropriate to the problem.
      • Choose methods: include 10-20 different algorithms drawn from a diverse range of algorithm types.
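
A minimal sketch of spot checking: a handful of diverse scikit-learn algorithms are evaluated on the same test harness with 10-fold cross validation. The dataset and the particular shortlist are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=3)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "cart": DecisionTreeClassifier(),
    "knn": KNeighborsClassifier(),
    "naive_bayes": GaussianNB(),
    "svm": SVC(),
    "random_forest": RandomForestClassifier(),
}

# Same data, same resampling scheme, same metric for every candidate.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print("%-14s %.3f (+/- %.3f)" % (name, scores.mean(), scores.std()))
```
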
  • Why you should be spot-checking algorithms:
    • Benefits of spot checking algorithms:
      • Speed
      • Objective
      • Results
    • Tips for spot checking the algorithms:
      • Algorithm diversity.
      • Best Foot Forward.
      • Formal experiment: don’t just play around; treat spot checking as a formal experiment.
      • Jumping-off point: the best-performing algorithms are a starting point, not the solution to the problem.
      • Build your short list.
    • Top 10 Algorithms:
      • C4.5: decision tree algorithm
      • k-means: go-to clustering algorithm
      • Support Vector Machine: Huge field of study
      • Apriori: go-to algorithm for rule extraction
      • EM: go-to clustering algorithm
      • PageRank: graph-based problems
      • AdaBoost: family of boosting ensemble methods
      • kNN: simple and effective instance-based method
      • Naive Bayes: simple and robust application of Bayes’ theorem to data.
      • CART: Classification and regression tree.
  • Choosing the right test options when evaluating machine learning algorithms.
  • A data-driven approach to choosing a machine learning algorithm.

Improve Results

  • Improve the machine learning results
    • Algorithm tuning: adjust the algorithm’s parameters to improve performance.
    • Ensembles: combine the predictions of multiple models to get improved results (see the sketch below).
      • Bagging
      • Boosting
      • Blending
    • Extreme feature engineering
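
A minimal sketch of the three ensemble ideas above: bagging, boosting, and a simple blend of different model types via voting, using scikit-learn on a synthetic dataset. The particular base models are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=5)

ensembles = {
    "bagging": BaggingClassifier(),    # bags decision trees by default
    "boosting": AdaBoostClassifier(),  # boosts decision stumps by default
    "blending": VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("cart", DecisionTreeClassifier()),
            ("nb", GaussianNB()),
        ],
        voting="soft",                 # average the predicted probabilities
    ),
}

for name, model in ensembles.items():
    scores = cross_val_score(model, X, y, cv=10)
    print("%-9s %.3f" % (name, scores.mean()))
```
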
  • ML Performance improvement cheat sheet
    • Improve performance with the data
      • Get more data
      • Invent more data
      • Clean your data
      • Re-sample data
      • Re-frame your problem
      • Re-scale your data
      • Transform your data
      • Project your data
      • Feature selection
      • Feature engineering
    • Improve performance with Algorithms
      • Re-sampling Method
      • Evaluation metric
      • Baseline performance
      • Spot check linear algorithms
      • Spot check non-linear algorithms
      • Steal from literature
      • Standard configuration
    • Improve performance with algorithm tuning (see the sketch below)
      • Diagnostics
      • Try Intuition
      • Steal from the literature
      • Random Search
      • Grid search
      • Optimize
      • Alternate implementation
      • Algorithm Extensions
      • Algorithm customization
      • Contact Expert
    • Improve performance with ensembles
      • Blend model predictions
      • Blend Data representations
      • Blend data samples
      • Correct predictions
      • Learn to combine
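
A minimal sketch of the grid-search and random-search tuning steps above, applied to a random forest on a synthetic dataset; the parameter grid and the number of random iterations are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=1000, random_state=9)

param_grid = {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]}

# Grid search: exhaustively evaluate every combination in the grid.
grid = GridSearchCV(RandomForestClassifier(random_state=9), param_grid, cv=5)
grid.fit(X, y)

# Random search: sample a fixed number of combinations from the same space.
rand = RandomizedSearchCV(RandomForestClassifier(random_state=9), param_grid,
                          n_iter=5, cv=5, random_state=9)
rand.fit(X, y)

print("grid best:  ", grid.best_params_, "%.3f" % grid.best_score_)
print("random best:", rand.best_params_, "%.3f" % rand.best_score_)
```
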

Present Results

  • How to use machine learning results
    • Report results
      • Context
      • Problem
      • Solution
      • Findings
      • Limitations
      • Conclusions
    • Operationalize
      • Model Tests
      • Tracking
  • Deploying the predictive model to production
    • Specify the performance requirements
    • Separate the prediction algorithm from the model coefficients (see the sketch below)
      • Select or implement the prediction algorithm
      • Serialize your model coefficients.
    • Develop the automated tests for your model
    • Develop the back-testing and now-testing infrastructure
    • Challenge, then trial, model updates
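
A minimal sketch of separating the serialized model artefact from the serving code and giving it a basic automated test, using joblib; the file name and the toy training data are only placeholders.

```python
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=11)

# Train once, then serialize the fitted model (coefficients and all) to a file.
model = LogisticRegression(max_iter=1000).fit(X, y)
dump(model, "model.joblib")

# In production, load the artefact and predict; a simple automated test can
# assert that the reloaded model behaves exactly like the original.
restored = load("model.joblib")
assert (restored.predict(X[:10]) == model.predict(X[:10])).all()
print(restored.predict(X[:5]))
```
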
