
It's amazingly difficult to find an outline of the end-to-end machine learning process. As a total beginner, I find this lack of information frustrating, so I decided to try piecing together my own process by looking at a lot of tutorials that each do it a slightly different way.

I would like to have a standard process to go by, and once I am comfortable with it, I can choose to deviate. I'd like some input from you pillars of the industry. Is this a good routine for a beginner to follow?

  1. Get Data
  2. Clean Data
  3. Split data into Training and Test Data ~(80/20)
  4. Separately, for training and test sets:
    1. Normalize Data (continuous features):
      • standardize (divide by std. deviation)
      • center (subtract mean)
    2. Impute missing values
    3. Feature Engineering
    4. Encode Categorical Variables:
      • Integer Encoding
      • One Hot Encoding
      • Target Encoding
      • Weight of Evidence
  5. Separate labels from the test set if it is a classification problem. Keep them aside.
  6. Choose a few models.
  7. For each model, using k-fold cross-validation:
    1. Train base model on "training set".
    2. Tune and test hyperparameters on the "validation set"
    3. Save best scores and parameters
  8. Compare each model's final scores on the never-touched test data.
  9. Choose the model with the highest scores.
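For reference, the routine above can be sketched end to end in scikit-learn. This is a minimal sketch only: the breast-cancer toy dataset and the two candidate models are illustrative assumptions, not part of the question.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1-2. Get and clean data (toy dataset, already clean)
X, y = load_breast_cancer(return_X_y=True)

# 3. Split into training and test sets (~80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 4. Preprocessing lives inside a Pipeline, so it is re-fit on the
#    training folds only during cross-validation (no leakage).
models = {
    "logreg": make_pipeline(StandardScaler(),
                            LogisticRegression(max_iter=5000)),
    "forest": make_pipeline(RandomForestClassifier(random_state=42)),
}

# 6-7. Compare a few models with k-fold cross-validation
cv_scores = {name: cross_val_score(m, X_train, y_train, cv=5).mean()
             for name, m in models.items()}

# 8-9. Evaluate only the winner on the untouched test set
best_name = max(cv_scores, key=cv_scores.get)
best_model = models[best_name].fit(X_train, y_train)
print(best_name, round(best_model.score(X_test, y_test), 3))
```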

Edit: Thank you for the overwhelming number of responses. Lots of times my questions get a single answer or none at all. I appreciate the time taken to help out a beginner.

I have edited the steps above to reflect the wonderful answers below. I hope that this helps another beginner somewhere else.

rocksNwaves

5 Answers


This process will result in data leakage. The split needs to happen earlier: normalizing data before the split means that your training data contains information about your test data. I would put the split at step 3 in your flow chart.

A common step I think you have missed is imputation of missing values. I would put that before feature engineering.

Overall I think this is a good rough outline for a beginner to follow. It is overly simplistic and leaves a lot out, but I think you know that and you have to start somewhere.
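To make the leak concrete, here is a minimal sketch with synthetic data and scikit-learn's StandardScaler: the scaler must be fit on the training split only and then applied to both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Leaky: statistics computed over ALL rows, test set included
leaky = StandardScaler().fit(X)  # don't do this

# Correct: fit on the training split only, then transform both
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# The training split is exactly standardized; the test split is only
# approximately so, because its statistics came from the training data.
print(np.allclose(X_train_s.mean(axis=0), 0))  # True
```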

Simon Larsson

Yes, these are the basic steps, and within each step there is a lot more. If you want to go a bit deeper, you can follow Andriy Burkov's book, Machine Learning Engineering.

A couple of notes on your process:

Before "Get Data" I would put "Define the question to answer" or something similar, though maybe this part is taken for granted.

Feature engineering is one of the most important things in ML, so spending a bit more time there would probably help.

Normalizing data helps mainly with linear models; for decision-tree models it has little or no effect.

Integer/label encoding is not especially good; there are better options, such as target encoding and weight-of-evidence encoding. Have a look.
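As a minimal sketch of target (mean) encoding with pandas (the toy "city" column and target values are made up for illustration): each category is replaced by the mean of the target within that category, which in a real pipeline must be computed on training data only to avoid leakage.

```python
import pandas as pd

# Toy data: one categorical feature and a binary target
df = pd.DataFrame({
    "city":   ["A", "A", "B", "B", "B", "C"],
    "target": [ 1,   0,   1,   1,   0,   0 ],
})

# Target (mean) encoding: per-category mean of the target.
# Fit these means on TRAINING rows only in a real pipeline.
means = df.groupby("city")["target"].mean()
df["city_te"] = df["city"].map(means)
print(df["city_te"].tolist())  # A -> 0.5, B -> 2/3, C -> 0.0
```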

Carlos Mougan

After "Choose the model with the highest scores," maybe add "create an ensemble of models" and try to improve accuracy further.
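One way to sketch this is scikit-learn's VotingClassifier (the dataset and the two base models here are illustrative assumptions): soft voting averages the base models' predicted probabilities.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Soft voting averages each model's predicted class probabilities
ensemble = VotingClassifier(
    estimators=[
        ("lr", make_pipeline(StandardScaler(),
                             LogisticRegression(max_iter=5000))),
        ("rf", RandomForestClassifier(random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)
print(round(ensemble.score(X_test, y_test), 3))
```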

Sushil K

Is this a good routine for a beginner to follow?

Yes, it's very good.

You could add:

  • K-fold cross-validation (split the training data further into training and validation folds)
  • Feature selection before "Choose a few models."
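A minimal sketch of the k-fold suggestion, assuming scikit-learn's KFold with an illustrative dataset and model: each fold serves once as the validation set while the remaining folds are used for training.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# 5-fold CV: every row is validated on exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))
print(len(scores))  # 5
```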
fuwiak

Is this the end-to-end process?

  • Most importantly, you also need to understand the data you are using. It's not supposed to be a meat grinder. Add some univariate and multivariate analysis just before splitting your data. Look at the distributions and frequencies.
  • After you split 70/30 or 80/20 or whatever, are the distributions approximately similar?
  • I think you should also add touching base with stakeholders/business people just after feature engineering (and maybe add a loop arrow to reflect their feedback).
  • Another user mentioned ensemble models / model averaging at the end; I think that is also important. Wouldn't an ensemble model perform better than any single model?
  • You are also missing documentation - where are you documenting your steps? Is it all in your mind? How will others follow what you are doing?
  • What about four-eyes check aka pair programming?
  • What about version control? In most industries you will need to show how your models were derived and how they perform against alternatives.
  • What about checking edge cases for reasonable results for the best 2-3 models?
  • Model explainability - how can you or your users trust the model without understanding how it operates?
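The second point (checking that the splits look alike) can be sketched like this; the synthetic data is an assumption, and `stratify=y` keeps the class balance consistent across the splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 1))
y = (rng.random(1000) > 0.7).astype(int)  # ~30% positive class

# Stratify on y so class frequencies match across train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# Compare class balance and feature means across the two splits;
# they should be approximately equal
print(round(y_tr.mean(), 2), round(y_te.mean(), 2))
print(round(X_tr.mean(), 2), round(X_te.mean(), 2))
```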
J. Doe.