What Rebuilding Classic Machine Learning Algorithms from Scratch Actually Taught Me

Published April 13, 2026

Modern machine learning libraries make it possible to train strong baseline models in just a few lines of code.

That convenience is great for shipping fast.

But it also hides the exact layer where real intuition is built:
the implementation details.

To rebuild that intuition, I implemented several classic machine learning algorithms entirely from scratch in Python and evaluated them across both regression and classification datasets.

At first, I thought this would mainly be an exercise in re-deriving textbook formulas.

It turned into something much more useful:

a hands-on lesson in numerical stability, data representation, metric selection, and how model assumptions interact with real datasets.

This article is a technical retrospective on the most valuable lessons from that process.


Why Build Them from Scratch?

Using libraries like scikit-learn is excellent for production work.

But abstractions often compress away the exact failure modes that teach the most.

Rebuilding the models forced me to think carefully about:

  • how gradient descent behaves under different loss scales
  • why distance metrics can completely change KNN performance
  • how categorical features reshape model suitability
  • why evaluation metrics can be dangerously misleading
  • where numerical stability issues first appear

The goal was never to outperform optimized libraries.

It was to develop a better systems-level intuition for classical machine learning.

That goal turned out to be far more valuable than I expected.


The Algorithms I Implemented

The project included full implementations of:

  • Linear Regression
  • Logistic Regression
  • Decision Tree
  • K-Nearest Neighbors (KNN)
  • Naive Bayes

I evaluated them on several very different datasets:

  • Boston Housing
  • Wine Quality
  • Breast Cancer
  • Mushroom Classification
  • Robot Execution Failures

This mix was useful because each dataset stresses a different modeling assumption.

Some reward linearity.

Some reward local neighborhood structure.

Some are dominated by categorical splits.

Some expose severe class imbalance.

That variety made algorithm behavior much easier to reason about.


The First Real Problem Wasn’t the Model — It Was the Loss Scale

The first major issue showed up in linear regression.

I initially implemented the optimization using Sum of Squared Errors (SSE).

Mathematically this works.

In practice, it introduced unstable optimization almost immediately.

Because SSE sums the error over every sample, both the loss value and the gradient magnitude grew with dataset size.

That made learning rate tuning far more sensitive and occasionally caused unstable convergence.

Switching to Mean Squared Error (MSE) fixed the issue right away.

The predictive improvement itself was only moderate.

The real gain was training stability.

That implementation detail changed how I think about optimization:

optimization is shaped not only by the geometry of the loss surface, but also by the numerical scale of the gradients.

This is exactly the kind of lesson that rarely becomes obvious when working only with library APIs.
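To make the scale effect concrete, here is a minimal sketch (not the project's original code) comparing the gradient norm under SSE and MSE as the dataset grows:

```python
import numpy as np

def gradient(X, y, w, loss="mse"):
    """Gradient of the squared-error loss w.r.t. weights w.

    SSE: sum of squared errors  -> gradient grows with dataset size.
    MSE: mean of squared errors -> gradient is roughly size-invariant.
    """
    residual = X @ w - y
    grad = 2 * X.T @ residual
    return grad if loss == "sse" else grad / len(y)

rng = np.random.default_rng(0)
w = np.array([0.5, -1.0])  # deliberately far from the true weights

for n in (100, 10_000):
    X = rng.normal(size=(n, 2))
    y = X @ np.array([2.0, 3.0]) + rng.normal(scale=0.1, size=n)
    sse_norm = np.linalg.norm(gradient(X, y, w, "sse"))
    mse_norm = np.linalg.norm(gradient(X, y, w, "mse"))
    print(f"n={n:6d}  |grad_SSE|={sse_norm:12.1f}  |grad_MSE|={mse_norm:8.3f}")
```

The SSE gradient norm grows roughly 100x when the dataset grows 100x, while the MSE gradient stays on the same scale, which is why a single learning rate can work across dataset sizes under MSE but not under SSE.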


KNN Completely Changed My Intuition About “Simple Models”

I expected linear regression to dominate structured datasets like Boston Housing.

Instead, KNN consistently performed better.

That result was surprisingly instructive.

It highlighted a modeling truth that is easy to underestimate:

when the feature space preserves meaningful local structure, local similarity assumptions can outperform global linear assumptions.

This became even clearer on the wine quality datasets.

The relationship between chemical properties and quality score is not globally linear.

But nearby samples in feature space often share similar quality labels.

That is exactly the kind of structure KNN exploits well.

Implementing the full distance function myself made this insight much more concrete.

A “simple” algorithm can be extremely strong when its assumptions match the geometry of the data.
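For illustration, a minimal Euclidean-distance KNN regressor can be sketched as follows (a simplified stand-in, not the project's exact implementation):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    """Predict a regression target as the mean of the k nearest
    training points under Euclidean distance."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]          # indices of the k closest samples
    return y_train[nearest].mean()

# Tiny example: two well-separated clusters with distinct targets
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
y = np.array([1.0, 1.0, 1.0, 10.0, 10.0])
print(knn_predict(X, y, np.array([0.05]), k=3))  # averages the low cluster
```

When nearby points share similar targets, as in the toy clusters above, this local averaging needs no global functional form at all.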


Decision Trees Taught Me the Importance of Data Shape

Decision trees produced the most dramatic contrast in results.

For regression, they often performed poorly and sometimes produced negative R².

For classification, especially on discrete datasets, they performed exceptionally well.

The regression failure made sense after revisiting the implementation.

My tree used:

  • greedy threshold search
  • unique-value threshold candidates
  • no post-pruning
  • shallow max depth defaults
  • direct SSE-based split selection

That setup works well for clean piecewise boundaries.

But on noisy continuous targets, it can overfit local partitions and generalize poorly.
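As a sketch of that split logic (assuming a single numeric feature; the function name is illustrative), the greedy SSE-based threshold search looks roughly like this:

```python
import numpy as np

def best_split(x, y):
    """Greedy search over unique-value thresholds of one feature,
    choosing the split that minimises the total SSE of the two children."""
    best_sse, best_t = np.inf, None
    for t in np.unique(x)[:-1]:              # candidate thresholds (drop max)
        left, right = y[x <= t], y[x > t]
        sse = (((left - left.mean()) ** 2).sum()
               + ((right - right.mean()) ** 2).sum())
        if sse < best_sse:
            best_sse, best_t = sse, t
    return best_sse, best_t
```

Every unique value is a candidate, and nothing penalises a split that merely chases noise, which is exactly why this criterion overfits noisy continuous targets without pruning.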

On the other hand, categorical classification datasets like Mushroom were a perfect fit.

The split logic naturally aligned with the feature structure.

This reinforced one of the strongest lessons from the entire project:

the shape of the data matters more than the theoretical sophistication of the algorithm.


The Most Valuable Lesson Came from a “Good” Accuracy Score

The biggest practical lesson came from the Robot Execution Failure datasets.

Some logistic regression runs reported very high accuracy.

At first glance, the models looked successful.

Then I checked the F1 score.

It was exactly zero.

That immediately exposed the real behavior:

the model had learned to predict only the majority class.

This was a much stronger lesson than any textbook explanation of imbalance.

Seeing it emerge from my own implementation permanently changed how I evaluate classifiers.

Now I treat:

  • precision
  • recall
  • F1 score
  • confusion matrix
as primary metrics rather than optional diagnostics.

A high accuracy score can hide total model failure.

This project made that impossible to forget.


Most of the Hard Problems Were Outside the Algorithms

The actual formulas were rarely the hardest part.

The surrounding engineering work was.

The most time-consuming parts included:

  • categorical encoding
  • feature normalization
  • distance computation cost
  • probability underflow
  • variance smoothing
  • memory efficiency

For example, KNN is mathematically simple.

But brute-force distance search quickly becomes expensive as dataset size grows.

That naturally led me to think about production optimizations like:

  • KD-Tree
  • Ball Tree
  • Approximate Nearest Neighbor (ANN)

Similarly, Naive Bayes required careful variance smoothing to avoid divide-by-zero and unstable log probabilities.
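A minimal sketch of that smoothing (assuming a Gaussian likelihood per feature; `eps` is an illustrative constant, not a tuned value) looks like:

```python
import numpy as np

def gaussian_log_likelihood(x, mean, var, eps=1e-9):
    """Per-feature Gaussian log-density with variance smoothing.

    Summing log-probabilities avoids the underflow that multiplying
    many tiny densities causes; adding eps keeps zero-variance
    features from dividing by zero."""
    var = var + eps
    return (-0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)).sum()

# A feature that is constant in the training data has zero variance;
# without eps the density would divide by zero.
x = np.array([1.0, 2.0])
mean = np.array([1.0, 2.0])
var = np.array([0.0, 1.0])
print(gaussian_log_likelihood(x, mean, var))  # finite log-likelihood
```

Without the log-space sum, a few dozen features are enough to underflow a float64 product to zero; without the smoothing term, a single constant feature poisons every prediction.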

These “small” implementation details are where real engineering intuition gets built.


The Biggest Shift in How I Think About ML

Before this project, I often thought about model selection as:

which algorithm is stronger?

Now I think about it as:

which assumptions best match the structure of the data?

That shift is much more practical.

The experiments made several lessons very clear:

  • feature representation often matters more than model complexity
  • local structure can beat global assumptions
  • numerical stability can dominate optimization quality
  • evaluation metrics shape the story your results tell
  • preprocessing can matter as much as the model itself

Rebuilding these algorithms from scratch made those lessons feel real.

And that deeper intuition has been far more valuable than simply knowing how to call .fit().
