Intro to Machine Learning

A broad overview of machine learning – what it is, how it works, and the tools used to build models in Python.

If you showed a picture of a tiger to any person on the street, they would name it instantly. We recognize the orange body, the black stripes, the shape of the face – all of it in a fraction of a second.

For a computer, that same task is surprisingly hard.

A computer does not see an image the way we do. It sees a grid of numbers – each pixel is a small set of values describing a color, and its place in the grid gives its position. Nothing about those numbers obviously screams "tiger." So how does a computer ever get there?

It learns. And the field that makes that possible is called machine learning.

What a Model Is

When a machine learning system learns from data, what it produces is called a model. Think of a model as a function: you give it an input, and it gives you an answer.

In the tiger example, the input is an image and the output might be "97% tiger, 3% lion." The model did not have that answer built in – it figured it out by studying thousands of labeled images until it learned what tigers tend to look like.

A photo of a tiger with a prediction bar showing 97% Tiger and 3% Lion overlaid on the bottom. — The model assigns a confidence score to each category. It learned those associations from training data, not from hand-written rules.
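
To make the "model as a function" idea concrete, here is a minimal sketch in plain Python. It assumes the model produces a raw score per category and converts those scores to confidences with a softmax function, which is one common approach but not the only one. The name `toy_model`, the tiny image, and the hard-coded scores are all made up for illustration; a real model would compute its scores from the pixel values.

```python
import math

def softmax(scores):
    """Turn raw scores into confidences that are positive and sum to 1."""
    highest = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - highest) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

def toy_model(image):
    """Hypothetical stand-in for a trained model: image in, confidences out."""
    # A real model would derive these scores from the image.
    # They are hard-coded here purely for illustration.
    raw_scores = {"tiger": 3.5, "lion": 0.0}
    return softmax(raw_scores)

image = [[0.9, 0.5], [0.1, 0.4]]  # a made-up 2x2 grid of pixel values
prediction = toy_model(image)
for label, confidence in prediction.items():
    print(f"{label}: {confidence:.0%}")
```

With these made-up scores, the output is 97% for tiger and 3% for lion, matching the example above.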

Two Types of Tasks: Classification and Regression

Not all machine learning problems are about sorting things into categories. There are two broad types of tasks.

The tiger example is a case of classification. The model picks from a fixed set of options – in this case, "tiger" or "lion."

The other type is regression. Instead of picking a label, the model predicts a value.

Take housing prices. If you give a model data about houses – square footage, number of bedrooms, neighborhood – it can learn the mathematical relationship between those inputs and the price. Once trained, give it the details of a new house and it estimates a price.

A scatter plot with square footage on the x-axis and price on the y-axis. A best-fit line runs through the data points. — Linear regression finds the line that best fits the data. That line becomes the model's rule for estimating new prices.
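
Here is a small sketch of that idea, assuming Scikit-learn and NumPy are installed. The housing numbers are made up, and they were chosen to fall exactly on a straight line so the result is easy to check by hand. Real data would be noisier, and the fitted line would only approximate it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: one input (square footage) per house.
square_feet = np.array([[800], [1000], [1200], [1500], [2000]])
prices = np.array([130_000, 150_000, 170_000, 200_000, 250_000])

# Fit a straight line: price = slope * square_feet + intercept
model = LinearRegression()
model.fit(square_feet, prices)

slope = model.coef_[0]
intercept = model.intercept_
print(f"price = {slope:.0f} * square_feet + {intercept:.0f}")

# Estimate the price of a house the model has never seen.
estimate = model.predict(np.array([[1800]]))[0]
print(f"Estimated price for 1800 sq ft: {estimate:.0f}")
```

Because this toy data follows price = 100 * square_feet + 50000 exactly, the estimate for 1800 square feet comes out to about 230000.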

Quick check

What is the difference between classification and regression?

Where the Data Comes From: Training Data

Before a model can make any predictions, it needs to learn. That learning comes from training data. In the tiger example, training data might be thousands of labeled animal photos – each one tagged with the correct animal name.

The quality and size of the training data matter enormously. A model trained on blurry photos of only three tigers will not do well on new images. More data, more variety, and accurate labels all tend to lead to better results.

One of the harder parts of building a machine learning model is getting that data into shape. Raw data is often messy – missing values, inconsistent formats, irrelevant columns. Data preprocessing is the step where you fix all of that. It is unglamorous work, but skipping it leads to bad models.
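
As a rough sketch of what preprocessing can look like, the example below cleans a tiny made-up housing table with Pandas. The column names and values are invented for illustration, and the choices made here (dropping rows without a price, filling missing square footage with the median) are only one reasonable option among several.

```python
import pandas as pd

# Made-up raw data with typical problems: missing values,
# numbers stored as text, inconsistent spelling, an irrelevant column.
raw = pd.DataFrame({
    "sqft": [1200, None, 1500, 2000],
    "bedrooms": ["3", "2", "4", "3"],
    "neighborhood": ["Downtown", "downtown ", "Suburb", "SUBURB"],
    "listing_id": ["a1", "b2", "c3", "d4"],
    "price": [170000, 150000, None, 250000],
})

clean = (
    raw.drop(columns=["listing_id"])   # irrelevant for predicting price
       .dropna(subset=["price"])       # a row without a price cannot be a training example
       .reset_index(drop=True)
)
clean = clean.assign(
    sqft=clean["sqft"].fillna(clean["sqft"].median()),            # fill missing values
    bedrooms=clean["bedrooms"].astype(int),                       # text -> numbers
    neighborhood=clean["neighborhood"].str.strip().str.lower(),   # consistent format
)
print(clean)
```

After these steps, every remaining row has a price, the square footage has no gaps, and "Downtown" and "downtown " are treated as the same neighborhood.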

Quick check

Why does the quality of training data matter?

Supervised vs. Unsupervised Learning

Machine learning algorithms generally fall into one of two categories depending on how their training data is structured.

Supervised learning uses labeled data – each training example has an input and the expected output. The tiger classifier is supervised: every image comes with a label saying what animal it is. The housing price predictor is also supervised: every house comes with its actual price. The model's job is to learn the pattern connecting inputs to outputs.
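
A minimal sketch of supervised learning, assuming Scikit-learn is installed. Each training example pairs an input with its label. The two features (a stripe score and a mane score) are invented stand-ins for what a real image model would extract from pixels, and K-Nearest Neighbors is just one of many algorithms that could be used here.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up inputs: [stripe_score, mane_score] for six animals.
features = [
    [0.90, 0.1], [0.80, 0.0], [0.95, 0.2],   # tigers
    [0.10, 0.9], [0.00, 0.7], [0.20, 0.8],   # lions
]
# The expected output for each input. These labels make it "supervised".
labels = ["tiger", "tiger", "tiger", "lion", "lion", "lion"]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(features, labels)   # learn the pattern connecting inputs to labels

# Classify an animal the model has not seen before.
new_animal = [[0.85, 0.1]]
predicted = model.predict(new_animal)[0]
print(predicted)
```

The new animal's three nearest neighbors in the training data are all tigers, so the model predicts "tiger".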

Unsupervised learning works differently. There are no correct answers provided – just raw data. The algorithm has to find structure on its own.

A common unsupervised technique is clustering. Given a dataset of customer purchase histories, a clustering algorithm might group customers with similar buying habits together – without anyone telling it what the groups should be.

A scatter plot with unlabeled data points. Circles are drawn around three distinct clusters of points. — Clustering finds natural groupings in data. The algorithm discovers the groups on its own – no labels needed.
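
The sketch below shows clustering on made-up customer data, assuming Scikit-learn and NumPy are installed. It uses k-means, one common clustering algorithm. One caveat: k-means does not need labels, but it does need you to choose the number of clusters up front (3 here).

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: [store visits per month, average spend per visit].
# Note that there is no label column anywhere.
customers = np.array([
    [1, 20], [2, 25], [1, 22],         # rare visitors, small purchases
    [10, 200], [11, 210], [9, 190],    # regular visitors, large purchases
    [30, 50], [28, 55], [32, 45],      # frequent visitors, medium purchases
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(customers)

# Each customer gets a cluster number. The numbers themselves are arbitrary;
# what matters is which customers share the same one.
print(cluster_ids)
```

The algorithm only reports which customers belong together. Deciding what each group means, for example "bargain shoppers" or "big spenders", is still up to a person.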

Training Data vs. Testing Data

There is a trap that is easy to fall into when building a machine learning model: the model gets so good at memorizing its training examples that it fails when it sees anything new. This is called overfitting.

Think about studying for a final exam. If you only memorize the exact answers from last year's practice tests without understanding the underlying concepts, you might score perfectly on those old tests – but struggle the moment a question is phrased differently. You have memorized, not learned.

To catch overfitting, the data is split into two parts: training data and testing data. The model learns from the training portion. Then it is evaluated on the testing portion – data it has never seen before.

If the model performs well on both, it has likely learned patterns that carry over to new data. If it does great on the training data but poorly on the testing data, it has overfit.
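
Here is a sketch of that check, assuming Scikit-learn is installed. The dataset is synthetic and deliberately noisy, and the model is a decision tree with no depth limit, a kind of model that can memorize its training data. The 75/25 split is a common choice, not a rule.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some randomly assigned labels (flip_y) to make it noisy.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3,
    flip_y=0.2, random_state=0,
)

# Hold back 25% of the data. The model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# An unrestricted tree keeps splitting until it fits the training set.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Training accuracy: {train_accuracy:.2f}")
print(f"Testing accuracy:  {test_accuracy:.2f}")
```

A large gap between the two numbers is the warning sign. The tree scores perfectly on the examples it memorized, including the noisy ones, so only the testing score says anything about how it handles new data.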

A data table being divided into two labeled sections: Training Data on the left and Testing Data on the right. Caption: The training set teaches the model. The test set checks whether it actually learned – or just memorized. Source: https://machine-learning-and-data-science-with-python.readthedocs.io/en/latest/assignment1_sup_ml.html

Quick check

What does it mean if a model performs well on training data but poorly on testing data?

The Python Tools Behind Machine Learning

Most machine learning is built with Python. It is the dominant language in the field for a good reason: the code is readable, and the ecosystem of libraries built for data and machine learning is enormous.

Here is a quick map of the most commonly used ones:

Pandas handles data loading and cleaning. When you pull in a raw CSV of housing records and need to filter out bad rows or rename columns, Pandas is the tool for it.
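
For instance, a loading-and-filtering step might look like the sketch below. The CSV text and column names are made up, and the data is read from a string so the example runs without a file on disk. With a real file you would pass its path to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a raw file of housing records.
csv_text = """SqFt,Beds,Price
1200,3,170000
-50,2,150000
1500,4,200000
2000,3,
"""

houses = pd.read_csv(io.StringIO(csv_text))

# Rename columns to something clearer.
houses = houses.rename(
    columns={"SqFt": "square_feet", "Beds": "bedrooms", "Price": "price"}
)

# Filter out bad rows: impossible square footage or a missing price.
valid = houses[(houses["square_feet"] > 0) & houses["price"].notna()]
print(valid)
```

Two of the four rows survive: the one with negative square footage and the one with no price are dropped.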

NumPy handles the math underneath. It makes working with large arrays of numbers fast and efficient. Many of the other libraries in this list, including Pandas and Scikit-learn, are built on top of it.
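
A short sketch of what that looks like in practice, using made-up housing numbers. The key idea is that NumPy applies an operation to a whole array at once, with no explicit loop.

```python
import numpy as np

# Made-up data for five houses.
square_feet = np.array([800, 1000, 1200, 1500, 2000])
prices = np.array([130_000, 150_000, 170_000, 200_000, 250_000])

# Element-wise division across both arrays in one step.
price_per_sqft = prices / square_feet
print(price_per_sqft.round(2))

# Summary statistics are built in.
average_price = prices.mean()
print(average_price)

# Rescale square footage to mean 0 and standard deviation 1,
# a common preparation step before training some models.
scaled = (square_feet - square_feet.mean()) / square_feet.std()
print(scaled.round(2))
```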

For building models, the three most common libraries are Scikit-learn, TensorFlow, and PyTorch. Scikit-learn is the go-to for classical machine learning algorithms – decision trees, K-Nearest Neighbors, regression models. TensorFlow and PyTorch are for deep learning – neural networks and more complex architectures.

To understand what your data looks like and how your model is performing, Matplotlib and Seaborn let you build charts and visualizations directly in Python.
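
As a minimal sketch, here is a scatter plot of made-up housing data with Matplotlib. The file name `housing_scatter.png` is arbitrary, and the "Agg" backend is selected only so the example runs on machines without a display. In a notebook you would typically skip that line and let the chart render inline.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; no display window required
import matplotlib.pyplot as plt

# Made-up data for five houses.
square_feet = [800, 1000, 1200, 1500, 2000]
prices = [135_000, 148_000, 172_000, 198_000, 252_000]

fig, ax = plt.subplots()
ax.scatter(square_feet, prices)       # one dot per house
ax.set_xlabel("Square footage")
ax.set_ylabel("Price")
ax.set_title("Made-up housing data")

fig.savefig("housing_scatter.png")    # write the chart to an image file
```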

Together, these libraries form the foundation of almost every machine learning project you will encounter.

Quick check

Which Python library would you use to load and clean a raw dataset before training a model?

What Comes Next

This lesson covered the big picture – what machine learning is, how models learn from training data, the difference between classification and regression, supervised vs. unsupervised learning, how a train/test split catches overfitting, and the tools used to build models in Python.

Each of those topics has a lot more depth to it. Future lessons will go deeper on specific algorithms, how to evaluate model performance, and how to build models from scratch. This is just the foundation.