Linear Regression Algorithm

Summary

  • Split data (train/test or CV).
  • Fit OLS (optionally Ridge/Lasso/ElasticNet).
  • Report R2, RMSE/MAE on test set.
  • Inspect residuals: mean ≈0 (train), histogram/KDE, residuals vs fitted, Q–Q.
  • Check multicollinearity: VIF (<4–5 is fine).
  • Probe outliers/influence; consider interactions/nonlinearity if patterns remain.
In [3]:
! pip install pandas numpy matplotlib seaborn scikit-learn
Out [3]:
Collecting pandas
  Using cached pandas-2.3.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (91 kB)
Requirement already satisfied: numpy in /home/san/miniconda/envs/mac-linux/lib/python3.13/site-packages (2.3.3)
Collecting matplotlib
  Using cached matplotlib-3.10.6-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (11 kB)
Collecting seaborn
  Using cached seaborn-0.13.2-py3-none-any.whl.metadata (5.4 kB)
Collecting scikit-learn
  Downloading scikit_learn-1.7.2-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (11 kB)
Requirement already satisfied: python-dateutil>=2.8.2 in /home/san/miniconda/envs/mac-linux/lib/python3.13/site-packages (from pandas) (2.9.0.post0)
Collecting pytz>=2020.1 (from pandas)
  Downloading pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting contourpy>=1.0.1 (from matplotlib)
  Downloading contourpy-1.3.3-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.5 kB)
Collecting cycler>=0.10 (from matplotlib)
  Downloading cycler-0.12.1-py3-none-any.whl.metadata (3.8 kB)
Collecting fonttools>=4.22.0 (from matplotlib)
  Downloading fonttools-4.60.0-cp313-cp313-manylinux1_x86_64.manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_5_x86_64.whl.metadata (111 kB)
Collecting kiwisolver>=1.3.1 (from matplotlib)
  Downloading kiwisolver-1.4.9-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (6.3 kB)
Requirement already satisfied: packaging>=20.0 in /home/san/miniconda/envs/mac-linux/lib/python3.13/site-packages (from matplotlib) (25.0)
Requirement already satisfied: pillow>=8 in /home/san/miniconda/envs/mac-linux/lib/python3.13/site-packages (from matplotlib) (11.3.0)
Collecting pyparsing>=2.3.1 (from matplotlib)
  Downloading pyparsing-3.2.5-py3-none-any.whl.metadata (5.0 kB)
Collecting scipy>=1.8.0 (from scikit-learn)
  Downloading scipy-1.16.2-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (62 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Downloading joblib-1.5.2-py3-none-any.whl.metadata (5.6 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Downloading threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Requirement already satisfied: six>=1.5 in /home/san/miniconda/envs/mac-linux/lib/python3.13/site-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)
Downloading pandas-2.3.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)
Downloading matplotlib-3.10.6-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (8.7 MB)
Downloading seaborn-0.13.2-py3-none-any.whl (294 kB)
Downloading scikit_learn-1.7.2-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (9.4 MB)
Downloading contourpy-1.3.3-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (362 kB)
Downloading cycler-0.12.1-py3-none-any.whl (8.3 kB)
Downloading fonttools-4.60.0-cp313-cp313-manylinux1_x86_64.manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_5_x86_64.whl (4.9 MB)
Downloading joblib-1.5.2-py3-none-any.whl (308 kB)
Downloading kiwisolver-1.4.9-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.5 MB)
Downloading pyparsing-3.2.5-py3-none-any.whl (113 kB)
Downloading pytz-2025.2-py2.py3-none-any.whl (509 kB)
Downloading scipy-1.16.2-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (35.7 MB)
Downloading threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Downloading tzdata-2025.2-py2.py3-none-any.whl (347 kB)
Installing collected packages: pytz, tzdata, threadpoolctl, scipy, pyparsing, kiwisolver, joblib, fonttools, cycler, contourpy, scikit-learn, pandas, matplotlib, seaborn
Successfully installed contourpy-1.3.3 cycler-0.12.1 fonttools-4.60.0 joblib-1.5.2 kiwisolver-1.4.9 matplotlib-3.10.6 pandas-2.3.2 pyparsing-3.2.5 pytz-2025.2 scikit-learn-1.7.2 scipy-1.16.2 seaborn-0.13.2 threadpoolctl-3.6.0 tzdata-2025.2
In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.model_selection as ms
from sklearn.linear_model import LinearRegression
%matplotlib inline

1. Dataset

  • Data source: Advertising.csv
    • Features (inputs): TV, radio, newspaper (advertising budgets in thousands of dollars).
    • Target (output): sales (product sales in thousands of units).

  • Purpose: We want to build a model that can predict sales based on how much is spent on different types of advertising.
    In [6]:
    df = pd.read_csv("Advertising.csv")
    df
    
    Out [6]:
            TV  radio  newspaper  sales
    1    230.1   37.8       69.2   22.1
    2     44.5   39.3       45.1   10.4
    3     17.2   45.9       69.3    9.3
    4    151.5   41.3       58.5   18.5
    5    180.8   10.8       58.4   12.9
    ..     ...    ...        ...    ...
    196   38.2    3.7       13.8    7.6
    197   94.2    4.9        8.1    9.7
    198  177.0    9.3        6.4   12.8
    199  283.6   42.0       66.2   25.5
    200  232.1    8.6        8.7   13.4
    
    [200 rows x 4 columns]
    In [7]:
    df.head()
    
    Out [7]:
          TV  radio  newspaper  sales
    1  230.1   37.8       69.2   22.1
    2   44.5   39.3       45.1   10.4
    3   17.2   45.9       69.3    9.3
    4  151.5   41.3       58.5   18.5
    5  180.8   10.8       58.4   12.9

    Input: TV, radio, newspaper

    Output: sales

    In [8]:
    X = df.drop(columns=["sales"])
    y = df['sales']
    y.head()
    
    Out [8]:
    1    22.1
    2    10.4
    3     9.3
    4    18.5
    5    12.9
    Name: sales, dtype: float64
    In [9]:
    df.shape
    
    Out [9]:
    (200, 4)

    2. Sampling / Data Splitting

    In ML, we never train and evaluate on the same data.

    The code uses:

    • Training set (70%) → used to fit the model.
    • Test set (30%) → kept aside to evaluate final model performance.
    • random_state=4 ensures reproducibility.
    In [10]:
    X_train, X_test, y_train, y_test = ms.train_test_split(X, y, random_state=4, test_size=0.3)
    X_train.shape, X_test.shape
    
    Out [10]:
    ((140, 3), (60, 3))
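
    As an aside, k-fold cross-validation is an alternative to a single train/test split (the Summary mentions both). A minimal sketch, assuming 5 folds and R² scoring:

    # Optional: estimate generalization with 5-fold cross-validation instead of one split.
    # cross_val_score clones and fits a fresh LinearRegression on each fold.
    cv_scores = ms.cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
    print(f"CV R2: mean={cv_scores.mean():.3f}, std={cv_scores.std():.3f}")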

    3. Training the ML Model

    • The training data goes through Linear Regression, a supervised learning algorithm:
      • The algorithm finds the best-fit coefficients (lr.coef_) and an intercept (lr.intercept_) to minimize the Mean Squared Error on training data.

  • This produces a trained ML model, i.e., a mathematical equation:
    • `sales ≈ β0 + β1·TV + β2·radio + β3·newspaper`

    In [11]:
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    
    Out [11]:
    LinearRegression()
    In [12]:
    lr.coef_
    
    Out [12]:
    array([0.04533392, 0.17715767, 0.00553852])
    In [13]:
    lr.intercept_
    
    Out [13]:
    np.float64(3.0906443641251116)
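
    Putting the learned parameters together gives the fitted equation from Section 3. A quick sketch (just string formatting over lr.coef_ and lr.intercept_):

    # Print the fitted equation: sales ≈ β0 + β1·TV + β2·radio + β3·newspaper
    terms = " + ".join(f"{coef:.4f}·{name}" for coef, name in zip(lr.coef_, X_train.columns))
    print(f"sales ≈ {lr.intercept_:.4f} + {terms}")
    # e.g. sales ≈ 3.0906 + 0.0453·TV + 0.1772·radio + 0.0055·newspaper
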
    In [14]:
    X_test[:5]
    
    Out [14]:
            TV  radio  newspaper
    12   214.7   24.0        4.0
    100  135.2   41.7       45.9
    129  220.3   49.0        3.2
    176  276.9   48.9       41.8
    2     44.5   39.3       45.1

    4. Prediction

    With the trained model, you make predictions on unseen test data:

    • Inputs: the advertising budgets in X_test.
    • Outputs: predicted sales y_pred_test.
    • This step checks how well the model generalizes beyond the training examples.

    In [15]:
    y_pred_test = lr.predict(X_test)
    
    In [16]:
    y_pred_test
    
    Out [16]:
    array([17.09777441, 16.86148273, 21.7761552 , 24.53812602, 12.32008724,
           20.9083371 , 10.09857022, 12.38326275, 12.01767713, 16.939182  ,
           22.8039863 , 15.18438755,  8.45988902, 14.86958246, 18.5360064 ,
           18.5391287 , 18.67614759, 14.5131948 ,  9.74192603, 20.57441729,
           14.89243548, 17.70102466, 23.04193509,  7.718106  , 21.42075683,
           19.62792915, 13.79733409, 11.63827129, 18.42655599, 12.19463761,
           11.40777433,  9.96083851, 13.49436555, 15.18912412, 18.09830593,
            6.92815403, 14.21050378, 14.27217178, 14.10451207, 10.77723418,
           14.82936609, 11.96507878, 10.69651541, 10.40851022,  6.21159776,
           16.82112487, 16.4659736 , 12.07626636, 10.75281419, 20.84690688,
           15.09266208, 15.39499668, 13.27964587, 14.1081843 , 12.38574411,
            9.03289588, 20.48246311,  9.3517132 ,  5.36311192,  8.24848891])

    Evaluation

    • MSE (Mean Squared Error) – average of squared prediction errors.
    • RMSE (Root Mean Squared Error) – interpretable in the same units as sales.
    • MAE (Mean Absolute Error) – average of absolute prediction errors.
    • R² score – fraction of variance in sales explained by the model (1.0 is perfect fit).
    • These metrics show how closely predictions match actual sales; their formulas are given below.
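
    For reference, the formulas behind the metrics listed above, with n test observations, actual values y and predictions ŷ:

    $$ \text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \text{RMSE} = \sqrt{\text{MSE}}, \qquad \text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert $$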

    In [17]:
    from sklearn.metrics import mean_squared_error, mean_absolute_error, root_mean_squared_error, r2_score
    
    In [18]:
    print(f"MSE is {mean_squared_error(y_test, y_pred_test)}")
    print(f"RMSE is {root_mean_squared_error(y_test, y_pred_test)}")
    print(f"MAE is {mean_absolute_error(y_test, y_pred_test)}")
    
    Out [18]:
    MSE is 2.2237746679115578
    RMSE is 1.4912326002041256
    MAE is 1.2111525124980485
    

    R2 Score

    How much of the variance in sales is explained by the features (formula below).
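
    R² compares the model's squared error against a baseline that always predicts the mean ȳ of the actual values:

    $$ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} $$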

    In [19]:
    print(f"R2-score is {r2_score(y_test, y_pred_test)}")
    
    Out [19]:
    R2-score is 0.9127220729596172
    

    Dataset
       │
       ├─► Sampling → Training Set ─► Train ML Model ─► Trained Model
       │                                 │
       │                                 ▼
       │                         Prediction on Test Set
       │                                 │
       └─► Test Set ─────────────────────┘
                                          │
                                          ▼
                                 Evaluation (MSE, RMSE, MAE, R²)

    Regularization

    Why Regularization?

    Linear regression tries to find coefficients (weights) that minimize error on training data.

    • Problem: If features are highly correlated or you have many of them, coefficients can grow very large (overfitting).
    • Solution: Regularization adds a penalty on large coefficients so the model is simpler and generalizes better.
    • Mathematically, we modify the cost (loss) function:

      Ordinary Least Squares (OLS):
         Minimize   Σ(y - ŷ)²
      
      Regularized Regression:
         Minimize   Σ(y - ŷ)² + penalty( coefficients )

      Different penalties → different regularization methods.
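
    To see the shrinkage effect concretely, here is a minimal sketch comparing coefficients across OLS, Ridge, and Lasso (the alpha values are arbitrary, untuned choices for illustration):

    # Compare how the penalty shrinks coefficients (larger alpha → stronger shrinkage;
    # Lasso can drive some coefficients exactly to zero).
    from sklearn.linear_model import Ridge, Lasso
    coef_table = pd.DataFrame(
        {
            "OLS": LinearRegression().fit(X_train, y_train).coef_,
            "Ridge(alpha=10)": Ridge(alpha=10).fit(X_train, y_train).coef_,
            "Lasso(alpha=1)": Lasso(alpha=1).fit(X_train, y_train).coef_,
        },
        index=X_train.columns,
    )
    print(coef_table)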

    L1 (Lasso) Regularization

    Idea

    Add a penalty proportional to the absolute value of coefficients:

    $$ \min_{w} \left[ \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i \bigr)^2 + \alpha \sum_{j=1}^{p} \lvert w_j \rvert \right] $$

    • α (alpha) controls the strength of the penalty.
    • Encourages some coefficients to shrink to exactly zero → feature selection.
    • Effect:
      • Sparsity: keeps only the most important features.
      • Good when you suspect only a few predictors are truly relevant.

    In [20]:
    from sklearn.linear_model import Lasso
    las = Lasso()
    las.fit(X_train, y_train)
    y_pred_test = las.predict(X_test)
    print(f"R2 score is {r2_score(y_test, y_pred_test)}")
    
    Out [20]:
    R2 score is 0.9112250903194142
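
    Lasso() above uses its default alpha=1.0. In practice α is usually tuned; a hedged sketch using LassoCV with 5-fold cross-validation (the fold count and random_state are illustrative choices):

    # Pick alpha by cross-validation rather than relying on the default.
    from sklearn.linear_model import LassoCV
    las_cv = LassoCV(cv=5, random_state=4).fit(X_train, y_train)
    print(f"chosen alpha: {las_cv.alpha_:.4f}")
    print(f"R2 score is {r2_score(y_test, las_cv.predict(X_test))}")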
    

    L2 (Ridge) Regularization

    Add a penalty proportional to the square of the coefficients:

    $$ \min_{w} \left[ \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i \bigr)^2 + \alpha \sum_{j=1}^{p} w_j^2 \right] $$

    • Coefficients are shrunk towards zero, but usually not exactly zero.
    • Handles multicollinearity well.
    • Effect: All features remain in the model, but their influence is reduced.

    In [21]:
    from sklearn.linear_model import Ridge
    ridge = Ridge()
    ridge.fit(X_train, y_train)
    y_pred_test = ridge.predict(X_test)
    print(f"R2 score is {r2_score(y_test, y_pred_test)}")
    
    Out [21]:
    R2 score is 0.9127183051915455
    

    ElasticNet Regularization (L1+L2)

    Combine Lasso and Ridge penalties:

    $$ \min_{w} \left[ \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i \bigr)^2 + \alpha \Bigl( \rho \sum_{j=1}^{p} \lvert w_j \rvert + (1 - \rho) \sum_{j=1}^{p} w_j^2 \Bigr) \right] $$

    Effect

    • Keeps the model balanced:
      • Can zero out some coefficients (L1 effect).
      • Shrinks the remaining ones smoothly (L2 effect).
    • Works well when you have many correlated features.

    In [22]:
    from sklearn.linear_model import ElasticNet
    elasticnet = ElasticNet()
    elasticnet.fit(X_train, y_train)
    y_pred_test = elasticnet.predict(X_test)
    print(f"R2 score is {r2_score(y_test, y_pred_test)}")
    
    Out [22]:
    R2 score is 0.9117469302155123
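
    The mixing weight ρ in the formula corresponds (up to scaling constants) to scikit-learn's l1_ratio parameter; the run above used the defaults alpha=1.0, l1_ratio=0.5. A sketch with explicitly chosen, untuned values:

    # ElasticNet with an explicit penalty strength and L1/L2 mix (values are illustrative).
    en = ElasticNet(alpha=0.5, l1_ratio=0.7)
    en.fit(X_train, y_train)
    print(f"R2 score is {r2_score(y_test, en.predict(X_test))}")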
    

    Assumption of the Linear Relationship

    Linear relationship (pairplot)

    • What it shows: Simple (bivariate) relationships between each feature and sales.
    • Typical patterns on this dataset:
      • TV vs sales: points roughly align along an upward line → clear linear trend.
      • radio vs sales: still upward but more scatter → weaker linear trend.
      • newspaper vs sales: diffuse cloud → little/no linear trend.

  • Caveat: Pairplots are bivariate; a feature that looks weak alone can still help in combination with others (interactions, shared variance).
    In [23]:
    ## Linear Relationship between the feature and the target
    df.head()
    
    Out [23]:
          TV  radio  newspaper  sales
    1  230.1   37.8       69.2   22.1
    2   44.5   39.3       45.1   10.4
    3   17.2   45.9       69.3    9.3
    4  151.5   41.3       58.5   18.5
    5  180.8   10.8       58.4   12.9
    In [24]:
    import seaborn as sns
    sns.pairplot(df, x_vars=['TV','radio','newspaper'], y_vars=['sales'])
    plt.show()
    
    Out [24]:
    [Output image: pairplot of sales against TV, radio, and newspaper]

    Correlation matrix (Pearson r)

    • Interpretation: Strength and direction of linear association.
    • TV–sales ≈ 0.78, radio–sales ≈ 0.58, newspaper–sales ≈ 0.23
      • TV has the strongest positive correlation with sales.
      • Radio is moderate.
      • Newspaper is weak.

    Note: Correlation ≠ causation, and it’s still bivariate. In multiple regression, a feature with low marginal correlation can gain/lose importance when others are present.

    In [25]:
    corr = df.corr()
    corr.head()
    
    Out [25]:
                     TV     radio  newspaper     sales
    TV         1.000000  0.054809   0.056648  0.782224
    radio      0.054809  1.000000   0.354104  0.576223
    newspaper  0.056648  0.354104   1.000000  0.228299
    sales      0.782224  0.576223   0.228299  1.000000
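
    A heatmap is a quick way to visualize this matrix (a sketch; the color map is an arbitrary choice):

    # Annotated heatmap of the Pearson correlation matrix.
    sns.heatmap(df.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
    plt.show()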

    Residual mean (≈ 0)

    • We computed: mean(residual) ≈ -0.0756 on the test set.
    • In OLS with an intercept, the training residuals have mean exactly 0 (by construction).
    • On the test set, the mean residual is not guaranteed to be zero; a small value (like −0.076 in units of sales) is normal and typically negligible.
    • What are “errors” (residuals)?

      In regression, a residual is the difference between what really happened and what your model predicted. It tells you how far off the prediction was for each test observation.

      Why plot their distribution?

      We plot a histogram (optionally with a KDE curve) to see the overall pattern of these errors.

      • For a well-behaved linear regression:
        • Centered at 0: predictions are not systematically too high or too low.
        • Symmetric, bell-shaped: errors of a given size are equally likely to be positive or negative.
        • No long tails or spikes: no unusually large misses (outliers) and the variance is constant.
        • A classic “good” histogram looks like a normal (Gaussian) bell curve centered at 0.

    In [26]:
    # Mean of the test-set residuals (should be close to 0)
    # Note: y_pred_test holds predictions from the most recently fitted model above
    residual = (y_test - y_pred_test)
    np.mean(residual)
    
    Out [26]:
    np.float64(-0.07559702903921825)

    Distribution of errors (histogram + KDE)

    We’re checking the normality of residuals. For linear regression:

    • Prediction accuracy does not require normal residuals.
    • Inference (valid t-tests, CIs) benefits from approximate normality.
    • What to look for: bell shape, centered near 0, no heavy tails.
    • Follow-ups (more rigorous): Q–Q plot for normality, and residuals vs fitted plot for homoscedasticity (constant variance). A “megaphone” shape suggests heteroscedasticity. A sketch of both plots follows the histogram below.
    In [27]:
    # Normal Distribution of error terms
    sns.histplot(residual , kde=True)
    plt.show()
    
    Out [27]:
    [Output image: histogram of residuals with KDE overlay]
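
    Following up on the more rigorous checks mentioned above, a minimal sketch of the residuals-vs-fitted and Q–Q plots (scipy is assumed available; it is installed as a scikit-learn dependency):

    # Residual diagnostics: look for a flat, patternless band around zero (left)
    # and points hugging the reference line (right).
    import scipy.stats as stats
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].scatter(y_pred_test, residual, alpha=0.7)
    axes[0].axhline(0, color="red", linestyle="--")
    axes[0].set_xlabel("Fitted values")
    axes[0].set_ylabel("Residuals")
    axes[0].set_title("Residuals vs fitted")
    stats.probplot(residual, dist="norm", plot=axes[1])
    axes[1].set_title("Q–Q plot of residuals")
    plt.tight_layout()
    plt.show()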

    Multicollinearity (VIF)

    • TV ≈ 2.49, radio ≈ 3.29, newspaper ≈ 3.06
    • Interpretation: How much each coefficient’s variance is inflated by linear relationships with other features.
      • Common rules of thumb:
        • VIF > 5 (or > 10) is often flagged as problematic.
        • The threshold of 4 used here is simply a more conservative cutoff.

  • Conclusion: All VIFs are below 4–5, so no serious multicollinearity here.
  • Why it matters: High multicollinearity → unstable coefficients, large standard errors, counterintuitive signs. If it occurs, consider:
    • Removing/combining correlated features,
    • Ridge (L2) regularization,
    • PCA or domain-driven feature engineering.

    In [30]:
    ! pip install statsmodels
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    
    Out [30]:
    Collecting statsmodels
      Downloading statsmodels-0.14.5-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (9.5 kB)
    Requirement already satisfied: numpy<3,>=1.22.3 in /home/san/miniconda/envs/mac-linux/lib/python3.13/site-packages (from statsmodels) (2.3.3)
    Requirement already satisfied: scipy!=1.9.2,>=1.8 in /home/san/miniconda/envs/mac-linux/lib/python3.13/site-packages (from statsmodels) (1.16.2)
    Requirement already satisfied: pandas!=2.1.0,>=1.4 in /home/san/miniconda/envs/mac-linux/lib/python3.13/site-packages (from statsmodels) (2.3.2)
    Collecting patsy>=0.5.6 (from statsmodels)
      Downloading patsy-1.0.1-py2.py3-none-any.whl.metadata (3.3 kB)
    Requirement already satisfied: packaging>=21.3 in /home/san/miniconda/envs/mac-linux/lib/python3.13/site-packages (from statsmodels) (25.0)
    Requirement already satisfied: python-dateutil>=2.8.2 in /home/san/miniconda/envs/mac-linux/lib/python3.13/site-packages (from pandas!=2.1.0,>=1.4->statsmodels) (2.9.0.post0)
    Requirement already satisfied: pytz>=2020.1 in /home/san/miniconda/envs/mac-linux/lib/python3.13/site-packages (from pandas!=2.1.0,>=1.4->statsmodels) (2025.2)
    Requirement already satisfied: tzdata>=2022.7 in /home/san/miniconda/envs/mac-linux/lib/python3.13/site-packages (from pandas!=2.1.0,>=1.4->statsmodels) (2025.2)
    Requirement already satisfied: six>=1.5 in /home/san/miniconda/envs/mac-linux/lib/python3.13/site-packages (from python-dateutil>=2.8.2->pandas!=2.1.0,>=1.4->statsmodels) (1.17.0)
    Downloading statsmodels-0.14.5-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (10.5 MB)
    Downloading patsy-1.0.1-py2.py3-none-any.whl (232 kB)
    Installing collected packages: patsy, statsmodels
    Successfully installed patsy-1.0.1 statsmodels-0.14.5
    
    In [31]:
    def vif_score(X):
        # Compute the Variance Inflation Factor (VIF) for each column of X.
        X_arr = X.values
        vif = [variance_inflation_factor(X_arr, i) for i in range(X.shape[1])]
        vif_score = pd.DataFrame()
        vif_score['vif_score'] = vif
        vif_score['Features'] = X.columns
        return vif_score
    
    In [32]:
    vif_score(X)
    
    Out [32]:
       vif_score   Features
    0   2.486772         TV
    1   3.285462      radio
    2   3.055245  newspaper
    • If VIF > 4 → possible multicollinearity in the dataset.
    • If VIF < 4 → no serious multicollinearity in the dataset.
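
    One caveat: the VIFs above are computed on the raw feature matrix without an intercept column. variance_inflation_factor is usually applied to a design matrix that includes a constant term; a hedged sketch of that variant (the constant's own VIF is skipped):

    # VIF with an explicit intercept column via statsmodels' add_constant.
    import statsmodels.api as sm
    X_const = sm.add_constant(X)
    vif_with_const = pd.DataFrame({
        'Features': X.columns,
        'vif_score': [variance_inflation_factor(X_const.values, i)
                      for i in range(1, X_const.shape[1])],  # index 0 is the constant
    })
    print(vif_with_const)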

    © 2025 Sanjeeb KC. All rights reserved.