What is Ordinary Least Squares (OLS)? With an Example
Ordinary Least Squares (OLS) is a simple but popular technique we use for linear regression problems.
In this post, we'll delve into what it is in detail and answer some common questions surrounding its use.
What is Ordinary Least Squares and How Does It Relate to Linear Regression?
Ordinary Least Squares is a mathematical technique we use to estimate the parameters of a linear regression model.
In fact, OLS is the most common method for performing linear regression. We call it "ordinary" to distinguish it from variants such as weighted or generalized least squares: it weights every observation equally while minimizing the sum of squared differences between observed and predicted values.
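To make that definition concrete, here's the objective OLS solves, in standard textbook notation (X is the data matrix, y the target vector, and β the coefficient vector):

\hat{\beta} = \arg\min_{\beta} \lVert y - X\beta \rVert_2^2 = (X^{\top}X)^{-1}X^{\top}y \quad \text{(when } X^{\top}X \text{ is invertible)}

In words: OLS picks the coefficients that make the total squared prediction error as small as possible, and for linear models that minimizer has a convenient closed form.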
How to Calculate Ordinary Least Squares for Regression
Here’s a step-by-step process to calculate OLS for regression:
- Gather data with at least two variables (an independent and dependent variable)
- Create a scatter plot of the data points
- Fit a straight line (the regression line) that minimizes the sum of the squared vertical distances (residuals) between each data point and the line
- Calculate the slope and intercept of the regression line, which are the OLS estimates (see the sketch after this list)
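For simple linear regression, that last step has closed-form formulas: the slope is the sum of the products of the x- and y-deviations from their means, divided by the sum of squared x-deviations, and the intercept follows from the means. Here's a minimal NumPy sketch that computes both directly; the data values are invented purely for illustration:

import numpy as np
#invented illustrative data: x could be years of experience, y a salary in $1000s
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([39, 46, 52, 61, 67], dtype=float)
#closed-form OLS estimates for simple linear regression
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean
print(f'slope = {slope:.2f}, intercept = {intercept:.2f}')

Running this prints slope = 7.10 and intercept = 31.70, so the fitted line is y ≈ 31.7 + 7.1x.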
Common Questions About Ordinary Least Squares
What is the difference between ANOVA and OLS?
While both OLS and Analysis of Variance (ANOVA) are statistical techniques, they serve different purposes.
OLS is for linear regression and focuses on estimating the relationship between variables, whereas we use ANOVA for comparing the means of multiple groups or samples.
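To make the contrast tangible, here's a small sketch (assuming SciPy and scikit-learn are available; all numbers are invented):

import numpy as np
from scipy.stats import f_oneway
from sklearn.linear_model import LinearRegression
#ANOVA asks: do these three groups share the same mean?
group_a = np.array([5.1, 4.8, 5.4])
group_b = np.array([6.0, 6.3, 5.9])
group_c = np.array([7.1, 6.8, 7.3])
f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f'ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}')
#OLS asks: how does y change as x changes?
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])
reg = LinearRegression().fit(x, y)
print(f'OLS: slope = {reg.coef_[0]:.2f}, intercept = {reg.intercept_:.2f}')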
What is the difference between OLS and multiple regression?
OLS is a technique we can apply in both simple and multiple linear regression.
To clarify, simple linear regression has only one independent variable, while multiple regression has two or more independent variables.
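In equation form (standard notation, one feature versus p features), the two models look like this:

y = \beta_0 + \beta_1 x + \varepsilon \qquad \text{versus} \qquad y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon

Either way, OLS estimates the β coefficients with the same sum-of-squared-errors criterion.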
Practical Example with Code
In the following code snippet, we're going to download housing data with the Kaggle API, preprocess it, and train a linear regression model on it.
We'll then use Mean Squared Error (MSE) and the R2 score to evaluate the model.
import pandas as pd
import numpy as np
from kaggle.api.kaggle_api_extended import KaggleApi
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
#authenticate API connection with Kaggle
api = KaggleApi()
api.authenticate()
#download the housing dataset from https://www.kaggle.com/datasets/yasserh/housing-prices-dataset
api.dataset_download_file(
    'yasserh/housing-prices-dataset',
    file_name='Housing.csv',
    path='datasets'
)
#import the dataset and drop any rows that have missing values
df = pd.read_csv('datasets/Housing.csv')
df = df.dropna()
print(df.head())
#build the independent (feature) set: columns 1-4 hold the numeric features, and the yes/no columns get encoded as 0/1 codes
independent_df = df.iloc[:,1:5]
bool_categories = ['mainroad', 'guestroom', 'basement', 'prefarea', 'hotwaterheating', 'airconditioning']
for cat in bool_categories:
    independent_df[cat] = df[cat].astype('category').cat.codes
print(independent_df)
#silence the pandas chained-assignment warning - it doesn't affect the result, just cleans up the console output
pd.set_option('mode.chained_assignment', None)
dependent_df = df[['price']].copy()
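#log-transform the skewed price target so a handful of very expensive houses doesn't dominate the fit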
dependent_df['price_log'] = np.log(dependent_df['price'] + 1)
print(dependent_df)
X = independent_df
y = dependent_df['price_log']
#split dataset into training and testing partitions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
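#random_state pins the shuffle, so the split (and the metrics below) are reproducible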
#import the model and train it
model = LinearRegression()
model.fit(X_train, y_train)
#make predictions on the test data
y_pred = model.predict(X_test)
#evaluate the results using MSE and R2
#lower MSE indicates better performance
#higher R2 indicates better performance (1 means a perfect fit; it usually falls between 0 and 1 but can go negative for poor models)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')
print(f'R2 Score: {r2:.2f}')
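One caveat worth flagging: since the model was trained on log-transformed prices, the MSE above is in log units and isn't directly comparable to an MSE computed on raw prices. To report errors in the original currency, you'd undo the transform first (np.exp(y_pred) - 1, reversing the earlier np.log(price + 1)) before computing the metrics.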
Conclusion
In conclusion, the Ordinary Least Squares technique is popular because it's simple, easy to interpret, and often provides unbiased estimates of the model parameters.
However, it's essential to consider its limitations before reaching for it, as it isn't the best choice for every situation.