What Is Ridge Regression? An Example Using Python
Ridge regression is a regularization technique we can use to prevent overfitting in linear regression models.
By adding a penalty term (L2 regularization) to the loss function, it controls the model's complexity and improves its ability to generalize to unseen data.
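Concretely, ridge regression minimizes the usual sum of squared residuals plus alpha times the sum of squared coefficients. Here's a minimal NumPy sketch of that objective; the toy data, candidate coefficients, and alpha value are all made up for illustration:
import numpy as np
# hypothetical toy data: 5 samples, 2 features
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([3.0, 3.5, 7.0, 7.5, 10.0])
beta = np.array([1.0, 1.0])  # candidate coefficient vector
alpha = 1.0  # regularization strength
residuals = y - X @ beta
# ridge objective: squared error plus the L2 penalty on the coefficients
ridge_loss = np.sum(residuals ** 2) + alpha * np.sum(beta ** 2)
print(ridge_loss)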
In this article, we'll dive into how it works, explore its benefits, and compare it to related techniques.
Ridge Regression in Action: An Example with Python
The following code snippet demonstrates a simple example using a housing dataset. We'll download the dataset from Kaggle using their API and preprocess it for our model.
The dataset contains house prices along with various features of each house, such as whether it has a basement, air conditioning, and more.
Using ridge regression, we can build a model that predicts house prices from these features while avoiding overfitting.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from kaggle.api.kaggle_api_extended import KaggleApi
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
# authenticate the API connection with Kaggle
api = KaggleApi()
api.authenticate()
# download the housing dataset from https://www.kaggle.com/datasets/yasserh/housing-prices-dataset
api.dataset_download_file(
    'yasserh/housing-prices-dataset',
    file_name='Housing.csv',
    path='datasets'
)
# import the dataset and drop rows with missing values, if there are any
df = pd.read_csv('datasets/Housing.csv')
df = df.dropna()
print(df.head())
# split the dataset into independent (features) and dependent (target) values
# .copy() gives us standalone DataFrames, which avoids pandas' chained-assignment warning
independent_df = df.iloc[:, 1:5].copy()
bool_categories = ['mainroad', 'guestroom', 'basement', 'prefarea', 'hotwaterheating', 'airconditioning']
for cat in bool_categories:
    independent_df[cat] = df[cat].astype('category').cat.codes
print(independent_df)
dependent_df = df[['price']].copy()
# log-transform the skewed prices; np.log1p computes log(price + 1)
dependent_df['price_log'] = np.log1p(dependent_df['price'])
print(dependent_df)
X = independent_df
y = dependent_df['price_log']
# split the dataset into training and testing partitions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# create the ridge model and fit it to the training data; alpha controls the regularization strength
model = Ridge(alpha=100, solver='cholesky', tol=0.0001, random_state=42)
model.fit(X_train, y_train)
# make predictions on the test data
y_pred = model.predict(X_test)
# evaluate the results using MSE and R2
# lower MSE indicates better performance
# higher R2 indicates better performance (1.0 is a perfect fit; it can drop below 0 for poor models)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')
print(f'R2 Score: {r2:.2f}')
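Since we trained on the log-transformed prices, the model's predictions are on the log scale. To read them in the original price units, you can invert the transform with np.expm1 (a small follow-up, assuming the code above ran as-is):
# convert log-scale predictions back to original price units
y_pred_prices = np.expm1(y_pred)
print(y_pred_prices[:5])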
The Disadvantages
One disadvantage is that it doesn't perform feature selection, since it only shrinks coefficients toward zero without ever setting them exactly to zero.
Another is that choosing an appropriate regularization parameter (alpha) can be challenging, and it usually has to be tuned empirically.
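One common way to handle this is scikit-learn's RidgeCV, which selects the best alpha from a candidate list via cross-validation. Here's a minimal sketch, reusing X_train and y_train from the example above; the alpha grid is an arbitrary choice for illustration:
from sklearn.linear_model import RidgeCV
# let 5-fold cross-validation pick the best alpha from the candidates
cv_model = RidgeCV(alphas=[0.1, 1, 10, 100, 1000], cv=5)
cv_model.fit(X_train, y_train)
print(f'Best alpha: {cv_model.alpha_}')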
L1 vs. L2 Regularization
Ridge regression employs L2 regularization, which adds the squared values of the coefficients to the loss function.
In contrast, Lasso regression uses L1 regularization, which adds the absolute values of the coefficients instead; this difference is what allows lasso to set some coefficients exactly to zero.
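To see the difference in practice, here's a minimal sketch on synthetic data (the dataset and alpha value are arbitrary, chosen for illustration): lasso drives some coefficients exactly to zero, while ridge only shrinks them.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
# synthetic problem where only 3 of 10 features are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, noise=5, random_state=42)
ridge = Ridge(alpha=10).fit(X, y)
lasso = Lasso(alpha=10).fit(X, y)
print('Ridge coefficients:', ridge.coef_.round(2))  # all non-zero, just shrunk
print('Lasso coefficients:', lasso.coef_.round(2))  # several exactly zero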
Why Choose It Over OLS?
When dealing with multicollinearity or a high-dimensional dataset, it can outperform ordinary least squares (OLS) by mitigating overfitting and providing more stable coefficient estimates.
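The sketch below makes this concrete with two almost perfectly collinear features (the synthetic data is made up for illustration). OLS tends to produce large, unstable coefficients on the nearly duplicate columns, while ridge keeps them small and balanced:
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)  # nearly identical to x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=200)
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print('OLS coefficients:  ', ols.coef_)    # can be large with opposite signs
print('Ridge coefficients:', ridge.coef_)  # roughly split the true effect of 3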
Lasso vs. Ridge Regression: Which One to Choose?
While both offer regularization, the choice between them depends on the problem at hand.
Lasso is suitable for feature selection, while ridge regression works well for handling multicollinearity.
In Conclusion
Ridge regression is a powerful tool for preventing overfitting in linear regression models.
By understanding its intricacies, advantages, and limitations, you can make informed decisions about whether to use it in your machine learning projects.
I hope this article helped you gain a better understanding of ridge regression, and perhaps even inspired you to explore it further.