Modelling

Modeling in machine learning refers to the process of creating a mathematical or computational model that learns patterns from data to make predictions or decisions without being explicitly programmed to perform the task.

Steps:

Choose a model architecture (e.g., linear regression, decision tree, neural network).
Train the model on data (feed in input data together with the known outcomes so the model can learn the relationship between them).
Evaluate the model to see how well it performs.
Use the model to make predictions on new/unseen data.
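The four steps above can be sketched in a few lines of scikit-learn. This is only a minimal illustration: the toy data (an exact linear relationship y = 3x + 2) and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical toy data: y depends exactly on x (y = 3x + 2)
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 3 * X.ravel() + 2

# Hold back 25% of the rows so the model can be checked on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 1. Choose a model architecture
model = LinearRegression()

# 2. Train the model on the training data
model.fit(X_train, y_train)

# 3. Evaluate the model on the held-back rows
score = r2_score(y_test, model.predict(X_test))
print("R^2 on test set:", round(score, 3))

# 4. Use the model to predict for a new input
new_prediction = model.predict(np.array([[100.0]]))
print("Prediction for x = 100:", round(float(new_prediction[0]), 2))
```

Because the toy data is perfectly linear, the fit is essentially exact. Real data will rarely behave this way.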

Popular Python Modeling Libraries:

Library	Purpose	Key Features
scikit-learn	General ML	Wide variety of classical ML algorithms (classification, regression, clustering), easy API
XGBoost	Gradient Boosting	Fast and accurate gradient boosting implementation
LightGBM	Gradient Boosting	Fast, supports large datasets, native handling of categorical features
CatBoost	Gradient Boosting	Handles categorical features well automatically
TensorFlow	Deep Learning	Powerful, production-ready deep learning library
PyTorch	Deep Learning	Popular for research and development, flexible
Keras	Deep Learning	High-level API (now integrated with TensorFlow) for building deep learning models easily
Statsmodels	Statistical Modeling	Great for linear regression, time series analysis, econometrics

Interfacing between Pandas and Model Code

The interface between Pandas and machine learning model code refers to how data stored and manipulated using Pandas (such as DataFrames and Series) is connected to machine learning libraries such as Scikit-learn, TensorFlow, or PyTorch.

1. Data Preparation with Pandas

Pandas is widely used in the data preparation stage of the machine learning pipeline. This includes:

Loading data: Functions like pd.read_csv() and pd.read_excel() are used to load data into Pandas DataFrames from CSV or Excel files.
Cleaning data: Data cleaning involves handling missing values using methods like fillna() or dropna(), and removing duplicate records using drop_duplicates().
Transforming data: This step includes encoding categorical variables (e.g., with pd.get_dummies()), normalizing or scaling numerical features, and creating new features from existing columns.
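The cleaning and transforming steps above might look like the following sketch. The column names and values are hypothetical, and the DataFrame is built in memory instead of being loaded with pd.read_csv() so that the example runs without an external file.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one missing value and one duplicated row
raw = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 40.0],
    "city": ["Pune", "Delhi", "Pune", "Pune"],
    "income": [30000, 45000, 52000, 52000],
})

# Cleaning: drop exact duplicate rows, then fill the missing age with the mean
clean = raw.drop_duplicates().copy()
clean["age"] = clean["age"].fillna(clean["age"].mean())

# Transforming: one-hot encode the categorical column
encoded = pd.get_dummies(clean, columns=["city"])

# Transforming: min-max scale a numeric column into the range [0, 1]
income = encoded["income"]
encoded["income_scaled"] = (income - income.min()) / (income.max() - income.min())

print(encoded)
```

Filling with the mean is just one possible choice. The median, a constant, or dropping the row with dropna() may suit other datasets better.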

2. Splitting Features (X) and Target (y)

Before training a machine learning model, the dataset is typically divided into two parts:

X (features): These are the input variables that the model will use to learn patterns.
y (target): This is the output variable or label that the model is trained to predict.

This is usually done by selecting the appropriate columns from the DataFrame.
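As a small sketch of this column selection (the DataFrame and its column names are made up for illustration):

```python
import pandas as pd

# Hypothetical dataset: two feature columns and one target column
df = pd.DataFrame({
    "hours_studied": [2, 5, 1, 8],
    "attendance": [60, 85, 40, 95],
    "passed": [0, 1, 0, 1],
})

# Option 1: list the feature columns explicitly
X = df[["hours_studied", "attendance"]]

# Option 2 (equivalent here): drop the target and keep everything else
X_alt = df.drop(columns=["passed"])

# The target is a single column, so it is a Series
y = df["passed"]

print(type(X).__name__, X.shape)  # a DataFrame with 4 rows and 2 columns
print(type(y).__name__, y.shape)  # a Series with 4 values
```

Note the double brackets in option 1: they select a list of columns and return a DataFrame, while single brackets return a Series.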

3. Splitting Data Using train_test_split()

To evaluate a model's performance, the dataset must be divided into training and testing sets. This is done using the train_test_split() function from Scikit-learn.
This function randomly splits the data into a training set (used to train the model) and a testing set (used to evaluate the model's performance).
It is typically used like this:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Component	Type / Value	Description
X	DataFrame / Array	The input features (independent variables) used to train the model.
y	Series / Array	The target values (dependent variable) that the model should predict.
test_size=0.2	Float (0.0–1.0) or Int	Specifies the portion (or number) of data to be used for the test set. 0.2 = 20% test, 80% train.
random_state=42	Integer	Controls the random shuffling before splitting. Ensures reproducible results.

Variable	Content	Description
X_train	Training features	Subset of X used to train the model (e.g., 80% of rows).
X_test	Testing features	Subset of X used to evaluate the model (e.g., 20% of rows).
y_train	Training target values	Corresponding target values for X_train.
y_test	Testing target values	Corresponding target values for X_test.
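The behaviour described in the two tables can be checked with a short sketch. The data below is hypothetical. The stratify argument is an optional extra not covered above, shown here because it is often useful for classification targets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows with a balanced binary target
X = pd.DataFrame({"f1": np.arange(10), "f2": np.arange(10) * 2})
y = pd.Series([0, 1] * 5, name="label")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # 8 training rows and 2 test rows

# Features and targets stay aligned: both test pieces share the same row index
print(X_test.index.equals(y_test.index))

# Repeating the call with the same random_state reproduces the same split
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_test.index.equals(X_test2.index))

# stratify=y keeps the class proportions similar in both subsets
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_test_strat.value_counts().to_dict())
```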


4. Model Training and Fitting

After data preparation and splitting, the model is trained using the fit() function.

The fit() method in machine learning libraries such as Scikit-learn allows the model to learn the relationship between the input features (X_train) and the target variable (y_train).

During this stage, the algorithm analyzes patterns, relationships, or statistical dependencies in the data to determine model parameters (such as coefficients and intercepts in regression, or decision rules in tree-based models).

The syntax is: model.fit(X_train, y_train)
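To make "determining model parameters" concrete, the sketch below fits a linear regression on hypothetical data generated from y = 2*x1 + 3*x2 + 1 and then inspects what was learned. The data and variable names are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data generated from y = 2*x1 + 3*x2 + 1
X_train = np.array([[1, 0], [0, 1], [2, 1], [1, 3], [4, 2]], dtype=float)
y_train = 2 * X_train[:, 0] + 3 * X_train[:, 1] + 1

reg = LinearRegression()
reg.fit(X_train, y_train)

# After fit(), learned parameters are stored in attributes ending with "_"
print("Coefficients:", reg.coef_.round(3))
print("Intercept:", round(float(reg.intercept_), 3))

# A tree-based model learns decision rules instead of coefficients
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, (y_train > 8).astype(int))
print("Tree depth:", tree.get_depth())
```

In scikit-learn, attributes with a trailing underscore (such as coef_ and intercept_) exist only after fit() has been called.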

5. Making Predictions (predict() Function)
Once the model is trained, predictions are made using the predict() function.

This function uses the patterns learned during training to estimate the target variable for the test data (X_test).

The result is a set of predicted values (y_pred) that can be compared to the actual values (y_test) to assess accuracy.

The syntax is: 
y_pred = model.predict(X_test)
The model uses the trained parameters to compute output for unseen input data.
It transforms raw input features into expected outputs according to the learned mapping.
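A minimal sketch of prediction on unseen inputs, assuming a one-feature toy dataset invented for the example. It also shows predict_proba(), which classifiers such as LogisticRegression provide in addition to predict().

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: the label is 1 when the feature value is large
X_train = np.array([[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X_train, y_train)

# Unseen inputs must have the same number of columns as the training data
X_new = np.array([[0.5], [9.5]])

y_pred = model.predict(X_new)        # hard class labels
y_prob = model.predict_proba(X_new)  # one probability per class for each row

print("Predicted labels:", y_pred)
print("Class probabilities:\n", y_prob.round(3))
```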

6. Model Evaluation
After obtaining predictions, the model's performance must be evaluated using metrics suited to the type of problem (regression or classification).


For Regression Models: Common metrics include Mean Absolute Error (MAE), Mean Squared Error (MSE), and R² Score.


For Classification Models: Metrics include Accuracy, Precision, Recall, F1-Score, and Confusion Matrix.


Evaluation functions are available in the sklearn.metrics module, which allows users to compare predicted results with actual outcomes and determine how well the model performs.
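The metrics listed above can be computed as in the following sketch. The actual and predicted values are small hand-made lists chosen so that the results are easy to verify by hand.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
)

# Regression: hypothetical actual and predicted values
y_true_reg = [3.0, 5.0, 7.0, 9.0]
y_pred_reg = [2.0, 5.0, 8.0, 9.0]

mae = mean_absolute_error(y_true_reg, y_pred_reg)  # mean of |error|
mse = mean_squared_error(y_true_reg, y_pred_reg)   # mean of error squared
r2 = r2_score(y_true_reg, y_pred_reg)              # 1 - SS_res / SS_tot
print("MAE:", mae, "MSE:", mse, "R2:", round(r2, 3))

# Classification: hypothetical actual and predicted labels
y_true_clf = [1, 0, 1, 1, 0, 0]
y_pred_clf = [1, 0, 0, 1, 0, 1]

acc = accuracy_score(y_true_clf, y_pred_clf)
prec = precision_score(y_true_clf, y_pred_clf)
rec = recall_score(y_true_clf, y_pred_clf)
f1 = f1_score(y_true_clf, y_pred_clf)
cm = confusion_matrix(y_true_clf, y_pred_clf)  # rows = actual, columns = predicted

print("Accuracy:", round(acc, 3), "Precision:", round(prec, 3))
print("Recall:", round(rec, 3), "F1:", round(f1, 3))
print(cm)
```

Every function takes the actual values first and the predicted values second.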

CODE:

# -----------------------------------------------
#  Import Libraries 
# -----------------------------------------------
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt

# -----------------------------------------------
#  Create Dataset using Pandas 
# -----------------------------------------------
data = {
    'Student': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'Math': [78, 45, 62, 89, 56, 90, 48, 70, 80, 55],
    'Science': [72, 50, 65, 85, 60, 95, 52, 68, 88, 58],
    'English': [75, 48, 70, 92, 66, 85, 55, 74, 90, 60],
    'Grade': ['B', 'C', 'B', 'A', 'C', 'A', 'C', 'B', 'A', 'C'],
    'Passed': [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]  # 1=Pass, 0=Fail
}

df = pd.DataFrame(data)
print("--------------------------------------------------")
print(" Original DataFrame:\n")
print(df)

# -----------------------------------------------
#  Encode Categorical Data (Grade) 
# -----------------------------------------------
df['Grade_encoded'] = df['Grade'].map({'A': 3, 'B': 2, 'C': 1})

print("\n--------------------------------------------------")
print(" Data after Encoding:\n")
print(df)

# -----------------------------------------------
#  Define Features (X) and Target (y) 
# -----------------------------------------------
X = df[['Math', 'Science', 'English', 'Grade_encoded']]
y = df['Passed']

# -----------------------------------------------
#  Split Data into Training and Testing Sets 
# -----------------------------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# -----------------------------------------------
#  Train the Model 
# -----------------------------------------------
model = LogisticRegression()
model.fit(X_train, y_train)

# -----------------------------------------------
#  Make Predictions 
# -----------------------------------------------
y_pred = model.predict(X_test)

# -----------------------------------------------
#  Evaluate the Model 
# -----------------------------------------------
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print("\n--------------------------------------------------")
print(" Model Accuracy:", round(accuracy * 100, 2), "%")

print("\n--------------------------------------------------")
print(" Confusion Matrix:\n")
print(cm)

# -----------------------------------------------
#  Add Predictions Back to DataFrame 
# -----------------------------------------------
df.loc[X_test.index, 'Predicted'] = y_pred
print("\n--------------------------------------------------")
print(" DataFrame with Predictions:\n")
print(df)

# -----------------------------------------------
#  Display Testing Data used 
# -----------------------------------------------
print("\n--------------------------------------------------")
print(" Test Samples (X_test):\n")
print(X_test)

print("\n--------------------------------------------------")
print(" Actual vs Predicted on Test Set:\n")
print(pd.DataFrame({'Actual': y_test, 'Predicted': y_pred}, index=X_test.index))

# -----------------------------------------------
#  Optional: Visualization 
# -----------------------------------------------
plt.scatter(df['Math'], df['Passed'], color='green', label='Actual')
plt.scatter(df['Math'], df['Predicted'], color='red', label='Predicted', marker='x')
plt.xlabel("Math Marks")
plt.ylabel("Pass (1) / Fail (0)")
plt.title("Actual vs Predicted Results")
plt.legend()
plt.show()

Output:

--------------------------------------------------
 Original DataFrame:

  Student  Math  Science  English Grade  Passed
0       A    78       72       75     B       1
1       B    45       50       48     C       0
2       C    62       65       70     B       1
3       D    89       85       92     A       1
4       E    56       60       66     C       0
5       F    90       95       85     A       1
6       G    48       52       55     C       0
7       H    70       68       74     B       1
8       I    80       88       90     A       1
9       J    55       58       60     C       0

--------------------------------------------------
 Data after Encoding:

  Student  Math  Science  English Grade  Passed  Grade_encoded
0       A    78       72       75     B       1              2
1       B    45       50       48     C       0              1
2       C    62       65       70     B       1              2
3       D    89       85       92     A       1              3
4       E    56       60       66     C       0              1
5       F    90       95       85     A       1              3
6       G    48       52       55     C       0              1
7       H    70       68       74     B       1              2
8       I    80       88       90     A       1              3
9       J    55       58       60     C       0              1

--------------------------------------------------
 Model Accuracy: 100.0 %

--------------------------------------------------
 Confusion Matrix:
[[1 0]
 [0 2]]

--------------------------------------------------
 DataFrame with Predictions:

  Student  Math  Science  English Grade  Passed  Grade_encoded  Predicted
0       A    78       72       75     B       1              2        NaN
1       B    45       50       48     C       0              1        0.0
2       C    62       65       70     B       1              2        NaN
3       D    89       85       92     A       1              3        NaN
4       E    56       60       66     C       0              1        NaN
5       F    90       95       85     A       1              3        1.0
6       G    48       52       55     C       0              1        NaN
7       H    70       68       74     B       1              2        NaN
8       I    80       88       90     A       1              3        1.0
9       J    55       58       60     C       0              1        NaN

--------------------------------------------------
 Test Samples (X_test):

   Math  Science  English  Grade_encoded
8    80       88       90              3
1    45       50       48              1
5    90       95       85              3

--------------------------------------------------
 Actual vs Predicted on Test Set:

   Actual  Predicted
8       1          1
1       0          0
5       1          1


PATSY – Python Library for Statistical Modeling
Patsy is a Python library used to describe statistical models and prepare data for analysis using formulas.

It helps us easily connect data (variables) to models like linear regression, logistic regression, etc.
We can think of Patsy as a translator between your data and the StatsModels library.

Patsy is a library that converts formulas like

'y ~ x1 + x2'

into numeric data (matrices) that can be used in statistical models.

Patsy Syntax:

Basic syntax:
'y ~ x1 + x2'

~ → separates dependent variable (y) and independent variables (x1, x2)
+ → adds more independent variables
- → removes a variable or intercept
* → includes both variables and their interaction (e.g., x1*x2)
: → only includes interaction term
np.log(x1) → applies transformation
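One way to see what each operator does is to build the design matrix for a few formulas and look at the resulting column names. The sketch below assumes a small made-up DataFrame. Only the column names matter here, not the values.

```python
import numpy as np
import pandas as pd
import patsy

# Hypothetical data; x1 only needs to be positive so that np.log is defined
df = pd.DataFrame({
    "y": [1.0, 2.0, 3.0, 4.0],
    "x1": [1.0, 2.0, 4.0, 8.0],
    "x2": [3.0, 1.0, 4.0, 1.0],
})

formulas = [
    "y ~ x1 + x2",      # two predictors plus the implicit intercept
    "y ~ x1 + x2 - 1",  # same predictors, intercept removed
    "y ~ x1 * x2",      # x1, x2 and their interaction
    "y ~ x1:x2",        # interaction term only
    "y ~ np.log(x1)",   # transformed predictor
]

columns = {}
for formula in formulas:
    # dmatrices returns (response matrix, predictor matrix)
    _, X = patsy.dmatrices(formula, data=df, return_type="dataframe")
    columns[formula] = list(X.columns)
    print(f"{formula:<18} -> {columns[formula]}")
```

Patsy adds an Intercept column by default, which is why "- 1" is needed to remove it.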

Why is Patsy Used?




Reason	Explanation
Simple Formula Syntax	You can write models in a formula form instead of manually selecting columns.
Automatic Data Preparation	It automatically separates dependent (y) and independent (x) variables.
Works with StatsModels	Patsy is used along with the StatsModels library for model fitting.
Handles Categorical Data	It automatically converts text values into dummy (numeric) variables.
Supports Transformations	You can include mathematical functions like np.log(x1) or x1*x2 directly in formulas.
Reduces Manual Work	Saves time and reduces errors while preparing data for models.
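The categorical handling mentioned above can be illustrated with a short sketch. The sales and region columns are hypothetical. Patsy treats the text column as categorical and expands it into dummy columns, using the first level in sorted order as the reference.

```python
import pandas as pd
import patsy

# Hypothetical data with one text (categorical) column
df = pd.DataFrame({
    "sales": [10.0, 12.0, 20.0, 22.0, 30.0, 31.0],
    "region": ["east", "east", "north", "north", "west", "west"],
})

# No manual encoding: the formula alone is enough
y, X = patsy.dmatrices("sales ~ region", data=df, return_type="dataframe")

print(list(X.columns))
print(X)
```

The reference level ("east" here) gets no column of its own. Its effect is absorbed into the intercept, which avoids redundant dummy columns.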



Code:

import pandas as pd
import numpy as np
import statsmodels.api as sm
import patsy

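# Create sample data
# Note: in this toy data x1 is exactly 2 * x2 (perfect collinearity), so the
# individual coefficients of x1 and x2 are not uniquely determined.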
df = pd.DataFrame({
    'y': [1, 3, 5, 7, 9],
    'x1': [2, 4, 6, 8, 10],
    'x2': [1, 2, 3, 4, 5]
})

# Create formula
formula = 'y ~ x1 + x2'

# Create model matrices
y, X = patsy.dmatrices(formula, data=df, return_type='dataframe')

# Fit model
model = sm.OLS(y, X)
results = model.fit()

# Show results
print(results.summary())  

Output:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 2.278e+31
Date:                Fri, 10 Oct 2025   Prob (F-statistic):           4.39e-47
Time:                        20:10:00   Log-Likelihood:                 174.06
No. Observations:                   5   AIC:                            -344.1
Df Residuals:                       2   BIC:                            -345.5
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0000   3.06e-15   1.31e-13      1.000      -1e-13       1e-13
x1             1.0000   3.06e-16   3.27e+15      0.000       1.000       1.000
x2          1.776e-15   1.09e-15      1.632      0.249   -3.32e-15    6.87e-15
==============================================================================
Omnibus:                        0.000   Durbin-Watson:                   1.000
Prob(Omnibus):                  1.000   Jarque-Bera (JB):                0.000
Skew:                           0.000   Prob(JB):                        1.000
Kurtosis:                       3.000   Cond. No.                         23.1
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
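The example above builds the matrices with patsy.dmatrices() and passes them to sm.OLS. As a sketch of an alternative, statsmodels also accepts the formula string directly through statsmodels.formula.api and builds the design matrices internally. The data below is hypothetical and is chosen so that x1 and x2 are not multiples of each other, which keeps the coefficients uniquely identifiable.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data built from y = 1 + 2*x1 + 3*x2 (x1 and x2 not collinear)
df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
})
df["y"] = 1 + 2 * df["x1"] + 3 * df["x2"]

# The formula replaces the explicit dmatrices() call
results = smf.ols("y ~ x1 + x2", data=df).fit()

# Fitted coefficients, indexed by term name
print(results.params.round(3))
```

Since y is an exact linear function of x1 and x2 here, the fitted parameters should recover the generating coefficients up to floating-point error.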

Introduction to statsmodels:

Refer to the link below:

Statsmodels

Plotting
Matplotlib is a Python library for data visualization. It helps you create charts, graphs, and plots to better understand and present data. 

You can use it to draw:
Line charts
Bar charts
Pie charts
Scatter plots
Histograms
First, we have to import matplotlib.

import matplotlib.pyplot as plt

Here,
matplotlib → main library
pyplot → module that contains functions for creating plots
pyplot provides a state-based interface — it keeps track of the current figure and axes, so you can create and modify plots easily with simple function calls.
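The state-based behaviour can be seen by comparing it with the explicit, object-oriented style that Matplotlib also offers. This is a minimal sketch: the data is made up, and the "Agg" backend is selected only so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# State-based style: each call acts on the "current" figure and axes
plt.figure()
plt.plot(x, y)
plt.title("State-based interface")
current_axes = plt.gca()  # the axes that the calls above were drawing on

# Object-oriented style: the figure and axes are held in explicit variables
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title("Object-oriented interface")

print(current_axes.get_title())
print(ax.get_title())

plt.close("all")  # release the figures
```

The state-based style is convenient for quick, single plots. Keeping explicit fig and ax variables tends to be clearer once several plots are involved.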

Commonly Used pyplot Functions:

Function	Description
plt.plot()	Draws line plots
plt.scatter()	Creates scatter plots
plt.bar()	Creates bar charts
plt.hist()	Plots histograms
plt.pie()	Creates pie charts
plt.title()	Adds a title to the plot
plt.xlabel() / plt.ylabel()	Labels axes
plt.legend()	Adds a legend
plt.grid()	Adds a grid
plt.show()	Displays the plot

A grid is a set of horizontal and vertical lines drawn across the plot background to make reading values easier. 
plt.grid(True)
A legend is a box that explains what each line, color, or shape in your plot represents. It takes its text from the label= argument passed to each plotting call.
plt.legend()
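Putting several of the functions above together, a labelled line plot with a grid and a legend might be sketched as follows. The sales figures are invented, and the "Agg" backend is used only so that the code runs without a display. In an interactive session, plt.show() would display the figure.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

# Hypothetical monthly sales for two products
months = [1, 2, 3, 4, 5]
sales_a = [10, 12, 15, 14, 18]
sales_b = [8, 9, 13, 16, 17]

# label= supplies the text that the legend will show for each line
plt.plot(months, sales_a, marker="o", label="Product A")
plt.plot(months, sales_b, marker="s", label="Product B")

plt.title("Monthly Sales")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.grid(True)           # horizontal and vertical guide lines
legend = plt.legend()    # built from the label= values above

labels = [text.get_text() for text in legend.get_texts()]
print(labels)

plt.close("all")
```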

Refer to the following link for all plots:

Visualisation
