MODELLING AND VISUALISATION

 Modelling

Modeling in machine learning refers to the process of creating a mathematical or computational model that learns patterns from data to make predictions or decisions without being explicitly programmed to perform the task.

Steps:
  1. Choose a model architecture (e.g., linear regression, decision tree, neural network).
  2. Train the model on data (feed in input data and let the model learn from the outcomes).
  3. Evaluate the model to see how well it performs.
  4. Use the model to make predictions on new/unseen data.
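
The four steps above can be sketched in a few lines. This is only an illustration under assumptions: it uses scikit-learn's built-in iris dataset and a decision tree, neither of which is prescribed by the steps themselves, and the `new_flower` measurements are made up.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small built-in dataset (features X, labels y)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 1. Choose a model architecture
model = DecisionTreeClassifier(random_state=42)

# 2. Train the model on the training data
model.fit(X_train, y_train)

# 3. Evaluate the model on rows it has not seen during training
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print("Test accuracy:", round(test_accuracy, 2))

# 4. Use the model to predict a new sample
new_flower = [[5.1, 3.5, 1.4, 0.2]]  # hypothetical measurements in cm
print("Predicted class:", model.predict(new_flower)[0])
```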

Popular Python Modeling Libraries:

Library | Purpose | Key Features
scikit-learn | General ML | Wide variety of classical ML algorithms (classification, regression, clustering), easy API
XGBoost | Gradient Boosting | Fast and accurate gradient boosting implementation
LightGBM | Gradient Boosting | Fast, supports large datasets, better performance on categorical features
CatBoost | Gradient Boosting | Handles categorical features well automatically
TensorFlow | Deep Learning | Powerful, production-ready deep learning library
PyTorch | Deep Learning | Popular for research and development, flexible
Keras | Deep Learning | High-level API (now integrated with TensorFlow) for building deep learning models easily
Statsmodels | Statistical Modeling | Great for linear regression, time series analysis, econometrics


Interfacing between Pandas and Model Code

The interface between Pandas and machine learning model code refers to how data stored and manipulated in Pandas objects (DataFrames and Series) is passed to machine learning libraries such as Scikit-learn, TensorFlow, or PyTorch.

1. Data Preparation with Pandas

Pandas is widely used in the data preparation stage of the machine learning pipeline. This includes:
  • Loading data: Functions like pd.read_csv() and pd.read_excel() are used to load data into Pandas DataFrames from CSV or Excel files.
  • Cleaning data: Data cleaning involves handling missing values using methods like fillna() or dropna(), and removing duplicate records using drop_duplicates().
  • Transforming data: This step includes encoding categorical variables (e.g., with pd.get_dummies()), normalizing or scaling numerical features, and creating new features from existing columns.
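
The three stages above might look like the following sketch. The CSV text, column names, and the choice of mean/median filling are assumptions made for illustration; a real project would read from a file path and pick a filling strategy that suits the data.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file on disk
csv_text = """name,city,age,income
Asha,Delhi,25,50000
Ravi,Mumbai,,62000
Asha,Delhi,25,50000
Meena,Chennai,31,
"""

# Loading: read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# Cleaning: drop exact duplicate rows, then fill missing numbers
df = df.drop_duplicates()
df["age"] = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].median())

# Transforming: one-hot encode the categorical column
df = pd.get_dummies(df, columns=["city"])

# Transforming: min-max scale a numeric column to the range [0, 1]
income_range = df["income"].max() - df["income"].min()
df["income_scaled"] = (df["income"] - df["income"].min()) / income_range

# Transforming: create a new feature from existing columns
df["income_per_year_of_age"] = df["income"] / df["age"]

print(df)
```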

2. Splitting Features (X) and Target (y)

Before training a machine learning model, the dataset is typically divided into two parts:
  • X (features): These are the input variables that the model will use to learn patterns.
  • y (target): This is the output variable or label that the model is trained to predict.
This is usually done by selecting the appropriate columns from the DataFrame.
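
A minimal sketch of that column selection, assuming a hypothetical DataFrame whose target column is named `passed`:

```python
import pandas as pd

# Hypothetical dataset: two feature columns and one target column
df = pd.DataFrame({
    "hours_studied": [2, 4, 6, 8],
    "attendance": [60, 70, 85, 95],
    "passed": [0, 0, 1, 1],
})

# X: every column except the target (stays a DataFrame)
X = df.drop(columns=["passed"])

# y: the target column alone (a Series)
y = df["passed"]

print(X.shape)  # (rows, feature columns)
print(y.shape)  # (rows,)
```

Dropping the target by name keeps X correct even if more feature columns are added later; listing the feature columns explicitly, as in `df[['hours_studied', 'attendance']]`, works equally well.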

3. Splitting Data Using train_test_split()

To evaluate a model's performance, the dataset must be divided into training and testing sets. This is done using the train_test_split() function from Scikit-learn.
  • This function randomly splits the data into a training set (used to train the model) and a testing set (used to evaluate the model's performance).
  • It is typically used like this:  
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Component | Type / Value | Description
X | DataFrame / Array | The input features (independent variables) used to train the model.
y | Series / Array | The target values (dependent variable) that the model should predict.
test_size=0.2 | Float (0.0–1.0) or Int | Specifies the portion (or number) of data to be used for the test set. 0.2 = 20% test, 80% train.
random_state=42 | Integer | Controls the random shuffling before splitting. Ensures reproducible results.


Variable | Content | Description
X_train | Training features | Subset of X used to train the model (e.g., 80% of rows).
X_test | Testing features | Subset of X used to evaluate the model (e.g., 20% of rows).
y_train | Training target values | Corresponding target values for X_train.
y_test | Testing target values | Corresponding target values for X_test.
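
The behaviour of test_size and random_state described in the tables can be checked on a toy dataset. The data below is made up, and the extra variable names (`X_test3`, `again`) exist only for this sketch.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, one feature, a binary target
X = pd.DataFrame({"score": range(10)})
y = pd.Series([0, 1] * 5, name="label")

# test_size as a float: a fraction of the rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # 8 training rows, 2 test rows

# test_size as an int: an absolute number of rows
X_train3, X_test3, y_train3, y_test3 = train_test_split(
    X, y, test_size=3, random_state=42
)
print(len(X_test3))  # 3

# Row labels are preserved, so features and targets stay aligned
print(X_test.index.equals(y_test.index))  # True

# The same random_state reproduces the same split
again = train_test_split(X, y, test_size=0.2, random_state=42)
print(again[1].index.equals(X_test.index))  # True
```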


4. Model Training and Fitting

After data preparation and splitting, the model is trained using the fit() function.
The fit() method in machine learning libraries such as Scikit-learn allows the model to learn the relationship between the input features (X_train) and the target variable (y_train).
During this stage, the algorithm analyzes patterns, relationships, or statistical dependencies in the data to determine model parameters (such as coefficients and intercepts in regression, or decision rules in tree-based models).

The syntax is: model.fit(X_train, y_train)
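
To see what "determining model parameters" means in practice, here is a small sketch with linear regression. The training data is an assumption: it is generated from y = 2*x + 1, so the learned coefficient and intercept should come out close to 2 and 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data generated from y = 2*x + 1
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression()
model.fit(X_train, y_train)  # learns the slope and the intercept

# The learned parameters are stored on the fitted model
print("coefficient:", model.coef_[0])
print("intercept:", model.intercept_)
```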

5. Making Predictions (predict() Function)

Once the model is trained, predictions are made using the predict() function.
This function uses the patterns learned during training to estimate the target variable for the test data (X_test).
The result is a set of predicted values (y_pred) that can be compared to the actual values (y_test) to assess accuracy.

The syntax is: 
y_pred = model.predict(X_test)
  • The model uses the trained parameters to compute output for unseen input data.
  • It transforms raw input features into expected outputs according to the learned mapping.
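
A short sketch of predict() on inputs the model has never seen. The hours-studied data is invented for illustration, and predict_proba() is shown as an extra that scikit-learn classifiers such as LogisticRegression also provide.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied -> failed (0) or passed (1)
X_train = np.array([[1], [2], [3], [7], [8], [9]])
y_train = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X_train, y_train)

# Unseen inputs must have the same number of columns as X_train
X_new = np.array([[0], [10]])
y_pred = model.predict(X_new)
print(y_pred)  # one predicted label per input row

# Class probabilities for the same rows
proba = model.predict_proba(X_new)
print(proba.shape)  # (number of rows, number of classes)
```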

6. Model Evaluation

After obtaining predictions, the model's performance must be evaluated using metrics suited to the type of problem (regression or classification).

  • For Regression Models: Common metrics include Mean Absolute Error (MAE), Mean Squared Error (MSE), and R² Score.
  • For Classification Models: Metrics include Accuracy, Precision, Recall, F1-Score, and Confusion Matrix.
Evaluation functions are available in the sklearn.metrics module, which allows users to compare predicted results with actual outcomes and determine how well the model performs.
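
The metrics listed above can be computed as in the sketch below. The actual and predicted values are small made-up lists, chosen so the results are easy to verify by hand.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
)

# Regression: hypothetical actual and predicted values
y_true_reg = [3, 5, 7, 9]
y_pred_reg = [2, 5, 8, 9]
mae = mean_absolute_error(y_true_reg, y_pred_reg)  # (1 + 0 + 1 + 0) / 4
mse = mean_squared_error(y_true_reg, y_pred_reg)   # (1 + 0 + 1 + 0) / 4
r2 = r2_score(y_true_reg, y_pred_reg)              # 1 - 2 / 20
print("MAE:", mae, "MSE:", mse, "R2:", r2)

# Classification: hypothetical actual and predicted labels
y_true_clf = [1, 0, 1, 1, 0, 1]
y_pred_clf = [1, 0, 0, 1, 1, 1]
accuracy = accuracy_score(y_true_clf, y_pred_clf)    # 4 of 6 correct
precision = precision_score(y_true_clf, y_pred_clf)  # 3 of 4 predicted positives
recall = recall_score(y_true_clf, y_pred_clf)        # 3 of 4 actual positives
f1 = f1_score(y_true_clf, y_pred_clf)
cm = confusion_matrix(y_true_clf, y_pred_clf)        # rows: actual, columns: predicted
print("Accuracy:", accuracy, "Precision:", precision)
print("Recall:", recall, "F1:", f1)
print(cm)
```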

CODE:

# -----------------------------------------------
#  Import Libraries 
# -----------------------------------------------
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt

# -----------------------------------------------
#  Create Dataset using Pandas 
# -----------------------------------------------
data = {
    'Student': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'Math': [78, 45, 62, 89, 56, 90, 48, 70, 80, 55],
    'Science': [72, 50, 65, 85, 60, 95, 52, 68, 88, 58],
    'English': [75, 48, 70, 92, 66, 85, 55, 74, 90, 60],
    'Grade': ['B', 'C', 'B', 'A', 'C', 'A', 'C', 'B', 'A', 'C'],
    'Passed': [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]  # 1=Pass, 0=Fail
}

df = pd.DataFrame(data)
print("--------------------------------------------------")
print(" Original DataFrame:\n")
print(df)

# -----------------------------------------------
#  Encode Categorical Data (Grade) 
# -----------------------------------------------
df['Grade_encoded'] = df['Grade'].map({'A': 3, 'B': 2, 'C': 1})

print("\n--------------------------------------------------")
print(" Data after Encoding:\n")
print(df)

# -----------------------------------------------
#  Define Features (X) and Target (y) 
# -----------------------------------------------
X = df[['Math', 'Science', 'English', 'Grade_encoded']]
y = df['Passed']

# -----------------------------------------------
#  Split Data into Training and Testing Sets 
# -----------------------------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# -----------------------------------------------
#  Train the Model 
# -----------------------------------------------
model = LogisticRegression()
model.fit(X_train, y_train)

# -----------------------------------------------
#  Make Predictions 
# -----------------------------------------------
y_pred = model.predict(X_test)

# -----------------------------------------------
#  Evaluate the Model 
# -----------------------------------------------
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print("\n--------------------------------------------------")
print(" Model Accuracy:", round(accuracy * 100, 2), "%")

print("\n--------------------------------------------------")
print(" Confusion Matrix:\n")
print(cm)

# -----------------------------------------------
#  Add Predictions Back to DataFrame 
# -----------------------------------------------
df.loc[X_test.index, 'Predicted'] = y_pred
print("\n--------------------------------------------------")
print(" DataFrame with Predictions:\n")
print(df)

# -----------------------------------------------
#  Display Testing Data used 
# -----------------------------------------------
print("\n--------------------------------------------------")
print(" Test Samples (X_test):\n")
print(X_test)

print("\n--------------------------------------------------")
print(" Actual vs Predicted on Test Set:\n")
print(pd.DataFrame({'Actual': y_test, 'Predicted': y_pred}, index=X_test.index))

# -----------------------------------------------
#  Optional: Visualization 
# -----------------------------------------------
plt.scatter(df['Math'], df['Passed'], color='green', label='Actual')
plt.scatter(df['Math'], df['Predicted'], color='red', label='Predicted', marker='x')
plt.xlabel("Math Marks")
plt.ylabel("Pass (1) / Fail (0)")
plt.title("Actual vs Predicted Results")
plt.legend()
plt.show()

Output:

--------------------------------------------------
 Original DataFrame:

  Student  Math  Science  English Grade  Passed
0       A    78       72       75     B       1
1       B    45       50       48     C       0
2       C    62       65       70     B       1
3       D    89       85       92     A       1
4       E    56       60       66     C       0
5       F    90       95       85     A       1
6       G    48       52       55     C       0
7       H    70       68       74     B       1
8       I    80       88       90     A       1
9       J    55       58       60     C       0

--------------------------------------------------
 Data after Encoding:

  Student  Math  Science  English Grade  Passed  Grade_encoded
0       A    78       72       75     B       1              2
1       B    45       50       48     C       0              1
2       C    62       65       70     B       1              2
3       D    89       85       92     A       1              3
4       E    56       60       66     C       0              1
5       F    90       95       85     A       1              3
6       G    48       52       55     C       0              1
7       H    70       68       74     B       1              2
8       I    80       88       90     A       1              3
9       J    55       58       60     C       0              1

--------------------------------------------------
 Model Accuracy: 100.0 %

--------------------------------------------------
 Confusion Matrix:
[[1 0]
 [0 2]]

--------------------------------------------------
 DataFrame with Predictions:

  Student  Math  Science  English Grade  Passed  Grade_encoded  Predicted
0       A    78       72       75     B       1              2        NaN
1       B    45       50       48     C       0              1        0.0
2       C    62       65       70     B       1              2        NaN
3       D    89       85       92     A       1              3        NaN
4       E    56       60       66     C       0              1        NaN
5       F    90       95       85     A       1              3        NaN
6       G    48       52       55     C       0              1        0.0
7       H    70       68       74     B       1              2        NaN
8       I    80       88       90     A       1              3        1.0
9       J    55       58       60     C       0              1        NaN

--------------------------------------------------
 Test Samples (X_test):

   Math  Science  English  Grade_encoded
8    80       88       90              3
1    45       50       48              1
6    48       52       55              1

--------------------------------------------------
 Actual vs Predicted on Test Set:

   Actual  Predicted
8       1          1
1       0          0
6       0          0


PATSY – Python Library for Statistical Modeling

Patsy is a Python library used to describe statistical models and prepare data for analysis using formulas.
It helps us easily connect data (variables) to models like linear regression, logistic regression, etc.
We can think of Patsy as a translator between your data and the StatsModels library.

Patsy converts formulas like 'y ~ x1 + x2' into numeric data (design matrices) that can be used in statistical models.

Patsy Syntax:

Basic syntax:

'y ~ x1 + x2'
  • ~ → separates the dependent variable (y) from the independent variables (x1, x2)
  • + → adds more independent variables
  • - → removes a variable or the intercept (e.g., - 1 removes the intercept)
  • * → includes both variables and their interaction (e.g., x1*x2)
  • : → includes only the interaction term (e.g., x1:x2)
  • np.log(x1) → applies a transformation
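
One way to see what each operator does is to look at the column names Patsy generates. The sketch below assumes a small made-up DataFrame (positive values so that np.log is defined) and simply prints the design-matrix columns for each formula.

```python
import numpy as np
import pandas as pd
import patsy

# Hypothetical data with positive values so np.log is defined
df = pd.DataFrame({
    "y": [1.0, 2.0, 3.0, 4.0],
    "x1": [1.0, 2.0, 4.0, 8.0],
    "x2": [3.0, 1.0, 4.0, 1.0],
})

formulas = [
    "y ~ x1 + x2",      # two predictors plus the implicit intercept
    "y ~ x1 + x2 - 1",  # same predictors, intercept removed
    "y ~ x1 * x2",      # x1, x2 and their interaction
    "y ~ x1:x2",        # interaction term only
    "y ~ np.log(x1)",   # transformation applied inside the formula
]

columns = {}
for formula in formulas:
    y_mat, X_mat = patsy.dmatrices(formula, data=df, return_type="dataframe")
    columns[formula] = list(X_mat.columns)
    print(formula, "->", columns[formula])
```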

Why is Patsy Used?

Reason | Explanation
Simple Formula Syntax | You can write models as formulas instead of manually selecting columns.
Automatic Data Preparation | It automatically separates dependent (y) and independent (x) variables.
Works with StatsModels | Patsy is used along with the StatsModels library for model fitting.
Handles Categorical Data | It automatically converts text values into dummy (numeric) variables.
Supports Transformations | You can apply functions such as np.log(x1) and add interaction terms such as x1*x2 directly in formulas.
Reduces Manual Work | Saves time and reduces errors while preparing data for models.
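
The categorical handling mentioned in the table can be seen in a small sketch. The `sales` and `city` columns are hypothetical; the point is only that a text column placed in a formula comes out as numeric dummy columns, with one level used as the reference.

```python
import pandas as pd
import patsy

# Hypothetical data with one text (categorical) column
df = pd.DataFrame({
    "sales": [10, 12, 9, 15, 11, 14],
    "city": ["Delhi", "Mumbai", "Chennai", "Delhi", "Mumbai", "Chennai"],
})

# The text column is expanded into dummy columns automatically
y_mat, X_mat = patsy.dmatrices("sales ~ city", data=df, return_type="dataframe")

print(list(X_mat.columns))
print(X_mat)
```

Here the first level in sorted order (Chennai) acts as the reference category, so it has no column of its own and is represented by the intercept.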


Code:

import pandas as pd
import numpy as np
import statsmodels.api as sm
import patsy

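# Create sample data
# Note: x1 is exactly 2 * x2 here, so the two predictors are perfectly
# collinear and their individual coefficients are not uniquely determined.
df = pd.DataFrame({
    'y': [1, 3, 5, 7, 9],
    'x1': [2, 4, 6, 8, 10],
    'x2': [1, 2, 3, 4, 5]
})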

# Create formula
formula = 'y ~ x1 + x2'

# Create model matrices
y, X = patsy.dmatrices(formula, data=df, return_type='dataframe')

# Fit model
model = sm.OLS(y, X)
results = model.fit()

# Show results
print(results.summary())  

Output:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 2.278e+31
Date:                Fri, 10 Oct 2025   Prob (F-statistic):           4.39e-47
Time:                        20:10:00   Log-Likelihood:                 174.06
No. Observations:                   5   AIC:                            -344.1
Df Residuals:                       2   BIC:                            -345.5
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0000   3.06e-15   1.31e-13      1.000      -1e-13       1e-13
x1             1.0000   3.06e-16   3.27e+15      0.000       1.000       1.000
x2          1.776e-15   1.09e-15      1.632      0.249   -3.32e-15    6.87e-15
==============================================================================
Omnibus:                        0.000   Durbin-Watson:                   1.000
Prob(Omnibus):                  1.000   Jarque-Bera (JB):                0.000
Skew:                           0.000   Prob(JB):                        1.000
Kurtosis:                       3.000   Cond. No.                         23.1
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Introduction to Statsmodels:

Refer to the link below:

Plotting

Matplotlib is a Python library for data visualization. It helps you create charts, graphs, and plots to better understand and present data. 

You can use it to draw:
  • Line charts
  • Bar charts
  • Pie charts
  • Scatter plots
  • Histograms
First, we have to import matplotlib.

import matplotlib.pyplot as plt

Here,
  • matplotlib → main library
  • pyplot → module that contains functions for creating plots
pyplot provides a state-based interface: it keeps track of the current figure and axes, so you can create and modify plots with simple function calls.
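
The idea of a "current" figure and axes can be made visible with plt.gca(), which returns the axes pyplot is currently drawing on. The sketch below contrasts this with holding the figure and axes explicitly via plt.subplots(); that second style is included only for comparison. The Agg backend line is an assumption so the script also runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt

# State-based style: each call acts on the current figure and axes
plt.figure()
plt.plot([1, 2, 3], [2, 4, 6])
plt.title("State-based")
current_axes = plt.gca()  # the axes pyplot has been drawing on
print(current_axes.get_title())

# Explicit style: the figure and axes objects are held in variables
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 6])
ax.set_title("Explicit axes")
print(ax.get_title())

# After plt.subplots(), the newly created axes become the current ones
print(plt.gca() is ax)
```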

Commonly Used pyplot Functions:

Function | Description
plt.plot() | Draws line plots
plt.scatter() | Creates scatter plots
plt.bar() | Creates bar charts
plt.hist() | Plots histograms
plt.pie() | Creates pie charts
plt.title() | Adds a title to the plot
plt.xlabel() / plt.ylabel() | Labels axes
plt.legend() | Adds a legend
plt.grid() | Adds a grid
plt.show() | Displays the plot
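
Three of the functions from the table are sketched below side by side. The marks are made-up values, plt.subplot() (not listed in the table) is used only to place the charts next to each other, and the output file name is hypothetical. In an interactive session, plt.show() can be used in place of plt.savefig().

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt

# Hypothetical data
subjects = ["Math", "Science", "English"]
average_marks = [67, 69, 72]
all_marks = [45, 48, 55, 56, 62, 70, 78, 80, 89, 90]

plt.figure(figsize=(12, 4))

# Bar chart: one bar per category
plt.subplot(1, 3, 1)
plt.bar(subjects, average_marks)
plt.title("Bar chart")

# Histogram: how often values fall into each of 5 bins
plt.subplot(1, 3, 2)
plt.hist(all_marks, bins=5)
plt.title("Histogram")

# Pie chart: each value as a share of the total
plt.subplot(1, 3, 3)
plt.pie(average_marks, labels=subjects)
plt.title("Pie chart")

plt.savefig("chart_types.png")  # hypothetical output file name
```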

  • grid is a set of horizontal and vertical lines drawn across the plot background to make reading values easier. 
plt.grid(True)
  • legend is a box that explains what each line, color, or shape in your plot represents. 
plt.legend()
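
A minimal sketch that puts the grid and legend together on one plot. The sales figures and the output file name are invented for illustration. The label= argument on each plotting call is what plt.legend() uses for its entries.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt

# Hypothetical data
months = [1, 2, 3, 4, 5]
product_a = [10, 12, 15, 14, 18]
product_b = [8, 9, 11, 13, 12]

plt.figure()
plt.plot(months, product_a, label="Product A")
plt.scatter(months, product_b, color="red", marker="x", label="Product B")
plt.title("Monthly Sales")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.grid(True)  # background lines that make values easier to read
plt.legend()    # box that maps each label to its line or marker
plt.savefig("monthly_sales.png")  # use plt.show() in an interactive session
```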

Refer to the following link for all plots:












