Data Cleaning and Preparation

Handling Missing Values

Missing values are entries in your dataset that have no recorded data. In Python and Pandas, missing values are typically represented as:
  • NaN (Not a Number) — from NumPy
  • None — a Python built-in null object
Example:

import pandas as pd
import numpy as np

data = {
    "Name": ["Ramya", None, "Swathi"],
    "Age": [24, 28, np.nan],
    "City": ["Chennai", "Delhi", None]
}

df = pd.DataFrame(data)
print(df)

Output:

    Name   Age    City
0  Ramya  24.0  Chennai
1   None  28.0    Delhi
2 Swathi   NaN     None

Why Should We Handle or Remove Missing Values?

Missing data can significantly affect the quality and outcome of data analysis or machine learning models.

1. Accuracy and Validity
    If missing values are ignored, calculations like mean, sum, correlation, etc., can give misleading results (see the quick check after this list).

2. Model Errors in Machine Learning
    Most machine learning algorithms cannot work with missing values. They will throw errors unless the data is cleaned.

3. Bias in Analysis
    If missing values are unevenly distributed across categories (e.g., mostly missing from one city or gender), it can lead to biased insights.

4. Data Integrity
    Missing data can indicate issues in data collection, sensors, user input, or system bugs. It's important to understand the cause.
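
A quick check of point 1 (a minimal sketch): pandas silently skips NaN when computing statistics, so a summary can be based on fewer rows than you expect.

import pandas as pd
import numpy as np

ages = pd.Series([24, 28, np.nan])

# NaN is skipped by default, so this mean uses only 2 of the 3 entries
print(ages.mean())              # 26.0
# Forcing NaN to count makes the gap visible
print(ages.mean(skipna=False))  # nan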

Detecting missing values

Missing values often occur due to incomplete data collection, errors during data entry, or system-related issues. To work effectively with your data, the first step is to detect where these missing values occur.

Function            Description
df.isnull()         Returns a DataFrame of the same shape with True where data is missing
df.notnull()        Returns True where data is not missing
df.isnull().sum()   Counts the number of missing (NaN) values in each column

 Example:

import pandas as pd
import numpy as np

# Sample dataset with some missing values
data = {
    "Name": ["Ramya", "Swathi", "Lalitha", "Lucky", "Harish", None],
    "Age": [23, np.nan, 27, 24, None, 29],
    "City": ["Hyderabad", "Chennai", None, "Delhi", "Bangalore", None]
}

df = pd.DataFrame(data)

# Display the DataFrame
print("Original DataFrame:")
print(df)

# Check for missing values (True for NaN or None)
print("\nDetecting missing values with df.isnull():")
print(df.isnull())

# Count total missing values per column
print("\nMissing values per column:")
print(df.isnull().sum())

# Detect non-missing (valid) values
print("\nDetecting non-missing values with df.notnull():")
print(df.notnull())

Output:

Original DataFrame:
      Name   Age       City
0    Ramya  23.0  Hyderabad
1   Swathi   NaN    Chennai
2  Lalitha  27.0       None
3    Lucky  24.0      Delhi
4   Harish   NaN  Bangalore
5     None  29.0       None

Detecting missing values with df.isnull():
    Name    Age   City
0  False  False  False
1  False   True  False
2  False  False   True
3  False  False  False
4  False   True  False
5   True  False   True

Missing values per column:
Name    1
Age     2
City    2
dtype: int64

Detecting non-missing values with df.notnull():
    Name   Age   City
0   True  True  True
1   True False  True
2   True  True False
3   True  True  True
4   True False  True
5  False  True False

Handling Missing Data Techniques

1. Deletion

Once missing values are detected, the next step is often to remove them, either rows or columns, depending on your analysis goals.

Why Remove Missing Values?
  • When missing values are few and do not significantly affect the dataset.
  • When you don't want to fill them with default or estimated values.
  • To ensure compatibility with tools/models that don’t accept NaNs.
Example:

import pandas as pd
import numpy as np

data = {
    "Name": ["Ramya", "Swathi", "Lalitha", "Lucky", "Harish"],
    "Age": [23, np.nan, 27, 24, None],
    "City": ["Hyderabad", "Chennai", None, "Delhi", "Bangalore"]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Output:

     Name   Age       City
0   Ramya  23.0  Hyderabad
1  Swathi   NaN    Chennai
2 Lalitha  27.0       None
3   Lucky  24.0      Delhi
4  Harish   NaN  Bangalore

1. Remove Rows with Any Missing Values
 
df_any = df.dropna(how='any')
print("\nRows with any missing values removed:")
print(df_any)

Removes a row if even one value is missing.
Rows 1, 2, and 4 contain at least one NaN → they get dropped.

Output:

    Name   Age      City
0  Ramya  23.0  Hyderabad
3  Lucky  24.0     Delhi

2. Remove Rows with All Missing Values

Example:

import pandas as pd
import numpy as np

data = {
    "Name": ["Ramya", "Swathi", "Lalitha", "Lucky", "Harish", None],
    "Age": [23, np.nan, 27, 24, None, None],
    "City": ["Hyderabad", "Chennai", None, "Delhi", "Bangalore", None]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

df_all = df.dropna(how='all')
print("\nRows with all missing values removed:")
print(df_all)

Removes a row only if all of its values are missing.
Row 5 has every value missing, so it is dropped.

Output:
     Name   Age       City
0   Ramya  23.0  Hyderabad
1  Swathi   NaN    Chennai
2 Lalitha  27.0       None
3   Lucky  24.0      Delhi
4  Harish   NaN  Bangalore

3. Drop Columns with Any Missing Values

Example:

import pandas as pd
import numpy as np

data = {
    "Name": ["Ramya", "Swathi", "Lalitha", "Lucky", "Harish"],
    "Age": [23, np.nan, 27, 24, None],
    "City": ["Hyderabad", "Chennai", None, "Delhi", "Bangalore"]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
df_col_any = df.dropna(axis=1, how='any')
print("\nColumns with any missing values removed:")
print(df_col_any)

Removes a column if any row contains a missing value.
Both Age and City have missing values → they are dropped.
Only Name remains.

Output:

     Name
0   Ramya
1  Swathi
2 Lalitha
3   Lucky
4  Harish

4. Drop Columns with All Missing Values

df_col_all = df.dropna(axis=1, how='all')
print("\nColumns with all missing values removed:")
print(df_col_all)

Removes a column only if all values in it are missing.
No column has all missing values → nothing is dropped.

Output:

     Name   Age       City
0   Ramya  23.0  Hyderabad
1  Swathi   NaN    Chennai
2 Lalitha  27.0       None
3   Lucky  24.0      Delhi
4  Harish   NaN  Bangalore

 2. Imputation

Imputation is the process of replacing missing data with substitute values so that the dataset remains usable for analysis or modeling.

Why Use Imputation?
  • Maintains the size of the dataset (no loss of rows)
  • Avoids errors during model training (many ML models can't handle NaN).
  • Helps preserve important patterns in the data.
  • Better than dropping data, especially when the missing values are few.
1. Mean Imputation (For numeric columns)
    
Replace missing values with the mean of the column.
Use when data is numeric and not heavily skewed (no outliers).

Example:

import pandas as pd
import numpy as np

data = {
    "Name": ["Ramya", "Swathi", "Lalitha", "Lucky", "Harish"],
    "Age": [23, np.nan, 27, 24, None],
    "City": ["Hyderabad", "Chennai", None, "Delhi", "Bangalore"]
}

df = pd.DataFrame(data)
print("Original Data:")
print(df)

df['Age'] = df['Age'].fillna(df['Age'].mean())
print("Data after mean Imputation:")
print(df)

Output:

Original Data:
     Name   Age       City
0   Ramya  23.0  Hyderabad
1  Swathi   NaN    Chennai
2 Lalitha  27.0       None
3   Lucky  24.0      Delhi
4  Harish   NaN  Bangalore

Data after mean Imputation:
     Name        Age       City
0   Ramya  23.000000  Hyderabad
1  Swathi  24.666667    Chennai
2 Lalitha  27.000000       None
3   Lucky  24.000000      Delhi
4  Harish  24.666667  Bangalore

2. Mode Imputation (for categorical columns)

Fill missing string values with the most frequent value (mode).
Use for categorical columns (City, Name) when you want to fill with the most common value. In this sample every city appears exactly once, so they are all tied; mode() returns the tied values sorted alphabetically, and [0] picks 'Bangalore'.

df['City'] = df['City'].fillna(df['City'].mode()[0])
print(df)

Output:

     Name        Age       City
0   Ramya  23.000000  Hyderabad
1  Swathi  24.666667    Chennai
2 Lalitha  27.000000  Bangalore
3   Lucky  24.000000      Delhi
4  Harish  24.666667  Bangalore

3. Median Imputation

Replace missing values with the median of that column.
Use when numeric data has outliers (median is more robust).

# Reset original data
df = pd.DataFrame(data)
df['Age'] = df['Age'].fillna(df['Age'].median())
print(df)

Output:

     Name   Age       City
0   Ramya  23.0  Hyderabad
1  Swathi  24.0    Chennai
2 Lalitha  27.0       None
3   Lucky  24.0      Delhi
4  Harish  24.0  Bangalore

4. Constant Value Imputation

You can fill missing values with fixed placeholders like "Unknown" or 0.
Use when you want to mark missing values clearly.

# Fill missing values in Age and City
df = pd.DataFrame(data)
df.fillna({"Age": 0, "City": "Unknown"}, inplace=True)
print(df)

Output:

     Name   Age       City
0   Ramya  23.0  Hyderabad
1  Swathi   0.0    Chennai
2 Lalitha  27.0    Unknown
3   Lucky  24.0      Delhi
4  Harish   0.0  Bangalore

5. Forward Fill 

You can fill missing values from the previous row.
Use for time series or where previous values can be logically carried forward.

df = pd.DataFrame(data)
df = df.ffill()  # forward fill; fillna(method='ffill') is deprecated in newer pandas
print(df)

Output:

     Name   Age       City
0   Ramya  23.0  Hyderabad
1  Swathi  23.0    Chennai
2 Lalitha  27.0    Chennai
3   Lucky  24.0      Delhi
4  Harish  24.0  Bangalore

6. Backward Fill

You can fill missing values from the next row. The last Age stays NaN because there is no later row to copy from.

df = pd.DataFrame(data)
df = df.bfill()  # backward fill; fillna(method='bfill') is deprecated in newer pandas
print(df)

Output:

     Name   Age       City
0   Ramya  23.0  Hyderabad
1  Swathi  27.0    Chennai
2 Lalitha  27.0      Delhi
3   Lucky  24.0      Delhi
4  Harish   NaN  Bangalore

7. Interpolation

Interpolation is a technique used to estimate missing values by using the values before and after the missing one. Instead of filling NaN with static values (like mean or mode), interpolation intelligently estimates what the missing value should be based on existing data.

import pandas as pd
import numpy as np

data = {
    "Name": ["Ramya", "Swathi", "Lalitha", "Lucky", "Harish"],
    "Age": [23, np.nan, 27, 24, None],
    "City": ["Hyderabad", "Chennai", None, "Delhi", "Bangalore"]
}

df = pd.DataFrame(data)
print("Original Data:")
print(df)

# Interpolate missing Age values
df['Age'] = df['Age'].interpolate(method='linear')
print("After Interpolation:")
print(df)

Output:

Original Data:
     Name   Age       City
0   Ramya  23.0  Hyderabad
1  Swathi   NaN    Chennai
2 Lalitha  27.0       None
3   Lucky  24.0      Delhi
4  Harish   NaN  Bangalore

After Interpolation:
     Name   Age       City
0   Ramya  23.0  Hyderabad
1  Swathi  25.0    Chennai
2 Lalitha  27.0       None
3   Lucky  24.0      Delhi
4  Harish  24.0  Bangalore

Outliers

An outlier is a data point that is very different (too high or too low) compared to the rest of the data.

Example:
Salaries = [30k, 35k, 32k, 34k, 5 crore]
Here, 5 crore (50,000,000) is an outlier.

Outliers can distort averages, scaling, and machine learning models, so we sometimes remove or treat them.
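
A quick way to see the damage (a minimal sketch, writing 5 crore as 50,000,000): the mean is dragged far away from every typical salary, while the median barely moves.

import pandas as pd

salaries = pd.Series([30_000, 35_000, 32_000, 34_000, 50_000_000])

print(salaries.mean())    # 10026200.0 -- nowhere near a typical salary
print(salaries.median())  # 34000.0    -- barely affected by the outlier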

Outlier Detection techniques:

1. IQR Method (Interquartile Range)

Based on percentiles:
  • Q1 = 25th percentile
  • Q3 = 75th percentile
  • IQR = Q3 − Q1
  • Outlier limits:
    • Lower = Q1 − 1.5 × IQR
    • Upper = Q3 + 1.5 × IQR
IQR is more robust for skewed or non-normal data.

Example:

import pandas as pd

# Step 1: Create the DataFrame
data = {'Salary': [30000, 32000, 35000, 34000, 37000, 36000, 39000, 41000, 43000, 5000000]}
df = pd.DataFrame(data)

# Step 2: Calculate Q1, Q3, and IQR
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1

# Step 3: Define outlier limits
lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR

# Step 4: Filter out outliers
df_iqr = df[(df['Salary'] >= lower_limit) & (df['Salary'] <= upper_limit)]

print("Cleaned Data using IQR:")
print(df_iqr)
 
Output:

Cleaned Data using IQR:
   Salary
0   30000
1   32000
2   35000
3   34000
4   37000
5   36000
6   39000
7   41000
8   43000

Step-by-Step Manual Calculation
  1. Sort the data in ascending order:
    [30000, 32000, 34000, 35000, 36000, 37000, 39000, 41000, 43000, 5000000]
  2. There are 10 values (even number), so:

    • Lower half (first 5 values):
      [30000, 32000, 34000, 35000, 36000]
      Q1 = middle value = 34000

    • Upper half (last 5 values):
      [37000, 39000, 41000, 43000, 5000000]
      Q3 = middle value = 41000

  3. IQR (Interquartile Range) = Q3 − Q1
    IQR = 41000 − 34000 = 7000

  4. Outlier limits:
    • Lower limit = Q1 − 1.5 × IQR
      34000 − 1.5 × 7000 = 34000 − 10500 = 23500

    • Upper limit = Q3 + 1.5 × IQR
      41000 + 1.5 × 7000 = 41000 + 10500 = 51500

  5. Conclusion:
    • Any value below 23500 or above 51500 is an outlier.
    • In this case, 5000000 is an outlier, and all other values are within range.
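
For reference (a small sketch): pandas' quantile() interpolates between values, so it returns slightly different quartiles than the manual halves method above, but 5,000,000 is flagged as an outlier either way.

import pandas as pd

salary = pd.Series([30000, 32000, 35000, 34000, 37000, 36000, 39000, 41000, 43000, 5000000])

Q1 = salary.quantile(0.25)   # 34250.0 (interpolated)
Q3 = salary.quantile(0.75)   # 40500.0 (interpolated)
IQR = Q3 - Q1                # 6250.0

print(Q1 - 1.5 * IQR)        # 24875.0 -> lower limit
print(Q3 + 1.5 * IQR)        # 49875.0 -> upper limit
print(salary[salary > Q3 + 1.5 * IQR].tolist())   # [5000000] -- still the only outlier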

2. Z-score

Z-score measures how many standard deviations a value is from the mean.
Z-score works better for normally distributed data.
Formula:
z = (x − mean) / std
Values with z > 3 or z < -3 are usually considered outliers (can be adjusted depending on the use case).

Example:

from scipy import stats
import numpy as np

# Step 1: Calculate Z-scores
z_scores = np.abs(stats.zscore(df['Salary']))

# Step 2: Filter based on a threshold
# Note: with only 10 values, the huge outlier inflates the standard deviation and its own
# z-score lands at about 3.0, so a slightly lower cutoff of 2.5 is used here.
df_z_clean = df[z_scores < 2.5]

print("Cleaned Data using Z-Score:")
print(df_z_clean)

Output:
Cleaned Data using Z-Score:
   Salary
0   30000
1   32000
2   35000
3   34000
4   37000
5   36000
6   39000
7   41000
8   43000

Calculation:
Mean = 5,327,000 / 10 = 532,700
Standard deviation ≈ 1,489,000
Z-score of 5,000,000 ≈ (5,000,000 − 532,700) / 1,489,000 ≈ 3.0
Every other salary has a z-score below about 0.4, so with the 2.5 cutoff only 5,000,000 is removed.

3. Visualization (Boxplot)

Visualizations are a quick way to see outliers.

Example:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Define salary data
Salaries = [30000, 32000, 35000, 34000, 37000, 36000, 39000, 41000, 43000, 5000000]

# Create DataFrame with column name 'Salary'
df = pd.DataFrame({'Salary': Salaries})

# Plot boxplot
plt.boxplot(df['Salary'])
plt.title("Salary Distribution (with Outlier)")
plt.ylabel("Salary")
plt.show()

Output:




The box is squashed near the bottom of the plot, spanning roughly Q1 (about 34,000) to Q3 (about 41,000).
The line inside the box is the median (36,500).
The point far above the box is the outlier 5,000,000.

Data Transformation

Data transformation refers to the process of converting raw data into a suitable format that enables a computer (such as a machine learning model, database, or analysis tool) to understand, compare, and utilize it effectively. 

⭐Raw data often has problems:
  • Different scales (e.g., marks out of 100 vs age in years).

        Student   Math (out of 100)   Height (cm)
        A         90                  150
        B         80                  170
        C         70                  160

    If you give this table directly to a computer (machine learning model), Math values lie between 70 and 90 while Height values lie between 150 and 170, so the computer treats Height as more important simply because its numbers are bigger, even though we care about both equally. Without scaling, features with large values dominate (see the sketch after this list).
  • Different formats (e.g., text “A/B/C” vs numbers).
Computers don’t understand “A, B, C” — they need numbers.
  • Missing values.
  • Outliers (extreme values).
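
A small sketch of the scaling problem above, using the hypothetical Math/Height table: Min-Max scaling (covered below) puts both columns on the same 0-1 range, so neither dominates just because its numbers are bigger.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

students = pd.DataFrame({
    "Math": [90, 80, 70],        # out of 100
    "Height": [150, 170, 160]    # in cm
})

# Scale each column to the 0-1 range independently
scaled = MinMaxScaler().fit_transform(students)
print(pd.DataFrame(scaled, columns=students.columns))
#    Math  Height
# 0   1.0     0.0
# 1   0.5     1.0
# 2   0.0     0.5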

Data Transformation Techniques

1. Scaling / Normalization / Standardization

Scaling is used to bring all numerical values to a common range, especially when features have different units (like marks and age).

Types:
  • Normalization (Min-Max Scaling): scales values between 0 and 1

Formula for Normalization:

    x' = (x − min) / (max − min)

Where:
x = each value in the column
min = smallest value in the column
max = largest value in the column
👉 This transformation scales all values between 0 and 1.
        
For example, numbers = [10, 20, 30, 40, 50]
min=10
max=50
 
Now apply the formula: 
 
Value   Calculation                        Result
10      (10 − 10) / (50 − 10) = 0 / 40     0.00
20      (20 − 10) / 40 = 10 / 40           0.25
30      (30 − 10) / 40 = 20 / 40           0.50
40      (40 − 10) / 40 = 30 / 40           0.75
50      (50 − 10) / 40 = 40 / 40           1.00

 

All values are now between 0 and 1.

This is useful for algorithms like KNN, SVM, and Neural Networks that are sensitive to scale.
✅ Use Min–Max Scaling when you know the min and max are fixed and outliers are not a big issue.  

Program:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Sample data
df = pd.DataFrame({
    'Marks': [40, 60, 80, 100],
    'Age': [20, 25, 30, 35]
})

# Apply Min-Max Scaling
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)

df_scaled = pd.DataFrame(scaled, columns=['Marks', 'Age'])
print("Scaled Data:")
print(df_scaled)

Output:

Scaled Data:
      Marks       Age
0  0.000000  0.000000
1  0.333333  0.333333
2  0.666667  0.666667
3  1.000000  1.000000
 
  • Standardization (Z-score): transforms values to have mean = 0 and std = 1
This makes data centered and scaled, which is especially useful for models that assume a normal distribution (e.g., logistic regression, SVM, etc.)

Formula for Standardization:

    x' = (x − μ) / σ
Where:
x = each value
μ = mean of the data
σ = standard deviation of the data

For example, numbers = [10, 20, 30, 40, 50]
mean= 30
σ ≈ 14.14

Now apply the formula: 

x' = (x − 30) / 14.14

Original   Calculation                         Z-score (Standardized)
10         (10 − 30) / 14.14 = −20 / 14.14     −1.41
20         (20 − 30) / 14.14 = −10 / 14.14     −0.71
30         (30 − 30) / 14.14 = 0                0.00
40         (40 − 30) / 14.14 = 10 / 14.14       0.71
50         (50 − 30) / 14.14 = 20 / 14.14       1.41
         
Z-score scaling centers the data and removes scale-related bias.

Program:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Step 1: Create the DataFrame
data = {'Value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Step 2: Apply Standardization (Z-score)
scaler = StandardScaler()
df['Standardized'] = scaler.fit_transform(df[['Value']])

# Step 3: Print Result
print("Standardized Data:")
print(df)


Output:
Standardized Data:
   Value  Standardized
0     10     -1.414214
1     20     -0.707107
2     30      0.000000
3     40      0.707107
4     50      1.414214

2. Encoding

Machine learning models need numeric data, so text (categorical) values must be encoded.
⦿ Label Encoding

Label Encoding converts categorical text values into numeric labels (integers).
Each unique category is assigned a unique integer.  
 
Example:
Input:    ['Red', 'Green', 'Blue']
Encoded:  [  2  ,   1   ,   0   ]
 
 
Program:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Grade': ['A', 'B', 'A', 'C']})
encoder = LabelEncoder()
df['Grade_Encoded'] = encoder.fit_transform(df['Grade'])
print(df)

Output:

  Grade  Grade_Encoded
0     A              0
1     B              1
2     A              0
3     C              2

⦿ One-Hot Encoding
 One-Hot Encoding converts each category into a binary vector (0s and 1s).
Each unique category becomes its own column with 1 indicating presence.
Example:
Input:    ['Red', 'Green', 'Blue']

Encoded:
   Red  Green  Blue
0   1      0     0
1   0      1     0
2   0      0     1
 
Program: 
 
import pandas as pd

df = pd.DataFrame({'Grade': ['A', 'B', 'A', 'C']})
# dtype=int shows the dummy columns as 0/1 instead of True/False
df_one_hot = pd.get_dummies(df, columns=['Grade'], dtype=int)
print(df_one_hot)

Output:
 
   Grade_A  Grade_B  Grade_C
0        1        0        0
1        0        1        0
2        1        0        0
3        0        0        1

 3. Handling Missing Values

Missing values (NaN) can break models. They should be filled (imputed) using strategies like mean, median, mode, or interpolation.
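
A compact sketch of this step using scikit-learn's SimpleImputer (one common option; the column name here is made up):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"Age": [23, np.nan, 27, 24]})

# Replace NaN with the column mean; strategy can also be 'median' or 'most_frequent'
imputer = SimpleImputer(strategy="mean")
df[["Age"]] = imputer.fit_transform(df[["Age"]])
print(df)   # the NaN becomes 24.666667, the mean of 23, 27 and 24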

4. Removing Outliers

Outliers are extreme values that can distort the model. Common ways to detect and remove them are the IQR method, the Z-score method, and visualization (e.g., boxplots), as shown earlier.
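
A compact, reusable version of the IQR filter from earlier (a sketch; the helper name and sample values are made up):

import pandas as pd

def remove_iqr_outliers(s: pd.Series) -> pd.Series:
    # Keep only values within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)]

salaries = pd.Series([30000, 32000, 35000, 34000, 5000000])
print(remove_iqr_outliers(salaries))   # keeps the four typical salaries, drops 5000000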

5. Feature Engineering

Feature Engineering means creating new features from existing data that might help the model.
Program:

import pandas as pd

df = pd.DataFrame({
    'Join_Date': pd.to_datetime(['2020-01-15', '2021-03-22', '2022-07-01'])
})

df['Year'] = df['Join_Date'].dt.year
df['Month'] = df['Join_Date'].dt.month
df['Day'] = df['Join_Date'].dt.day

print(df)

Output:
   Join_Date  Year  Month  Day
0 2020-01-15  2020      1   15
1 2021-03-22  2021      3   22
2 2022-07-01  2022      7    1

String Manipulation

String manipulation is a key part of data cleaning when working with textual or categorical data. It helps fix inconsistencies, remove noise, and prepare text for analysis or modeling.

⭐Common Use Cases:
  • Removing extra spaces
  • Converting to lowercase or uppercase
  • Replacing or removing unwanted characters
  • Extracting substrings
  • Splitting strings (e.g., "First Last" → ["First", "Last"])
  • Detecting or replacing missing/invalid values
  • Standardizing formats (e.g., emails, dates)
Indexing:
It means accessing a specific character in a string using its position.
Python uses 0-based indexing (first character is at position 0).

name = "Alice"
print(name[0])   # Output: 'A'
print(name[-1])  # Output: 'e' (last character)

Slicing:
Slicing means extracting a substring from a string using a range of indexes.

name = "Charlie"
print(name[0:3])   # Output: 'Cha' (from index 0 to 2)
print(name[2:])    # Output: 'arlie' (from index 2 to end)


Common String Methods:

Method                      Purpose
str.strip()                 Removes leading/trailing spaces from each string
str.lower() / str.upper()   Converts strings to lowercase / uppercase
str.replace(old, new)       Replaces substrings (e.g., '@' with '')
str.contains('text')        Returns True/False if the substring is present
str.split(delimiter)        Splits each string into a list based on the delimiter
str.len()                   Returns the length of each string
str.extract(r'regex')       Extracts a pattern using a regular expression
str.startswith('x')         Returns True if the string starts with the given value
str.endswith('x')           Returns True if the string ends with the given value
str.find('text')            Returns the index of the first occurrence of the substring (-1 if not found)
str.isalpha()               Returns True if all characters are letters only
str.isalnum()               Returns True if all characters are letters or numbers (alphanumeric)
str.isdigit()               Returns True if the string contains only digits
str.join(sep)               Joins the list elements in each cell into one string using the given delimiter


Program:

import pandas as pd

# Raw Data
data = {
    'Name': [' sw@thi '],
    'Email': ['SwaTHI@GMAIL.Com']
}
df = pd.DataFrame(data)
print("🔹 Step 0: Original Data")
print(df, end="\n\n")

# Step 1: Remove spaces from Name using .str.strip()
df['Name_Step1'] = df['Name'].str.strip()
print("🔹 Step 1: After .str.strip() on Name")
print(df[['Name', 'Name_Step1']], end="\n\n")

# Step 2: Remove '@' from Name using .str.replace()
df['Name_Step2'] = df['Name_Step1'].str.replace('@', '', regex=False)
print("🔹 Step 2: After removing '@' from Name")
print(df[['Name_Step1', 'Name_Step2']], end="\n\n")

# Step 3: Capitalize Name using .str.capitalize()
df['Name_Clean'] = df['Name_Step2'].str.capitalize()
print("🔹 Step 3: Final Cleaned Name")
print(df[['Name_Step2', 'Name_Clean']], end="\n\n")

# Step 4: Lowercase Email using .str.lower()
df['Email_Step1'] = df['Email'].str.lower()
print("🔹 Step 4: Lowercase Email")
print(df[['Email', 'Email_Step1']], end="\n\n")

# Step 5: Replace 'gmail' with 'google' in Email
df['Email_Clean'] = df['Email_Step1'].str.replace('gmail', 'google', regex=False)
print("🔹 Step 5: After replacing 'gmail' with 'google'")
print(df[['Email_Step1', 'Email_Clean']], end="\n\n")

# Step 6: Extract domain using regex
df['Email_Domain'] = df['Email_Clean'].str.extract(r'@(.+)$')
print("🔹 Step 6: Extracted Domain from Email")
print(df[['Email_Clean', 'Email_Domain']], end="\n\n")

# Step 7: Indexing - First letter of name
df['First_Letter'] = df['Name_Clean'].str[0]
print("🔹 Step 7: First Letter of Cleaned Name")
print(df[['Name_Clean', 'First_Letter']], end="\n\n")

# Step 8: Slicing - First 3 letters of name
df['Name_Short'] = df['Name_Clean'].str[:3]
print("🔹 Step 8: First 3 Letters of Cleaned Name")
print(df[['Name_Clean', 'Name_Short']], end="\n\n")

# Step 9: Slicing - Last 3 letters of cleaned email
df['Email_End'] = df['Email_Clean'].str[-3:]
print("🔹 Step 9: Last 3 Letters of Cleaned Email")
print(df[['Email_Clean', 'Email_End']], end="\n\n")

Output:

🔹 Step 0: Original Data
       Name              Email
0   sw@thi   SwaTHI@GMAIL.Com

🔹 Step 1: After .str.strip() on Name
       Name     Name_Step1
0   sw@thi     sw@thi

🔹 Step 2: After removing '@' from Name
  Name_Step1 Name_Step2
0     sw@thi     swthi

🔹 Step 3: Final Cleaned Name
  Name_Step2     Name_Clean
0      swthi             Swthi

🔹 Step 4: Lowercase Email
             Email                         Email_Step1
0  SwaTHI@GMAIL.Com  swathi@gmail.com

🔹 Step 5: After replacing 'gmail' with 'google'
     Email_Step1           Email_Clean
0  swathi@gmail.com  swathi@google.com

🔹 Step 6: Extracted Domain from Email
       Email_Clean           Email_Domain
0  swathi@google.com   google.com

🔹 Step 7: First Letter of Cleaned Name
  Name_Clean     First_Letter
0     Swthi             S

🔹 Step 8: First 3 Letters of Cleaned Name
  Name_Clean Name_Short
0     Swthi        Swt

🔹 Step 9: Last 3 Letters of Cleaned Email
       Email_Clean             Email_End
0  swathi@google.com       com

Data Wrangling: Join

A join combines rows from two tables based on a common key column (like CustomerID, key, ProductCode, etc.).

Program:

import pandas as pd

# Left table
df1 = pd.DataFrame({
    'key': ['A', 'B', 'C', 'D'],
    'value1': [1, 2, 3, 4]
})

# Right table
df2 = pd.DataFrame({
    'key': ['B', 'C', 'D', 'E'],
    'value2': [5, 6, 7, 8]
})

print("df1:")
print(df1)

print("\ndf2:")
print(df2)

Output:

df1:
  key  value1
0   A       1
1   B       2
2   C       3
3   D       4

df2:
  key  value2
0   B       5
1   C       6
2   D       7
3   E       8

Inner Join

It keeps only the rows where the key exists in both DataFrames.

inner_join = pd.merge(df1, df2, on='key', how='inner')
print(inner_join)

Output:

  key  value1  value2
0   B       2       5
1   C       3       6
2   D       4       7

Outer Join

It keeps all rows from both DataFrames and fills missing values with NaN.

outer_join = pd.merge(df1, df2, on='key', how='outer')
print(outer_join)

Output:

  key  value1  value2
0   A     1.0     NaN
1   B     2.0     5.0
2   C     3.0     6.0
3   D     4.0     7.0
4   E     NaN     8.0

Left Join

Keeps all rows from df1 (left), adds matching data from df2

left_join = pd.merge(df1, df2, on='key', how='left')
print(left_join)

Output:

  key  value1  value2
0   A       1     NaN
1   B       2     5.0
2   C       3     6.0
3   D       4     7.0

Right Join

Keeps all rows from df2 (right), adds matching data from df1.

right_join = pd.merge(df1, df2, on='key', how='right')
print(right_join)
 
Output:

  key  value1  value2
0   B     2.0       5
1   C     3.0       6
2   D     4.0       7
3   E     NaN       8

Merging with Multiple Keys 

Merging with multiple keys means combining DataFrames on two or more columns that together define the relationship between rows in both tables.

Program:

import pandas as pd

# Left DataFrame
df1 = pd.DataFrame({
    'key1': ['A', 'B', 'C', 'C'],
    'key2': [1, 2, 3, 4],
    'value1': ['apple', 'banana', 'cherry', 'date']
})

# Right DataFrame
df2 = pd.DataFrame({
    'key1': ['A', 'B', 'C', 'C'],
    'key2': [1, 2, 3, 5],
    'value2': ['red', 'yellow', 'dark red', 'brown']
})

print("df1:")
print(df1)

print("\ndf2:")
print(df2)

merged = pd.merge(df1, df2, on=['key1', 'key2'], how='inner')
print("\nMerged on key1 and key2:")
print(merged)


Output:

df1:
  key1  key2  value1
0    A     1   apple
1    B     2  banana
2    C     3  cherry
3    C     4    date

df2:
  key1  key2   value2
0    A     1      red
1    B     2   yellow
2    C     3  dark red
3    C     5    brown

Merged on key1 and key2:
  key1  key2  value1   value2
0    A     1   apple      red
1    B     2  banana   yellow
2    C     3  cherry  dark red

Merging Multiple Data Formats

When you merge a CSV and an Excel file, you're essentially combining them into a single DataFrame using a shared column, just like joining tables in SQL.
Suppose:
  • data.csv contains personal details (name, age) plus a shared key column
  • data.xlsx contains salary info (city, salary) plus the same key column
You can merge them using the common column key:

Program:

import pandas as pd
csv_df = pd.read_csv('data.csv')         # Load CSV file
excel_df = pd.read_excel('data.xlsx')    # Load Excel file

# Join on common column
merged_df = pd.merge(csv_df, excel_df, on='key', how='inner')  
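
Since the snippet above needs files that are not included here, a self-contained sketch with made-up contents for data.csv and data.xlsx (reading and writing .xlsx requires the openpyxl package):

import pandas as pd

# Create small sample files first (illustrative contents only)
pd.DataFrame({"key": [1, 2, 3],
              "name": ["Ramya", "Swathi", "Lucky"],
              "age": [24, 28, 23]}).to_csv("data.csv", index=False)
pd.DataFrame({"key": [1, 2, 4],
              "city": ["Chennai", "Delhi", "Hyderabad"],
              "salary": [50000, 60000, 55000]}).to_excel("data.xlsx", index=False)

csv_df = pd.read_csv("data.csv")
excel_df = pd.read_excel("data.xlsx")

# Inner join keeps only keys 1 and 2, which appear in both files
merged_df = pd.merge(csv_df, excel_df, on="key", how="inner")
print(merged_df)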


Hierarchical Indexing

A MultiIndex (hierarchical index) in pandas allows you to have multiple levels of row or column labels in a single DataFrame.
This is useful when dealing with multi-dimensional data like (Product, Month), (Region, Year), etc., while still using a 2D table.

Example:

import pandas as pd
import numpy as np

# Define the row MultiIndex
index = pd.MultiIndex.from_tuples([
    ('Product A', 'January'),
    ('Product A', 'February'),
    ('Product B', 'January'),
    ('Product B', 'February'),
], names=['Product', 'Month'])

# Define the column MultiIndex
columns = pd.MultiIndex.from_tuples([
    ('East', 'Sales'),
    ('East', 'Profit'),
    ('West', 'Sales'),
    ('West', 'Profit'),
], names=['Region', 'Metric'])

# Create the DataFrame
data = [[100, 20, 120, 25],
        [110, 22, 130, 27],
        [90, 18, 100, 20],
        [95, 19, 105, 22]]

df = pd.DataFrame(data, index=index, columns=columns)
print(df)

Output:

Region              East           West       
Metric             Sales Profit   Sales Profit
Product   Month                                
Product A January    100     20     120     25
          February   110     22     130     27
Product B January     90     18     100     20
          February    95     19     105     22


Inspecting MultiIndex

print(df.index.names)         # ['Product', 'Month']
print(df.columns.names)       # ['Region', 'Metric']
print(df.index.levels)        # Index levels for rows
print(df.columns.levels)      # Index levels for columns

1️⃣ Select all rows for Product A

    print(df.loc['Product A'])

Output:

Region     East               West     
Metric    Sales Profit Sales Profit
Month                              
January     100     20   120     25
February    110     22   130     27

2️⃣ Select column ('West', 'Profit')

    print(df[('West', 'Profit')])

Output:

Product    Month   
Product A  January     25
           February    27
Product B  January     20
           February    22
Name: (West, Profit), dtype: int64
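
You can also pass a row tuple and a column tuple to .loc to pull out a single cell (a small sketch using the same DataFrame):

print(df.loc[('Product A', 'January'), ('East', 'Sales')])   # 100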

3️⃣ Cross-section for Month = 'January'

print(df.xs('January', level='Month'))

Output:

Region      East          West       
Metric     Sales Profit  Sales Profit
Product                              
Product A    100     20    120     25
Product B     90     18    100     20

4️⃣ Cross-section on columns for Metric = 'Sales'

print(df.xs('Sales', level='Metric', axis=1)) 

Output:

Region                       East  West
Product       Month             
Product A   January   100   120
                  February  110   130
Product B  January    90   100
                  February   95   105

Grouping and Aggregation by Level

print(df.groupby(level='Product').sum())

Output:

Region           East        West     
Metric          Sales Profit Sales Profit
Product                                  
Product A         210     42   250     52
Product B         185     37   205     42

Combining and Merging Datasets

We can combine datasets in 2 ways:

1. Merging

merge() is used to join two DataFrames based on a common key column (like SQL joins). It allows combining datasets horizontally (side by side), matching rows using one or more columns.

Program:
import pandas as pd

# Left table
df1 = pd.DataFrame({
    'key': ['A', 'B', 'C', 'D'],
    'value1': [1, 2, 3, 4]
})

# Right table
df2 = pd.DataFrame({
    'key': ['B', 'C', 'D', 'E'],
    'value2': [5, 6, 7, 8]
})

print("df1:")
print(df1)

print("\ndf2:")
print(df2)

inner_join = pd.merge(df1, df2, on='key', how='inner')
print(inner_join)
Output:
df1:
  key  value1
0   A       1
1   B       2
2   C       3
3   D       4

df2:
  key  value2
0   B       5
1   C       6
2   D       7
3   E       8

  key  value1  value2
0   B       2       5
1   C       3       6
2   D       4       7 

2. Concatenation

concat() is used to stack multiple DataFrames either vertically (row-wise) or horizontally (column-wise). You don’t need to join on a key.

Program: 
dfA = pd.DataFrame({
    'month': ['Jan', 'Feb'],
    'sales': [100, 120]
})

dfB = pd.DataFrame({
    'month': ['Mar', 'Apr'],
    'sales': [90, 130]
})

# Concatenate rows
concat_rows = pd.concat([dfA, dfB], ignore_index=True)
print("Row-wise Concat:\n", concat_rows)

# Concatenate column-wise
concat_columns = pd.concat([dfA, dfB], axis=1)
print("\nColumn-wise Concat:\n", concat_columns)

Output:

 Row-wise Concat:
  month  sales
0   Jan    100
1   Feb    120
2   Mar     90
3   Apr    130

Column-wise Concat:
  month  sales month  sales
0   Jan    100   Mar     90
1   Feb    120   Apr    130

Reshaping Datasets

This is the process of converting datasets between wide and long formats to suit analysis or visualization needs.

1. Melting

It is the process of converting datasets from wide-form data (separate columns for each variable) into long-form, where each row is one observation.

Syntax:
pd.melt(dataframe, id_vars, value_vars, var_name, value_name)

Argument     Description
dataframe    Your original wide DataFrame
id_vars      Columns to keep fixed (e.g., 'Name')
value_vars   Columns to melt (e.g., 'Python', 'C', 'Java')
var_name     New name for the variable column (e.g., 'Subject')
value_name   New name for the value column (e.g., 'Score')
 
Program:

import pandas as pd

# Wide format
df = pd.DataFrame({
    'Name': ['Swathi', 'Surya', 'Swamy', 'Chinnu'],
    'Python': [90, 85, 88, 95],
    'C': [80, 75, 89, 92],
    'Java': [85, 78, 90, 88]
})

# Melt to long format
df_melted = pd.melt(df,
                    id_vars=['Name'],
                    value_vars=['Python', 'C', 'Java'],
                    var_name='Subject',
                    value_name='Score')

print(df_melted)

Output:

     Name Subject  Score
0  Swathi  Python     90
1   Surya  Python     85
2   Swamy  Python     88
3  Chinnu  Python     95
4  Swathi       C     80
5   Surya       C     75
6   Swamy       C     89
7  Chinnu       C     92
8  Swathi    Java     85
9   Surya    Java     78
10  Swamy    Java     90
11 Chinnu    Java     88

2. Pivoting

 Pivoting reshapes data from long format back to wide format. It groups and arranges data into a table layout with rows and columns.

Syntax:
dataframe.pivot_table(index, columns, values)

Argument   Description
index      Column to use as row labels (e.g., 'Name')
columns    Column to turn into new columns (e.g., 'Subject')
values     Column with values to fill in the table (e.g., 'Score')
 

 Program:

# Pivot back to wide format
df_pivoted = df_melted.pivot_table(index='Name',
                                   columns='Subject',
                                   values='Score').reset_index()

print(df_pivoted)

Output:

Subject    Name   C  Java  Python
0        Chinnu  92    88      95
1         Surya  75    78      85
2         Swamy  89    90      88
3        Swathi  80    85      90

3. Stacking

Stacking and unstacking reorganize a DataFrame by moving labels between the columns and the row index.
  • df.stack() turns columns into row levels (long format).
  • df.unstack() moves row index levels into columns (wide format).
Program:

# Create wide DataFrame and set index
df_month = pd.DataFrame({
    'City': ['Hyderabad', 'Bangalore'],
    'Jan': [100, 150],
    'Feb': [120, 160]
}).set_index('City')

print("Original Wide Format:")
print(df_month)

# Stack
stacked = df_month.stack()
print("\nStacked (Long Format):")
print(stacked)

# Unstack
unstacked = stacked.unstack()
print("\nUnstacked (Back to Wide):")
print(unstacked)

Output:

Original Wide Format:

           Jan  Feb
City               
Hyderabad  100  120
Bangalore  150  160

Stacked (Long Format):

City           
Hyderabad  Jan    100
           Feb    120
Bangalore  Jan    150
           Feb    160
dtype: int64

Unstacked (Back to Wide):

           Jan  Feb
City               
Hyderabad  100  120
Bangalore  150  160

Advantages
  • Easy to merge data from different sources → e.g., sales + customers.
  • Reshape data for visualization (pivot charts).
  • Helps discover insights (patterns, trends).
Disadvantages
  • Wrong merge keys → data loss.
  • Complex reshaping can confuse beginners.
  • Large datasets = high memory usage.
Applications
  • Business Intelligence: Merge sales + customer demographics.
  • Finance: Consolidate balance sheet + cash flow.
  • Healthcare: Combine patient data from multiple systems.
  • Market Research: Merge survey + consumer data.












