Data Cleaning and Preparation
Handling Missing Values
Missing values are entries in your dataset that have no recorded data. In Python and Pandas, missing values are typically represented as:
- NaN (Not a Number) — from NumPy
- None — a Python built-in null object
Example:
import pandas as pd
import numpy as np
data = {
"Name": ["Ramya", None, "Swathi"],
"Age": [24, 28, np.nan],
"City": ["Chennai", "Delhi", None]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City
0 Ramya 24.0 Chennai
1 None 28.0 Delhi
2 Swathi NaN None
Why Should We Handle or Remove Missing Values?
Missing data can significantly affect the quality and outcome of data analysis or machine learning models.
1. Accuracy and Validity
If missing values are ignored, calculations like mean, sum, correlation, etc., can give incorrect results (a short sketch after this list illustrates this).
2. Model Errors in Machine Learning
Most machine learning algorithms cannot work with missing values. They will throw errors unless the data is cleaned.
3. Bias in Analysis
If missing values are unevenly distributed across categories (e.g., mostly missing from one city or gender), it can lead to biased insights.
4. Data Integrity
Missing data can indicate issues in data collection, sensors, user input, or system bugs. It's important to understand the cause.
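To make the first point concrete, here is a minimal sketch (using a made-up Age column) showing how a missing value changes a simple average depending on whether it is skipped or silently counted:
import pandas as pd
import numpy as np
ages = pd.Series([24, 28, np.nan], name="Age")
# pandas skips NaN by default, so only the two known values are averaged
print(ages.mean())             # 26.0
# dividing by the full row count treats the missing entry as if it were data
print(ages.sum() / len(ages))  # 17.33..., which understates the true average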
Detecting missing values
Missing values often occur due to incomplete data collection, errors during data entry, or system-related issues. To work effectively with your data, the first step is to detect where these missing values occur.
| Function | Description |
|---|---|
| df.isnull() | Returns a DataFrame of the same shape with True where data is missing |
| df.notnull() | Returns True where data is not missing |
| df.isnull().sum() | Counts the number of missing (NaN) values in each column |
Example:
import pandas as pd
import numpy as np
# Sample dataset with some missing values
data = {
"Name": ["Ramya", "Swathi", "Lalitha", "Lucky", "Harish", None],
"Age": [23, np.nan, 27, 24, None, 29],
"City": ["Hyderabad", "Chennai", None, "Delhi", "Bangalore", None]
}
df = pd.DataFrame(data)
# Display the DataFrame
print("Original DataFrame:")
print(df)
# Check for missing values (True for NaN or None)
print("\nDetecting missing values with df.isnull():")
print(df.isnull())
# Count total missing values per column
print("\nMissing values per column:")
print(df.isnull().sum())
# Detect non-missing (valid) values
print("\nDetecting non-missing values with df.notnull():")
print(df.notnull())
Output:
Original DataFrame:
Name Age City
0 Ramya 23.0 Hyderabad
1 Swathi NaN Chennai
2 Lalitha 27.0 None
3 Lucky 24.0 Delhi
4 Harish NaN Bangalore
5 None 29.0 None
Detecting missing values with df.isnull():
Name Age City
0 False False False
1 False True False
2 False False True
3 False False False
4 False True False
5 True False True
Missing values per column:
Name 1
Age 2
City 2
dtype: int64
Detecting non-missing values with df.notnull():
Name Age City
0 True True True
1 True False True
2 True True False
3 True True True
4 True False True
5 False True False
Handling Missing Data Techniques
1. Deletion
Once missing values are detected, the next step is often to remove them, either rows or columns, depending on your analysis goals.
Why Remove Missing Values?
- When missing values are few and do not significantly affect the dataset.
- When you don't want to fill them with default or estimated values.
- To ensure compatibility with tools/models that don't accept NaNs.
Example:
import pandas as pd
import numpy as np
data = {
"Name": ["Ramya", "Swathi", "Lalitha", "Lucky", "Harish"],
"Age": [23, np.nan, 27, 24, None],
"City": ["Hyderabad", "Chennai", None, "Delhi", "Bangalore"]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Output:
Name Age City
0 Ramya 23.0 Hyderabad
1 Swathi NaN Chennai
2 Lalitha 27.0 None
3 Lucky 24.0 Delhi
4 Harish NaN Bangalore
1. Remove Rows with Any Missing Values
df_any = df.dropna(how='any')
print("\nRows with any missing values removed:")
print(df_any)
Removes a row if even one value is missing.
Rows 1, 2, and 4 contain at least one NaN → they are dropped.
Output:
Name Age City
0 Ramya 23.0 Hyderabad
3 Lucky 24.0 Delhi
2. Remove Rows with All Missing Values
Example:
import pandas as pd
import numpy as np
data = {
"Name": ["Ramya", "Swathi", "Lalitha", "Lucky", "Harish", None],
"Age": [23, np.nan, 27, 24, None, None],
"City": ["Hyderabad", "Chennai", None, "Delhi", "Bangalore", None]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
df_all = df.dropna(how='all')
print("\nRows with all missing values removed:")
print(df_all)
Removes a row only if all values are missing.
The row at index 5 has all values missing → it is dropped.
Output:
Name Age City
0 Ramya 23.0 Hyderabad
1 Swathi NaN Chennai
2 Lalitha 27.0 None
3 Lucky 24.0 Delhi
4 Harish NaN Bangalore
3. Drop Columns with Any Missing Values
Example:
import pandas as pd
import numpy as np
data = {
"Name": ["Ramya", "Swathi", "Lalitha", "Lucky", "Harish"],
"Age": [23, np.nan, 27, 24, None],
"City": ["Hyderabad", "Chennai", None, "Delhi", "Bangalore"]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
df_col_any = df.dropna(axis=1, how='any')
print("\nColumns with any missing values removed:")
print(df_col_any)
Removes a column if any row contains a missing value.
Both Age and City have missing values → they are dropped. Only Name remains.
Output:
Name
0 Ramya
1 Swathi
2 Lalitha
3 Lucky
4 Harish
4. Drop Columns with All Missing Values
df_col_all = df.dropna(axis=1, how='all')
print("\nColumns with all missing values removed:")
print(df_col_all)
Removes a column only if all values in it are missing.
No column has all missing values → nothing is dropped.
Output:
Name Age City
0 Ramya 23.0 Hyderabad
1 Swathi NaN Chennai
2 Lalitha 27.0 None
3 Lucky 24.0 Delhi
4 Harish NaN Bangalore
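Besides how='any' and how='all', dropna() also accepts a thresh argument (the minimum number of non-missing values a row must have to be kept) and a subset argument (only consider certain columns). A small sketch, reusing the DataFrame above:
# Keep rows that have at least 2 non-missing values (all rows qualify here)
df_thresh = df.dropna(thresh=2)
print(df_thresh)
# Drop a row only when 'Age' is missing, ignoring the other columns
df_subset = df.dropna(subset=['Age'])
print(df_subset)  # rows 1 and 4 (missing Age) are removed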
2. Imputation
Imputation is the process of replacing missing data with substitute values so that the dataset remains usable for analysis or modeling.
Why Use Imputation?
- Maintains the size of the dataset (no loss of rows)
- Avoids errors during model training (many ML models can't handle NaN).
- Helps preserve important patterns in the data.
- Better than dropping data, especially when the missing values are few.
1. Mean Imputation (For numeric columns)
Replace missing values with the mean of the column.
Use when data is numeric and not heavily skewed (no outliers).
Example:
import pandas as pd
import numpy as np
data = {
"Name": ["Ramya", "Swathi", "Lalitha", "Lucky", "Harish"],
"Age": [23, np.nan, 27, 24, None],
"City": ["Hyderabad", "Chennai", None, "Delhi", "Bangalore"]
}
df = pd.DataFrame(data)
print("Original Data:")
print(df)
df['Age'] = df['Age'].fillna(df['Age'].mean())
print("Data after mean Imputation:")
print(df)
Output:
Original Data:
Name Age City
0 Ramya 23.0 Hyderabad
1 Swathi NaN Chennai
2 Lalitha 27.0 None
3 Lucky 24.0 Delhi
4 Harish NaN Bangalore
Data after mean Imputation:
Name Age City
0 Ramya 23.000000 Hyderabad
1 Swathi 24.666667 Chennai
2 Lalitha 27.000000 None
3 Lucky 24.000000 Delhi
4 Harish 24.666667 Bangalore
2. Mode Imputation (for categorical columns)
Fill missing string values with the most frequent value (mode).
Use for categorical columns (City, Name) when you want to fill with the most common value.
df['City'] = df['City'].fillna(df['City'].mode()[0])
print(df)
Output:
Name Age City
0 Ramya 23.000000 Hyderabad
1 Swathi 24.666667 Chennai
2 Lalitha 27.000000 Chennai
3 Lucky 24.000000 Delhi
4 Harish 24.666667 Bangalore
3. Median Imputation
Replace missing values with the median of that column.
Use when numeric data has outliers (median is more robust).
# Reset original data
df = pd.DataFrame(data)
df['Age'] = df['Age'].fillna(df['Age'].median())
print(df)
Output:
Name Age City
0 Ramya 23.0 Hyderabad
1 Swathi 24.0 Chennai
2 Lalitha 27.0 None
3 Lucky 24.0 Delhi
4 Harish 24.0 Bangalore
4. Constant Value Imputation
You can fill missing values with fixed placeholders like "Unknown" or 0.
Use when you want to mark missing values clearly.
# Fill missing values in Age and City
df = pd.DataFrame(data)
df.fillna({"Age": 0, "City": "Unknown"}, inplace=True)
print(df)
Output:
Name Age City
0 Ramya 23.0 Hyderabad
1 Swathi 0.0 Chennai
2 Lalitha 27.0 Unknown
3 Lucky 24.0 Delhi
4 Harish 0.0 Bangalore
5. Forward Fill
You can fill missing values from the previous row.
Use for time series or where previous values can be logically carried forward.
df = pd.DataFrame(data)
df = df.ffill()  # forward fill; fillna(method='ffill') is deprecated in newer pandas
print(df)
Output:
Name Age City
0 Ramya 23.0 Hyderabad
1 Swathi 23.0 Chennai
2 Lalitha 27.0 Chennai
3 Lucky 24.0 Delhi
4 Harish 24.0 Bangalore
6. Backward Fill
You can fill in the missing values from the next row.
df = pd.DataFrame(data)
df = df.bfill()  # backward fill; fillna(method='bfill') is deprecated in newer pandas
print(df)
Output:
Name Age City
0 Ramya 23.0 Hyderabad
1 Swathi 27.0 Chennai
2 Lalitha 27.0 Delhi
3 Lucky 24.0 Delhi
4 Harish NaN Bangalore
7. Interpolation
Interpolation is a technique used to estimate missing values by using the values before and after the missing one. Instead of filling NaN with static values (like mean or mode), interpolation intelligently estimates what the missing value should be based on existing data.
import pandas as pd
import numpy as np
data = {
"Name": ["Ramya", "Swathi", "Lalitha", "Lucky", "Harish"],
"Age": [23, np.nan, 27, 24, None],
"City": ["Hyderabad", "Chennai", None, "Delhi", "Bangalore"]
}
df = pd.DataFrame(data)
print("Original Data:")
print(df)
# Interpolate missing Age values
df['Age'] = df['Age'].interpolate(method='linear')
print("After Interpolation:")
print(df)
Output:
Original Data:
Name Age City
0 Ramya 23.0 Hyderabad
1 Swathi NaN Chennai
2 Lalitha 27.0 None
3 Lucky 24.0 Delhi
4 Harish NaN Bangalore
After Interpolation:
Name Age City
0 Ramya 23.0 Hyderabad
1 Swathi 25.0 Chennai
2 Lalitha 27.0 None
3 Lucky 24.0 Delhi
4 Harish 24.0 Bangalore
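The fillna-based techniques above can also be expressed with scikit-learn's SimpleImputer, which is convenient when the same fill has to be reapplied to new data later (for example, a test set). A minimal sketch, assuming scikit-learn is installed:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
df = pd.DataFrame({"Age": [23, np.nan, 27, 24, None]})
# strategy can be 'mean', 'median', 'most_frequent', or 'constant'
imputer = SimpleImputer(strategy='mean')
df['Age'] = imputer.fit_transform(df[['Age']]).ravel()
print(df)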
Outliers
An outlier is a data point that is very different (too high or too low) compared to the rest of the data.
Example:
Salaries = [30k, 35k, 32k, 34k, 5 crore]
Here, 5 crore is an outlier.
Outliers can disturb averages, scaling, and machine learning models.
So sometimes we remove or treat them.
Outlier Detection techniques:
1. IQR Method (Interquartile Range)
Based on percentiles:
- Q1 = 25th percentile
- Q3 = 75th percentile
- IQR = Q3 − Q1
- Outlier limits:
  - Lower = Q1 − 1.5 × IQR
  - Upper = Q3 + 1.5 × IQR
- IQR is more robust for skewed or non-normal data.
Example:
import pandas as pd
# Step 1: Create the DataFrame
data = {'Salary': [30000, 32000, 35000, 34000, 37000, 36000, 39000, 41000, 43000, 5000000]}
df = pd.DataFrame(data)
# Step 2: Calculate Q1, Q3, and IQR
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
# Step 3: Define outlier limits
lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR
# Step 4: Filter out outliers
df_iqr = df[(df['Salary'] >= lower_limit) & (df['Salary'] <= upper_limit)]
print("Cleaned Data using IQR:")
print(df_iqr)
Output:
Cleaned Data using IQR:
Salary
0 30000
1 32000
2 35000
3 34000
4 37000
5 36000
6 39000
7 41000
8 43000
Step-by-Step Manual Calculation
1. Sort the data in ascending order:
[30000, 32000, 34000, 35000, 36000, 37000, 39000, 41000, 43000, 5000000]
2. There are 10 values (an even number), so:
- Lower half (first 5 values): [30000, 32000, 34000, 35000, 36000] → Q1 = middle value = 34000
- Upper half (last 5 values): [37000, 39000, 41000, 43000, 5000000] → Q3 = middle value = 41000
3. IQR (Interquartile Range) = Q3 − Q1 → IQR = 41000 − 34000 = 7000
4. Outlier limits:
- Lower limit = Q1 − 1.5 × IQR → 34000 − 1.5 × 7000 = 34000 − 10500 = 23500
- Upper limit = Q3 + 1.5 × IQR → 41000 + 1.5 × 7000 = 41000 + 10500 = 51500
Conclusion:
- Any value below 23500 or above 51500 is an outlier.
- In this case, 5000000 is an outlier; all other values are within range.
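Dropping rows is not the only option: the same limits can be used to cap (winsorize) the extreme value instead, which keeps the row but pulls the value back to the boundary. A small sketch, reusing lower_limit and upper_limit from the code above:
# Cap values outside the IQR limits instead of removing rows
df['Salary_Capped'] = df['Salary'].clip(lower=lower_limit, upper=upper_limit)
print(df)
# The 5,000,000 salary becomes 51,500 (the upper limit); all other salaries are unchanged.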
2. Z-score
Z-score measures how many standard deviations a value is from the mean.
Z-score works better for normally distributed data.
Formula: z = (x − μ) / σ, where x is the value, μ is the mean, and σ is the standard deviation.
Values with z > 3 or z < −3 are usually considered outliers (the threshold can be adjusted depending on the use case).
Example:
from scipy import stats
import numpy as np
# Step 1: Calculate Z-scores
z_scores = np.abs(stats.zscore(df['Salary']))
# Step 2: Filter based on a threshold
# Note: with only 10 values, the largest possible z-score is sqrt(n - 1) = 3,
# so the extreme salary scores just below 3 and a strict "< 3" filter would
# keep it. A slightly lower threshold is used here so it is actually dropped.
df_z_clean = df[z_scores < 2.5]
print("Cleaned Data using Z-Score:")
print(df_z_clean)
Output:
Cleaned Data using Z-Score:
Salary
0 30000
1 32000
2 35000
3 34000
4 37000
5 36000
6 39000
7 41000
8 43000
Calculation:
Mean ≈ 532,700
Standard deviation ≈ 1,489,000 (population standard deviation, which scipy.stats.zscore uses by default)
Z-score of 5,000,000 ≈ 3.0 (the maximum possible for a sample of 10), while every other salary has |z| ≈ 0.33
Outlier detected: 5000000
3. Visualization (Boxplot)
Visualizations are a quick way to see outliers.
Example:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Define salary data
Salaries = [30000, 32000, 35000, 34000, 37000, 36000, 39000, 41000, 43000, 5000000]
# Create DataFrame with column name 'Salary'
df = pd.DataFrame({'Salary': Salaries})
# Plot boxplot
plt.boxplot(df['Salary'])
plt.title("Salary Distribution (with Outlier)")
plt.ylabel("Salary")
plt.show()
Output:
The box spans from Q1 (34000) to Q3 (41000).
The line inside the box is the median (around 36500).
The dot above the box represents the outlier 5000000.
Data Transformation
Data transformation refers to the process of converting raw data into a suitable format that enables a computer (such as a machine learning model, database, or analysis tool) to understand, compare, and utilize it effectively.
⭐ Raw data often has problems:
- Different scales (e.g., marks out of 100 vs age in years).

| Student | Math (out of 100) | Height (cm) |
|---|---|---|
| A | 90 | 150 |
| B | 80 | 170 |
| C | 70 | 160 |

If you directly give this table to a computer (machine learning model):
Math is between 70–90. Height is between 150–170.
So the computer thinks height is more important (bigger numbers) than math marks, even though we care about both equally. So, without scaling, features with large values dominate.
- Different formats (e.g., text “A/B/C” vs numbers).
Computers don’t understand “A, B, C” — they need numbers.
- Missing values.
- Outliers (extreme values).
Data Transformation Techniques
1. Scaling / Normalization / Standardization
Scaling is used to bring all numerical values to a common range, especially when features have different units (like marks and age).
Types:
- Normalization (Min-Max Scaling): scales values between 0 and 1

Formula for Normalization:
x_scaled = (x − min) / (max − min)
Where: x = each value in the column, min = smallest value, max = largest value.
👉 This transformation scales all values between 0 and 1.
For example, numbers = [10, 20, 30, 40, 50], min = 10, max = 50.
Now apply the formula:
| Value | Calculation | Result |
|---|---|---|
| 10 | (10 − 10) / (50 − 10) = 0 / 40 | 0.00 |
| 20 | (20 − 10) / 40 = 10 / 40 | 0.25 |
| 30 | (30 − 10) / 40 = 20 / 40 | 0.50 |
| 40 | (40 − 10) / 40 = 30 / 40 | 0.75 |
| 50 | (50 − 10) / 40 = 40 / 40 | 1.00 |
All values are now between 0 and 1.
This is useful for algorithms like KNN, SVM, and Neural Networks that are sensitive to scale.
✅ Use Min–Max Scaling when you know the min and max are fixed and outliers are not a big issue.
Program:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# Sample data
df = pd.DataFrame({'Marks': [40, 60, 80, 100],
                   'Age': [20, 25, 30, 35]})
# Apply Min-Max Scaling
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)
df_scaled = pd.DataFrame(scaled, columns=['Marks', 'Age'])
print("Scaled Data:")
print(df_scaled)
Output:
Scaled Data:
      Marks       Age
0  0.000000  0.000000
1  0.333333  0.333333
2  0.666667  0.666667
3  1.000000  1.000000
- Standardization (Z-score): transforms values to have mean = 0 and std = 1

This makes data centered and scaled, which is especially useful for models that assume a normal distribution (e.g., logistic regression, SVM, etc.).
Formula for Standardization:
z = (x − μ) / σ
Where: x = each value, μ = mean of the data, σ = standard deviation of the data.
For example, numbers = [10, 20, 30, 40, 50], mean = 30, σ ≈ 14.14.
Now apply the formula:
| Original | Calculation | Z-score (Standardized) |
|---|---|---|
| 10 | (10 − 30) / 14.14 = −20 / 14.14 | −1.41 |
| 20 | (20 − 30) / 14.14 = −10 / 14.14 | −0.71 |
| 30 | (30 − 30) / 14.14 = 0 | 0.00 |
| 40 | (40 − 30) / 14.14 = 10 / 14.14 | 0.71 |
| 50 | (50 − 30) / 14.14 = 20 / 14.14 | 1.41 |
Z-score scaling centers the data and removes scale-related bias.
Program:
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Step 1: Create the DataFrame
data = {'Value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
# Step 2: Apply Standardization (Z-score)
scaler = StandardScaler()
df['Standardized'] = scaler.fit_transform(df[['Value']])
# Step 3: Print Result
print("Standardized Data:")
print(df)
Output:
Standardized Data:
   Value  Standardized
0     10     -1.414214
1     20     -0.707107
2     30      0.000000
3     40      0.707107
4     50      1.414214
2. Encoding
Machine learning models need numeric data, so text (categorical) values must be encoded.
⦿ Label Encoding
Label Encoding converts categorical text values into numeric labels (integers). Each unique category is assigned a unique integer.
Example:
Input: ['Red', 'Green', 'Blue']
Encoded: [2, 1, 0] (alphabetical order: Blue = 0, Green = 1, Red = 2)
Program:
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({'Grade': ['A', 'B', 'A', 'C']})
encoder = LabelEncoder()
df['Grade_Encoded'] = encoder.fit_transform(df['Grade'])
print(df)
Output:
  Grade  Grade_Encoded
0     A              0
1     B              1
2     A              0
3     C              2
⦿ One-Hot Encoding
One-Hot Encoding converts each category into a binary vector (0s and 1s). Each unique category becomes its own column, with 1 indicating presence.
Example:
Input: ['Red', 'Green', 'Blue']
Encoded:
   Red  Green  Blue
0    1      0     0
1    0      1     0
2    0      0     1
Program:
df = pd.DataFrame({'Grade': ['A', 'B', 'A', 'C']})
df_one_hot = pd.get_dummies(df, columns=['Grade'])
print(df_one_hot)
Output:
   Grade_A  Grade_B  Grade_C
0        1        0        0
1        0        1        0
2        1        0        0
3        0        0        1
3. Handling Missing Values
Missing values (NaN) can break models. They should be filled (imputed) using strategies like mean, median, mode, or interpolation.
4. Removing Outliers
Outliers are extreme values that can distort the model. Common methods to remove them are the IQR method, Z-score method, Visualization.
5. Feature Engineering
Feature Engineering means creating new features from existing data that might help the model.
Program:
df = pd.DataFrame({'Join_Date': pd.to_datetime(['2020-01-15', '2021-03-22', '2022-07-01'])})
df['Year'] = df['Join_Date'].dt.year
df['Month'] = df['Join_Date'].dt.month
df['Day'] = df['Join_Date'].dt.day
print(df)
Output:
   Join_Date  Year  Month  Day
0 2020-01-15  2020      1   15
1 2021-03-22  2021      3   22
2 2022-07-01  2022      7    1
String Manipulation
String manipulation is a key part of data cleaning when working with textual or categorical data. It helps fix inconsistencies, remove noise, and prepare text for analysis or modeling.
⭐ Common Use Cases:
- Removing extra spaces
- Converting to lowercase or uppercase
- Replacing or removing unwanted characters
- Extracting substrings
- Splitting strings (e.g., "First Last" → ["First", "Last"]); a short sketch appears after the program below
- Detecting or replacing missing/invalid values
- Standardizing formats (e.g., emails, dates)
Indexing:
It means accessing a specific character in a string using its position. Python uses 0-based indexing (the first character is at position 0).
name = "Alice"
print(name[0])   # Output: 'A'
print(name[-1])  # Output: 'e' (last character)
Slicing:
Slicing means extracting a substring from a string using a range of indexes.
name = "Charlie"
print(name[0:3])  # Output: 'Cha' (from index 0 to 2)
print(name[2:])   # Output: 'arlie' (from index 2 to end)
⭐ Common String Methods (the ones used in the program below):

| Method | Description |
|---|---|
| .str.strip() | Removes leading and trailing spaces |
| .str.lower() / .str.upper() | Converts to lowercase / uppercase |
| .str.capitalize() | Capitalizes the first letter |
| .str.replace() | Replaces characters or substrings |
| .str.extract() | Extracts a pattern using a regular expression |
| .str[ ] | Indexes or slices the characters of each string |
Program:
import pandas as pd
# Raw Data
data = {
'Name': [' sw@thi '],
'Email': ['SwaTHI@GMAIL.Com']
}
df = pd.DataFrame(data)
print("🔹 Step 0: Original Data")
print(df, end="\n\n")
# Step 1: Remove spaces from Name using .str.strip()
df['Name_Step1'] = df['Name'].str.strip()
print("🔹 Step 1: After .str.strip() on Name")
print(df[['Name', 'Name_Step1']], end="\n\n")
# Step 2: Remove '@' from Name using .str.replace()
df['Name_Step2'] = df['Name_Step1'].str.replace('@', '', regex=False)
print("🔹 Step 2: After removing '@' from Name")
print(df[['Name_Step1', 'Name_Step2']], end="\n\n")
# Step 3: Capitalize Name using .str.capitalize()
df['Name_Clean'] = df['Name_Step2'].str.capitalize()
print("🔹 Step 3: Final Cleaned Name")
print(df[['Name_Step2', 'Name_Clean']], end="\n\n")
# Step 4: Lowercase Email using .str.lower()
df['Email_Step1'] = df['Email'].str.lower()
print("🔹 Step 4: Lowercase Email")
print(df[['Email', 'Email_Step1']], end="\n\n")
# Step 5: Replace 'gmail' with 'google' in Email
df['Email_Clean'] = df['Email_Step1'].str.replace('gmail', 'google', regex=False)
print("🔹 Step 5: After replacing 'gmail' with 'google'")
print(df[['Email_Step1', 'Email_Clean']], end="\n\n")
# Step 6: Extract domain using regex
df['Email_Domain'] = df['Email_Clean'].str.extract(r'@(.+)$')
print("🔹 Step 6: Extracted Domain from Email")
print(df[['Email_Clean', 'Email_Domain']], end="\n\n")
# Step 7: Indexing - First letter of name
df['First_Letter'] = df['Name_Clean'].str[0]
print("🔹 Step 7: First Letter of Cleaned Name")
print(df[['Name_Clean', 'First_Letter']], end="\n\n")
# Step 8: Slicing - First 3 letters of name
df['Name_Short'] = df['Name_Clean'].str[:3]
print("🔹 Step 8: First 3 Letters of Cleaned Name")
print(df[['Name_Clean', 'Name_Short']], end="\n\n")
# Step 9: Slicing - Last 3 letters of cleaned email
df['Email_End'] = df['Email_Clean'].str[-3:]
print("🔹 Step 9: Last 3 Letters of Cleaned Email")
print(df[['Email_Clean', 'Email_End']], end="\n\n")
Output:
🔹 Step 0: Original Data
Name Email
0 sw@thi SwaTHI@GMAIL.Com
🔹 Step 1: After .str.strip() on Name
Name Name_Step1
0 sw@thi sw@thi
🔹 Step 2: After removing '@' from Name
Name_Step1 Name_Step2
0 sw@thi swthi
🔹 Step 3: Final Cleaned Name
Name_Step2 Name_Clean
0 swthi Swthi
🔹 Step 4: Lowercase Email
Email Email_Step1
0 SwaTHI@GMAIL.Com swathi@gmail.com
🔹 Step 5: After replacing 'gmail' with 'google'
Email_Step1 Email_Clean
0 swathi@gmail.com swathi@google.com
🔹 Step 6: Extracted Domain from Email
Email_Clean Email_Domain
0 swathi@google.com google.com
🔹 Step 7: First Letter of Cleaned Name
Name_Clean First_Letter
0 Swthi S
🔹 Step 8: First 3 Letters of Cleaned Name
Name_Clean Name_Short
0 Swthi Swt
🔹 Step 9: Last 3 Letters of Cleaned Email
Email_Clean Email_End
0 swathi@google.com com
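One use case from the earlier list that the program does not cover is splitting strings. A minimal sketch with a hypothetical Full_Name column:
import pandas as pd
df_names = pd.DataFrame({'Full_Name': ['Ramya Rao', 'Swathi Kumar']})
# Split "First Last" into two new columns
df_names[['First', 'Last']] = df_names['Full_Name'].str.split(' ', n=1, expand=True)
print(df_names)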
Data Wrangling: Join
A join combines rows from two tables based on a common key column (like CustomerID, key, ProductCode, etc.).
Program:
import pandas as pd
# Left table
df1 = pd.DataFrame({
'key': ['A', 'B', 'C', 'D'],
'value1': [1, 2, 3, 4]
})
# Right table
df2 = pd.DataFrame({
'key': ['B', 'C', 'D', 'E'],
'value2': [5, 6, 7, 8]
})
print("df1:")
print(df1)
print("\ndf2:")
print(df2)
Output:
df1:
key value1
0 A 1
1 B 2
2 C 3
3 D 4
df2:
key value2
0 B 5
1 C 6
2 D 7
3 E 8
Inner Join
It keeps only the rows where the key exists in both DataFrames.
inner_join = pd.merge(df1, df2, on='key', how='inner')
print(inner_join)
Output:
key value1 value2
0 B 2 5
1 C 3 6
2 D 4 7
Outer Join
It keeps all rows from both DataFrames, fills missing with NaN.
outer_join = pd.merge(df1, df2, on='key', how='outer')
print(outer_join)
Output:
key value1 value2
0 A 1.0 NaN
1 B 2.0 5.0
2 C 3.0 6.0
3 D 4.0 7.0
4 E NaN 8.0
Left Join
Keeps all rows from df1 (left), adds matching data from df2.
left_join = pd.merge(df1, df2, on='key', how='left')
print(left_join)
Output:
key value1 value2
0 A 1 NaN
1 B 2 5.0
2 C 3 6.0
3 D 4 7.0
Right Join
Keeps all rows from df2 (right), adds matching data from df1.
right_join = pd.merge(df1, df2, on='key', how='right')
print(right_join)
Output:
key value1 value2
0 B 2.0 5
1 C 3.0 6
2 D 4.0 7
3 E NaN 8
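If the key columns have different names in the two tables, merge() can still join them using left_on and right_on instead of on. A small sketch with made-up customer and order tables:
import pandas as pd
customers = pd.DataFrame({'cust_id': ['C1', 'C2'], 'name': ['Ramya', 'Swathi']})
orders = pd.DataFrame({'customer': ['C1', 'C1', 'C2'], 'amount': [250, 100, 400]})
# Join on columns with different names; both key columns remain in the result
merged = pd.merge(customers, orders, left_on='cust_id', right_on='customer', how='inner')
print(merged)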
Merging with Multiple Keys
Merging with multiple keys means combining DataFrames on two or more columns that together define the relationship between rows in both tables.
Program:
import pandas as pd
# Left DataFrame
df1 = pd.DataFrame({
'key1': ['A', 'B', 'C', 'C'],
'key2': [1, 2, 3, 4],
'value1': ['apple', 'banana', 'cherry', 'date']
})
# Right DataFrame
df2 = pd.DataFrame({
'key1': ['A', 'B', 'C', 'C'],
'key2': [1, 2, 3, 5],
'value2': ['red', 'yellow', 'dark red', 'brown']
})
print("df1:")
print(df1)
print("\ndf2:")
print(df2)
merged = pd.merge(df1, df2, on=['key1', 'key2'], how='inner')
print("\nMerged on key1 and key2:")
print(merged)
Output:
df1:
key1 key2 value1
0 A 1 apple
1 B 2 banana
2 C 3 cherry
3 C 4 date
df2:
key1 key2 value2
0 A 1 red
1 B 2 yellow
2 C 3 dark red
3 C 5 brown
Merged on key1 and key2:
key1 key2 value1 value2
0 A 1 apple red
1 B 2 banana yellow
2 C 3 cherry dark red
Merging Multiple Data Formats
When you merge a CSV and an Excel file, you're essentially combining them into a single DataFrame using a shared column, just like joining tables in SQL.
Suppose:
- data.csv contains personal details (name, age)
- data.xlsx contains salary info (city, salary)
You can merge them using a common column like key:
Program:
import pandas as pd
csv_df = pd.read_csv('data.csv') # Load CSV file
excel_df = pd.read_excel('data.xlsx') # Load Excel file
# Join on common column
merged_df = pd.merge(csv_df, excel_df, on='key', how='inner')
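The snippet above assumes data.csv and data.xlsx already exist on disk. A self-contained sketch (with made-up contents; writing and reading .xlsx files also requires the openpyxl package) could look like this:
import pandas as pd
# Create the two sample files first (hypothetical contents)
pd.DataFrame({'key': ['A', 'B', 'C'],
              'name': ['Ramya', 'Swathi', 'Lucky'],
              'age': [24, 28, 23]}).to_csv('data.csv', index=False)
pd.DataFrame({'key': ['A', 'B', 'D'],
              'city': ['Hyderabad', 'Chennai', 'Delhi'],
              'salary': [50000, 60000, 55000]}).to_excel('data.xlsx', index=False)
csv_df = pd.read_csv('data.csv')       # Load CSV file
excel_df = pd.read_excel('data.xlsx')  # Load Excel file
# Only keys A and B appear in both files, so the inner join keeps those two rows
merged_df = pd.merge(csv_df, excel_df, on='key', how='inner')
print(merged_df)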
Hierarchical Indexing
A MultiIndex (hierarchical index) in pandas allows you to have multiple levels of row or column labels in a single DataFrame.
This is useful when dealing with multi-dimensional data like (Product, Month), (Region, Year), etc., while still using a 2D table.
Example:
import pandas as pd
import numpy as np
# Define the row MultiIndex
index = pd.MultiIndex.from_tuples([
('Product A', 'January'),
('Product A', 'February'),
('Product B', 'January'),
('Product B', 'February'),
], names=['Product', 'Month'])
# Define the column MultiIndex
columns = pd.MultiIndex.from_tuples([
('East', 'Sales'),
('East', 'Profit'),
('West', 'Sales'),
('West', 'Profit'),
], names=['Region', 'Metric'])
# Create the DataFrame
data = [[100, 20, 120, 25],
[110, 22, 130, 27],
[90, 18, 100, 20],
[95, 19, 105, 22]]
df = pd.DataFrame(data, index=index, columns=columns)
print(df)
Output:
Region East West
Metric Sales Profit Sales Profit
Product Month
Product A January 100 20 120 25
February 110 22 130 27
Product B January 90 18 100 20
February 95 19 105 22
◘ Inspecting MultiIndex
print(df.index.names) # ['Product', 'Month']
print(df.columns.names) # ['Region', 'Metric']
print(df.index.levels) # Index levels for rows
print(df.columns.levels) # Index levels for columns
1️⃣ Select all rows for Product A
print(df.loc['Product A'])
Output:
Region East West
Metric Sales Profit Sales Profit
Month
January 100 20 120 25
February 110 22 130 27
2️⃣ Select column ('West', 'Profit')
print(df[('West', 'Profit')])
Output:
Product Month Profit
Product A January 25
February 27
Product B January 20
February 22
Name: (West, Profit), dtype: int64
3️⃣ Cross-section for Month = 'January'
print(df.xs('January', level='Month'))
Output:
Region East West
Metric Sales Profit Sales Profit
Product
Product A 100 20 120 25
Product B 90 18 100 20
4️⃣ Cross-section on columns for Metric = 'Sales'
print(df.xs('Sales', level='Metric', axis=1))
Output:
Region East West
Product Month
Product A January 100 120
February 110 130
Product B January 90 100
February 95 105
Grouping and Aggregation by Level
print(df.groupby(level='Product').sum())
Output:
Region East West
Metric Sales Profit Sales Profit
Product
Product A 210 42 250 52
Product B 185 37 205 42
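The same level-based grouping works on the Month level as well, for example to compare January and February across both products. A short sketch, continuing with the df defined above:
# Total per month across both products
print(df.groupby(level='Month').sum())
# Average sales/profit per month instead of totals
print(df.groupby(level='Month').mean())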
Combining and Merging Datasets
We can combine datasets in 2 ways:
1. Merging
merge() is used to join two DataFrames based on a common key column (like SQL joins). It allows combining datasets horizontally (side by side), matching rows using one or more columns.
Program:
import pandas as pd
# Left table
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                    'value1': [1, 2, 3, 4]})
# Right table
df2 = pd.DataFrame({'key': ['B', 'C', 'D', 'E'],
                    'value2': [5, 6, 7, 8]})
print("df1:")
print(df1)
print("\ndf2:")
print(df2)
inner_join = pd.merge(df1, df2, on='key', how='inner')
print(inner_join)
Output:
df1:
  key  value1
0   A       1
1   B       2
2   C       3
3   D       4

df2:
  key  value2
0   B       5
1   C       6
2   D       7
3   E       8

  key  value1  value2
0   B       2       5
1   C       3       6
2   D       4       7
2. Concatenation
concat() is used to stack multiple DataFrames either vertically (row-wise) or horizontally (column-wise). You don't need to join on a key.
Program:
dfA = pd.DataFrame({'month': ['Jan', 'Feb'],
                    'sales': [100, 120]})
dfB = pd.DataFrame({'month': ['Mar', 'Apr'],
                    'sales': [90, 130]})
# Concatenate rows
concat_rows = pd.concat([dfA, dfB], ignore_index=True)
print("Row-wise Concat:\n", concat_rows)
# Concatenate column-wise
concat_columns = pd.concat([dfA, dfB], axis=1)
print("\nColumn-wise Concat:\n", concat_columns)
Output:
Row-wise Concat:
  month  sales
0   Jan    100
1   Feb    120
2   Mar     90
3   Apr    130

Column-wise Concat:
  month  sales month  sales
0   Jan    100   Mar     90
1   Feb    120   Apr    130
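concat() can also tag each block with a label through its keys argument, which produces a hierarchical index like the one described earlier, so rows stay traceable to their source. A small sketch reusing dfA and dfB (the labels are arbitrary):
# Label each source DataFrame so its rows can be identified after stacking
concat_keys = pd.concat([dfA, dfB], keys=['dfA', 'dfB'])
print(concat_keys)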
Reshaping Datasets
This is the process of converting datasets between wide and long formats to suit analysis or visualization needs.
1. Melting
It is the process of converting datasets from wide-form data (separate columns for each variable) into long-form, where each row is one observation.
Syntax:
pd.melt(dataframe, id_vars, value_vars, var_name, value_name)
| Argument | Description |
|---|---|
| dataframe | Your original wide DataFrame |
| id_vars | Columns to keep fixed (e.g., 'Name') |
| value_vars | Columns to melt (e.g., 'Python', 'C', 'Java') |
| var_name | New name for the variable column (e.g., 'Subject') |
| value_name | New name for the value column (e.g., 'Score') |
Program:
import pandas as pd
# Wide format
df = pd.DataFrame({'Name': ['Swathi', 'Surya', 'Swamy', 'Chinnu'],
                   'Python': [90, 85, 88, 95],
                   'C': [80, 75, 89, 92],
                   'Java': [85, 78, 90, 88]})
# Melt to long format
df_melted = pd.melt(df,
                    id_vars=['Name'],
                    value_vars=['Python', 'C', 'Java'],
                    var_name='Subject',
                    value_name='Score')
print(df_melted)
Output:
      Name Subject  Score
0   Swathi  Python     90
1    Surya  Python     85
2    Swamy  Python     88
3   Chinnu  Python     95
4   Swathi       C     80
5    Surya       C     75
6    Swamy       C     89
7   Chinnu       C     92
8   Swathi    Java     85
9    Surya    Java     78
10   Swamy    Java     90
11  Chinnu    Java     88
2. Pivoting
Pivoting reshapes data from long format back to wide format. It groups and arranges data into a table layout with rows and columns.
Syntax:
dataframe.pivot_table(index, columns, values)
| Argument | Description |
|---|---|
| index | Column to use as row labels (e.g., 'Name') |
| columns | Column to turn into new columns (e.g., 'Subject') |
| values | Column with values to fill in the table (e.g., 'Score') |
Program:
# Pivot back to wide format
df_pivoted = df_melted.pivot_table(index='Name',
columns='Subject',
values='Score').reset_index()
print(df_pivoted)
Output:
Subject Name C Java Python
0 Chinnu 92 88 95
1 Surya 75 78 85
2 Swamy 89 90 88
3 Swathi 80 85 90
3. Stacking
It is the process of reorganising datasets to reshape the structure.
df.stack() turns columns into row levels (long format). df.unstack() moves row index levels into columns (wide format).
Program:
# Create wide DataFrame and set index
df_month = pd.DataFrame({'City': ['Hyderabad', 'Bangalore'],
                         'Jan': [100, 150],
                         'Feb': [120, 160]}).set_index('City')
print("Original Wide Format:")
print(df_month)
# Stack
stacked = df_month.stack()
print("\nStacked (Long Format):")
print(stacked)
# Unstack
unstacked = stacked.unstack()
print("\nUnstacked (Back to Wide):")
print(unstacked)
Output:
Original Wide Format:
           Jan  Feb
City
Hyderabad  100  120
Bangalore  150  160

Stacked (Long Format):
City
Hyderabad  Jan    100
           Feb    120
Bangalore  Jan    150
           Feb    160
dtype: int64

Unstacked (Back to Wide):
           Jan  Feb
City
Hyderabad  100  120
Bangalore  150  160
Advantages
- Easy to merge data from different sources → e.g., sales + customers.
- Reshape data for visualization (pivot charts).
- Helps discover insights (patterns, trends).
Disadvantages
- Wrong merge keys → data loss.
- Complex reshaping can confuse beginners.
- Large datasets = high memory usage.
Applications
- Business Intelligence: Merge sales + customer demographics.
- Finance: Consolidate balance sheet + cash flow.
- Healthcare: Combine patient data from multiple systems.
- Market Research: Merge survey + consumer data.