Data Cleaning and Preparation
Handling Missing Values
Missing values are entries in your dataset that have no recorded data. In Python and Pandas, missing values are typically represented as:
- NaN (Not a Number) — from NumPy
- None — a Python built-in null object
Example:
import pandas as pd
import numpy as np
data = {
"Name": ["Ramya", None, "Swathi"],
"Age": [24, 28, np.nan],
"City": ["Chennai", "Delhi", None]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City
0 Ramya 24.0 Chennai
1 None 28.0 Delhi
2 Swathi NaN None
Why Should We Handle or Remove Missing Values?
Missing data can significantly affect the quality and outcome of data analysis or machine learning models.
1. Accuracy and Validity
If missing values are ignored, calculations like mean, sum, correlation, etc., can give incorrect results (a short sketch after this list illustrates this).
2. Model Errors in Machine Learning
Most machine learning algorithms cannot work with missing values. They will throw errors unless the data is cleaned.
3. Bias in Analysis
If missing values are unevenly distributed across categories (e.g., mostly missing from one city or gender), it can lead to biased insights.
4. Data Integrity
Missing data can indicate issues in data collection, sensors, user input, or system bugs. It's important to understand the cause.
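To make the first point concrete, here is a minimal sketch (using a made-up Age column) showing how a missing value changes a simple average depending on whether it is skipped or silently counted:
import pandas as pd
import numpy as np
ages = pd.Series([24, 28, np.nan], name="Age")
# pandas skips NaN by default, so only the two known values are averaged
print(ages.mean())             # 26.0
# dividing by the full row count treats the missing entry as if it were data
print(ages.sum() / len(ages))  # 17.33..., which understates the true average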
Detecting missing values
Missing values often occur due to incomplete data collection, errors during data entry, or system-related issues. To work effectively with your data, the first step is to detect where these missing values occur.
| Function | Description |
|---|---|
| df.isnull() | Returns a DataFrame of the same shape with True where data is missing |
| df.notnull() | Returns True where data is not missing |
| df.isnull().sum() | Counts the number of missing (NaN) values in each column |
Example:
import pandas as pd
import numpy as np
# Sample dataset with some missing values
data = {
"Name": ["Ramya", "Swathi", "Lalitha", "Lucky", "Harish", None],
"Age": [23, np.nan, 27, 24, None, 29],
"City": ["Hyderabad", "Chennai", None, "Delhi", "Bangalore", None]
}
df = pd.DataFrame(data)
# Display the DataFrame
print("Original DataFrame:")
print(df)
# Check for missing values (True for NaN or None)
print("\nDetecting missing values with df.isnull():")
print(df.isnull())
# Count total missing values per column
print("\nMissing values per column:")
print(df.isnull().sum())
# Detect non-missing (valid) values
print("\nDetecting non-missing values with df.notnull():")
print(df.notnull())
Output:
Original DataFrame:
Name Age City
0 Ramya 23.0 Hyderabad
1 Swathi NaN Chennai
2 Lalitha 27.0 None
3 Lucky 24.0 Delhi
4 Harish NaN Bangalore
5 None 29.0 None
Detecting missing values with df.isnull():
Name Age City
0 False False False
1 False True False
2 False False True
3 False False False
4 False True False
5 True False True
Missing values per column:
Name 1
Age 2
City 2
dtype: int64
Detecting non-missing values with df.notnull():
Name Age City
0 True True True
1 True False True
2 True True False
3 True True True
4 True False True
5 False True False
Handling Missing Data Techniques
1. Deletion
Once missing values are detected, the next step is often to remove them, either rows or columns, depending on your analysis goals.
Why Remove Missing Values?
- When missing values are few and do not significantly affect the dataset.
- When you don't want to fill them with default or estimated values.
- To ensure compatibility with tools/models that don't accept NaNs.
Example:
import pandas as pd
import numpy as np
data = {
"Name": ["Ramya", "Swathi", "Lalitha", "Lucky", "Harish"],
"Age": [23, np.nan, 27, 24, None],
"City": ["Hyderabad", "Chennai", None, "Delhi", "Bangalore"]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Output:
Name Age City
0 Ramya 23.0 Hyderabad
1 Swathi NaN Chennai
2 Lalitha 27.0 None
3 Lucky 24.0 Delhi
4 Harish NaN Bangalore
1. Remove Rows with Any Missing Values
df_any = df.dropna(how='any')
print("\nRows with any missing values removed:")
print(df_any)
Removes a row if even one value is missing.
Rows 1, 2, and 4 contain at least one NaN → they are dropped.
Output:
Name Age City
0 Ramya 23.0 Hyderabad
3 Lucky 24.0 Delhi
2. Remove Rows with All Missing Values
Example:
import pandas as pd
import numpy as np
data = {
"Name": ["Ramya", "Swathi", "Lalitha", "Lucky", "Harish", None],
"Age": [23, np.nan, 27, 24, None, None],
"City": ["Hyderabad", "Chennai", None, "Delhi", "Bangalore", None]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
df_all = df.dropna(how='all')
print("\nRows with all missing values removed:")
print(df_all)
Removes a row only if all values are missing.
The row at index 5 has all values missing → it is dropped.
Output:
Name Age City
0 Ramya 23.0 Hyderabad
1 Swathi NaN Chennai
2 Lalitha 27.0 None
3 Lucky 24.0 Delhi
4 Harish NaN Bangalore
3. Drop Columns with Any Missing Values
Example:
import pandas as pd
import numpy as np
data = {
"Name": ["Ramya", "Swathi", "Lalitha", "Lucky", "Harish"],
"Age": [23, np.nan, 27, 24, None],
"City": ["Hyderabad", "Chennai", None, "Delhi", "Bangalore"]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
df_col_any = df.dropna(axis=1, how='any')
print("\nColumns with any missing values removed:")
print(df_col_any)
Removes a column if any row contains a missing value.
Both Age and City have missing values → they are dropped. Only Name remains.
Output:
Name
0 Ramya
1 Swathi
2 Lalitha
3 Lucky
4 Harish
4. Drop Columns with All Missing Values
df_col_all = df.dropna(axis=1, how='all')
print("\nColumns with all missing values removed:")
print(df_col_all)
Removes a column only if all values in it are missing.
No column has all missing values → nothing is dropped.
Output:
Name Age City
0 Ramya 23.0 Hyderabad
1 Swathi NaN Chennai
2 Lalitha 27.0 None
3 Lucky 24.0 Delhi
4 Harish NaN Bangalore
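Besides how='any' and how='all', dropna() also accepts a thresh argument (the minimum number of non-missing values a row must have to be kept) and a subset argument (only consider certain columns). A small sketch, reusing the DataFrame above:
# Keep rows that have at least 2 non-missing values (all rows qualify here)
df_thresh = df.dropna(thresh=2)
print(df_thresh)
# Drop a row only when 'Age' is missing, ignoring the other columns
df_subset = df.dropna(subset=['Age'])
print(df_subset)  # rows 1 and 4 (missing Age) are removed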
2. Imputation
Imputation is the process of replacing missing data with substitute values so that the dataset remains usable for analysis or modeling.
Why Use Imputation?
- Maintains the size of the dataset (no loss of rows)
- Avoids errors during model training (many ML models can't handle NaN).
- Helps preserve important patterns in the data.
- Better than dropping data, especially when the missing values are few.
1. Mean Imputation (For numeric columns)
Replace missing values with the mean of the column.
Use when data is numeric and not heavily skewed (no outliers).
Example:
import pandas as pd
import numpy as np
data = {
"Name": ["Ramya", "Swathi", "Lalitha", "Lucky", "Harish"],
"Age": [23, np.nan, 27, 24, None],
"City": ["Hyderabad", "Chennai", None, "Delhi", "Bangalore"]
}
df = pd.DataFrame(data)
print("Original Data:")
print(df)
df['Age'] = df['Age'].fillna(df['Age'].mean())
print("Data after mean Imputation:")
print(df)
Output:
Original Data:
Name Age City
0 Ramya 23.0 Hyderabad
1 Swathi NaN Chennai
2 Lalitha 27.0 None
3 Lucky 24.0 Delhi
4 Harish NaN Bangalore
Data after mean Imputation:
Name Age City
0 Ramya 23.000000 Hyderabad
1 Swathi 24.666667 Chennai
2 Lalitha 27.000000 None
3 Lucky 24.000000 Delhi
4 Harish 24.666667 Bangalore
2. Mode Imputation (for categorical columns)
Fill missing string values with the most frequent value (mode).
Use for categorical columns (City, Name) when you want to fill with the most common value.
df['City'] = df['City'].fillna(df['City'].mode()[0])
print(df)
Output:
Name Age City
0 Ramya 23.000000 Hyderabad
1 Swathi 24.666667 Chennai
2 Lalitha 27.000000 Chennai
3 Lucky 24.000000 Delhi
4 Harish 24.666667 Bangalore
3. Median Imputation
Replace missing values with the median of that column.
Use when numeric data has outliers (median is more robust).
# Reset original data
df = pd.DataFrame(data)
df['Age'] = df['Age'].fillna(df['Age'].median())
print(df)
Output:
Name Age City
0 Ramya 23.0 Hyderabad
1 Swathi 24.0 Chennai
2 Lalitha 27.0 None
3 Lucky 24.0 Delhi
4 Harish 24.0 Bangalore
4. Constant Value Imputation
You can fill missing values with fixed placeholders like "Unknown" or 0.
Use when you want to mark missing values clearly.
# Fill missing values in Age and City
df = pd.DataFrame(data)
df.fillna({"Age": 0, "City": "Unknown"}, inplace=True)
print(df)
Output:
Name Age City
0 Ramya 23.0 Hyderabad
1 Swathi 0.0 Chennai
2 Lalitha 27.0 Unknown
3 Lucky 24.0 Delhi
4 Harish 0.0 Bangalore
5. Forward Fill
You can fill missing values from the previous row.
Use for time series or where previous values can be logically carried forward.
df = pd.DataFrame(data)
df = df.ffill()  # forward fill; fillna(method='ffill') is deprecated in newer pandas
print(df)
Output:
Name Age City
0 Ramya 23.0 Hyderabad
1 Swathi 23.0 Chennai
2 Lalitha 27.0 Chennai
3 Lucky 24.0 Delhi
4 Harish 24.0 Bangalore
6. Backward Fill
You can fill in the missing values from the next row.
df = pd.DataFrame(data)
df = df.bfill()  # backward fill; fillna(method='bfill') is deprecated in newer pandas
print(df)
Output:
Name Age City
0 Ramya 23.0 Hyderabad
1 Swathi 27.0 Chennai
2 Lalitha 27.0 Delhi
3 Lucky 24.0 Delhi
4 Harish NaN Bangalore
7. Interpolation
Interpolation is a technique used to estimate missing values by using the values before and after the missing one. Instead of filling NaN with static values (like mean or mode), interpolation intelligently estimates what the missing value should be based on existing data.
import pandas as pd
import numpy as np
data = {
"Name": ["Ramya", "Swathi", "Lalitha", "Lucky", "Harish"],
"Age": [23, np.nan, 27, 24, None],
"City": ["Hyderabad", "Chennai", None, "Delhi", "Bangalore"]
}
df = pd.DataFrame(data)
print("Original Data:")
print(df)
# Interpolate missing Age values
df['Age'] = df['Age'].interpolate(method='linear')
print("After Interpolation:")
print(df)
Output:
Original Data:
Name Age City
0 Ramya 23.0 Hyderabad
1 Swathi NaN Chennai
2 Lalitha 27.0 None
3 Lucky 24.0 Delhi
4 Harish NaN Bangalore
After Interpolation:
Name Age City
0 Ramya 23.0 Hyderabad
1 Swathi 25.0 Chennai
2 Lalitha 27.0 None
3 Lucky 24.0 Delhi
4 Harish 24.0 Bangalore
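The fillna-based techniques above can also be expressed with scikit-learn's SimpleImputer, which is convenient when the same fill has to be reapplied to new data later (for example, a test set). A minimal sketch, assuming scikit-learn is installed:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
df = pd.DataFrame({"Age": [23, np.nan, 27, 24, None]})
# strategy can be 'mean', 'median', 'most_frequent', or 'constant'
imputer = SimpleImputer(strategy='mean')
df['Age'] = imputer.fit_transform(df[['Age']]).ravel()
print(df)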
Outliers
An outlier is a data point that is very different (too high or too low) compared to the rest of the data.
Example:
Salaries = [30k, 35k, 32k, 34k, 5 crore]
Here, 5 crore is an outlier.
Outliers can disturb averages, scaling, and machine learning models.
So sometimes we remove or treat them.
Outlier Detection techniques:
1. IQR Method (Interquartile Range)
Based on percentiles:
- Q1 = 25th percentile
- Q3 = 75th percentile
- IQR = Q3 − Q1
- Outlier limits:
  - Lower = Q1 − 1.5 × IQR
  - Upper = Q3 + 1.5 × IQR
- IQR is more robust for skewed or non-normal data.
Example:
import pandas as pd
# Step 1: Create the DataFrame
data = {'Salary': [30000, 32000, 35000, 34000, 37000, 36000, 39000, 41000, 43000, 5000000]}
df = pd.DataFrame(data)
# Step 2: Calculate Q1, Q3, and IQR
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
# Step 3: Define outlier limits
lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR
# Step 4: Filter out outliers
df_iqr = df[(df['Salary'] >= lower_limit) & (df['Salary'] <= upper_limit)]
print("Cleaned Data using IQR:")
print(df_iqr)
Output:
Cleaned Data using IQR:
Salary
0 30000
1 32000
2 35000
3 34000
4 37000
5 36000
6 39000
7 41000
8 43000
Step-by-Step Manual Calculation
1. Sort the data in ascending order:
[30000, 32000, 34000, 35000, 36000, 37000, 39000, 41000, 43000, 5000000]
2. There are 10 values (an even number), so:
- Lower half (first 5 values): [30000, 32000, 34000, 35000, 36000] → Q1 = middle value = 34000
- Upper half (last 5 values): [37000, 39000, 41000, 43000, 5000000] → Q3 = middle value = 41000
3. IQR (Interquartile Range) = Q3 − Q1 → IQR = 41000 − 34000 = 7000
4. Outlier limits:
- Lower limit = Q1 − 1.5 × IQR → 34000 − 1.5 × 7000 = 34000 − 10500 = 23500
- Upper limit = Q3 + 1.5 × IQR → 41000 + 1.5 × 7000 = 41000 + 10500 = 51500
Conclusion:
- Any value below 23500 or above 51500 is an outlier.
- In this case, 5000000 is an outlier; all other values are within range.
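Dropping rows is not the only option: the same limits can be used to cap (winsorize) the extreme value instead, which keeps the row but pulls the value back to the boundary. A small sketch, reusing lower_limit and upper_limit from the code above:
# Cap values outside the IQR limits instead of removing rows
df['Salary_Capped'] = df['Salary'].clip(lower=lower_limit, upper=upper_limit)
print(df)
# The 5,000,000 salary becomes 51,500 (the upper limit); all other salaries are unchanged.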
2. Z-score
Z-score measures how many standard deviations a value is from the mean.
Z-score works better for normally distributed data.
Formula: z = (x − μ) / σ, where x is the value, μ is the mean, and σ is the standard deviation.
Values with z > 3 or z < −3 are usually considered outliers (the threshold can be adjusted depending on the use case).
Example:
from scipy import stats
import numpy as np
# Step 1: Calculate Z-scores
z_scores = np.abs(stats.zscore(df['Salary']))
# Step 2: Filter based on a threshold
# Note: with only 10 values, the largest possible z-score is sqrt(n - 1) = 3,
# so the extreme salary scores just below 3 and a strict "< 3" filter would
# keep it. A slightly lower threshold is used here so it is actually dropped.
df_z_clean = df[z_scores < 2.5]
print("Cleaned Data using Z-Score:")
print(df_z_clean)
Output:
Cleaned Data using Z-Score:
Salary
0 30000
1 32000
2 35000
3 34000
4 37000
5 36000
6 39000
7 41000
8 43000
Calculation:
Mean ≈ 532,700
Standard deviation ≈ 1,489,000 (population standard deviation, which scipy.stats.zscore uses by default)
Z-score of 5,000,000 ≈ 3.0 (the maximum possible for a sample of 10), while every other salary has |z| ≈ 0.33
Outlier detected: 5000000
3. Visualization (Boxplot)
Visualizations are a quick way to see outliers.
Example:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Define salary data
Salaries = [30000, 32000, 35000, 34000, 37000, 36000, 39000, 41000, 43000, 5000000]
# Create DataFrame with column name 'Salary'
df = pd.DataFrame({'Salary': Salaries})
# Plot boxplot
plt.boxplot(df['Salary'])
plt.title("Salary Distribution (with Outlier)")
plt.ylabel("Salary")
plt.show()
Output:
The box spans from Q1 (34000) to Q3 (41000).
The line inside the box is the median (around 36500).
The dot above the box represents the outlier 5000000.
Data Transformation
Data transformation refers to the process of converting raw data into a suitable format that enables a computer (such as a machine learning model, database, or analysis tool) to understand, compare, and utilize it effectively.
⭐ Raw data often has problems:
- Different scales (e.g., marks out of 100 vs age in years).

| Student | Math (out of 100) | Height (cm) |
|---|---|---|
| A | 90 | 150 |
| B | 80 | 170 |
| C | 70 | 160 |

If you directly give this table to a computer (machine learning model):
Math is between 70–90. Height is between 150–170.
So the computer thinks height is more important (bigger numbers) than math marks, even though we care about both equally. So, without scaling, features with large values dominate.
- Different formats (e.g., text “A/B/C” vs numbers).
Computers don’t understand “A, B, C” — they need numbers.
- Missing values.
- Outliers (extreme values).
Data Transformation Techniques
1. Scaling / Normalization / Standardization
Scaling is used to bring all numerical values to a common range, especially when features have different units (like marks and age).
Types:
- Normalization (Min-Max Scaling): scales values between 0 and 1

Formula for Normalization:
x_scaled = (x − min) / (max − min)
Where: x = each value in the column, min = smallest value, max = largest value.
👉 This transformation scales all values between 0 and 1.
For example, numbers = [10, 20, 30, 40, 50], min = 10, max = 50.
Now apply the formula:
| Value | Calculation | Result |
|---|---|---|
| 10 | (10 − 10) / (50 − 10) = 0 / 40 | 0.00 |
| 20 | (20 − 10) / 40 = 10 / 40 | 0.25 |
| 30 | (30 − 10) / 40 = 20 / 40 | 0.50 |
| 40 | (40 − 10) / 40 = 30 / 40 | 0.75 |
| 50 | (50 − 10) / 40 = 40 / 40 | 1.00 |
All values are now between 0 and 1.
This is useful for algorithms like KNN, SVM, and Neural Networks that are sensitive to scale.
✅ Use Min–Max Scaling when you know the min and max are fixed and outliers are not a big issue.
Program:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# Sample data
df = pd.DataFrame({'Marks': [40, 60, 80, 100],
                   'Age': [20, 25, 30, 35]})
# Apply Min-Max Scaling
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)
df_scaled = pd.DataFrame(scaled, columns=['Marks', 'Age'])
print("Scaled Data:")
print(df_scaled)
Output:
Scaled Data:
      Marks       Age
0  0.000000  0.000000
1  0.333333  0.333333
2  0.666667  0.666667
3  1.000000  1.000000
- Standardization (Z-score): transforms values to have mean = 0 and std = 1

This makes data centered and scaled, which is especially useful for models that assume a normal distribution (e.g., logistic regression, SVM, etc.).
Formula for Standardization:
z = (x − μ) / σ
Where: x = each value, μ = mean of the data, σ = standard deviation of the data.
For example, numbers = [10, 20, 30, 40, 50], mean = 30, σ ≈ 14.14.
Now apply the formula:
| Original | Calculation | Z-score (Standardized) |
|---|---|---|
| 10 | (10 − 30) / 14.14 = −20 / 14.14 | −1.41 |
| 20 | (20 − 30) / 14.14 = −10 / 14.14 | −0.71 |
| 30 | (30 − 30) / 14.14 = 0 | 0.00 |
| 40 | (40 − 30) / 14.14 = 10 / 14.14 | 0.71 |
| 50 | (50 − 30) / 14.14 = 20 / 14.14 | 1.41 |
Z-score scaling centers the data and removes scale-related bias.
Program:
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Step 1: Create the DataFrame
data = {'Value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
# Step 2: Apply Standardization (Z-score)
scaler = StandardScaler()
df['Standardized'] = scaler.fit_transform(df[['Value']])
# Step 3: Print Result
print("Standardized Data:")
print(df)
Output:
Standardized Data:
   Value  Standardized
0     10     -1.414214
1     20     -0.707107
2     30      0.000000
3     40      0.707107
4     50      1.414214
2. Encoding
Machine learning models need numeric data, so text (categorical) values must be encoded.
⦿ Label Encoding
Label Encoding converts categorical text values into numeric labels (integers). Each unique category is assigned a unique integer.
Example:
Input: ['Red', 'Green', 'Blue']
Encoded: [2, 1, 0] (alphabetical order: Blue = 0, Green = 1, Red = 2)
Program:
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({'Grade': ['A', 'B', 'A', 'C']})
encoder = LabelEncoder()
df['Grade_Encoded'] = encoder.fit_transform(df['Grade'])
print(df)
Output:
  Grade  Grade_Encoded
0     A              0
1     B              1
2     A              0
3     C              2
⦿ One-Hot Encoding
One-Hot Encoding converts each category into a binary vector (0s and 1s). Each unique category becomes its own column, with 1 indicating presence.
Example:
Input: ['Red', 'Green', 'Blue']
Encoded:
   Red  Green  Blue
0    1      0     0
1    0      1     0
2    0      0     1
Program:
df = pd.DataFrame({'Grade': ['A', 'B', 'A', 'C']})
df_one_hot = pd.get_dummies(df, columns=['Grade'])
print(df_one_hot)
Output:
   Grade_A  Grade_B  Grade_C
0        1        0        0
1        0        1        0
2        1        0        0
3        0        0        1
3. Handling Missing Values
Missing values (NaN) can break models. They should be filled (imputed) using strategies like mean, median, mode, or interpolation.
4. Removing Outliers
Outliers are extreme values that can distort the model. Common methods to remove them are the IQR method, Z-score method, Visualization.
5. Feature Engineering
Feature Engineering means creating new features from existing data that might help the model.
Program:
df = pd.DataFrame({'Join_Date': pd.to_datetime(['2020-01-15', '2021-03-22', '2022-07-01'])})
df['Year'] = df['Join_Date'].dt.year
df['Month'] = df['Join_Date'].dt.month
df['Day'] = df['Join_Date'].dt.day
print(df)
Output:
   Join_Date  Year  Month  Day
0 2020-01-15  2020      1   15
1 2021-03-22  2021      3   22
2 2022-07-01  2022      7    1
String Manipulation
String manipulation is a key part of data cleaning when working with textual or categorical data. It helps fix inconsistencies, remove noise, and prepare text for analysis or modeling.
⭐ Common Use Cases:
- Removing extra spaces
- Converting to lowercase or uppercase
- Replacing or removing unwanted characters
- Extracting substrings
- Splitting strings (e.g., "First Last" → ["First", "Last"]); a short sketch appears after the program below
- Detecting or replacing missing/invalid values
- Standardizing formats (e.g., emails, dates)
Indexing:
It means accessing a specific character in a string using its position. Python uses 0-based indexing (the first character is at position 0).
name = "Alice"
print(name[0])   # Output: 'A'
print(name[-1])  # Output: 'e' (last character)
Slicing:
Slicing means extracting a substring from a string using a range of indexes.
name = "Charlie"
print(name[0:3])  # Output: 'Cha' (from index 0 to 2)
print(name[2:])   # Output: 'arlie' (from index 2 to end)
⭐ Common String Methods (the ones used in the program below):

| Method | Description |
|---|---|
| .str.strip() | Removes leading and trailing spaces |
| .str.lower() / .str.upper() | Converts to lowercase / uppercase |
| .str.capitalize() | Capitalizes the first letter |
| .str.replace() | Replaces characters or substrings |
| .str.extract() | Extracts a pattern using a regular expression |
| .str[ ] | Indexes or slices the characters of each string |
Program:
import pandas as pd
# Raw Data
data = {
'Name': [' sw@thi '],
'Email': ['SwaTHI@GMAIL.Com']
}
df = pd.DataFrame(data)
print("🔹 Step 0: Original Data")
print(df, end="\n\n")
# Step 1: Remove spaces from Name using .str.strip()
df['Name_Step1'] = df['Name'].str.strip()
print("🔹 Step 1: After .str.strip() on Name")
print(df[['Name', 'Name_Step1']], end="\n\n")
# Step 2: Remove '@' from Name using .str.replace()
df['Name_Step2'] = df['Name_Step1'].str.replace('@', '', regex=False)
print("🔹 Step 2: After removing '@' from Name")
print(df[['Name_Step1', 'Name_Step2']], end="\n\n")
# Step 3: Capitalize Name using .str.capitalize()
df['Name_Clean'] = df['Name_Step2'].str.capitalize()
print("🔹 Step 3: Final Cleaned Name")
print(df[['Name_Step2', 'Name_Clean']], end="\n\n")
# Step 4: Lowercase Email using .str.lower()
df['Email_Step1'] = df['Email'].str.lower()
print("🔹 Step 4: Lowercase Email")
print(df[['Email', 'Email_Step1']], end="\n\n")
# Step 5: Replace 'gmail' with 'google' in Email
df['Email_Clean'] = df['Email_Step1'].str.replace('gmail', 'google', regex=False)
print("🔹 Step 5: After replacing 'gmail' with 'google'")
print(df[['Email_Step1', 'Email_Clean']], end="\n\n")
# Step 6: Extract domain using regex
df['Email_Domain'] = df['Email_Clean'].str.extract(r'@(.+)$')
print("🔹 Step 6: Extracted Domain from Email")
print(df[['Email_Clean', 'Email_Domain']], end="\n\n")
# Step 7: Indexing - First letter of name
df['First_Letter'] = df['Name_Clean'].str[0]
print("🔹 Step 7: First Letter of Cleaned Name")
print(df[['Name_Clean', 'First_Letter']], end="\n\n")
# Step 8: Slicing - First 3 letters of name
df['Name_Short'] = df['Name_Clean'].str[:3]
print("🔹 Step 8: First 3 Letters of Cleaned Name")
print(df[['Name_Clean', 'Name_Short']], end="\n\n")
# Step 9: Slicing - Last 3 letters of cleaned email
df['Email_End'] = df['Email_Clean'].str[-3:]
print("🔹 Step 9: Last 3 Letters of Cleaned Email")
print(df[['Email_Clean', 'Email_End']], end="\n\n")
Output:
🔹 Step 0: Original Data
Name Email
0 sw@thi SwaTHI@GMAIL.Com
🔹 Step 1: After .str.strip() on Name
Name Name_Step1
0 sw@thi sw@thi
🔹 Step 2: After removing '@' from Name
Name_Step1 Name_Step2
0 sw@thi swthi
🔹 Step 3: Final Cleaned Name
Name_Step2 Name_Clean
0 swthi Swthi
🔹 Step 4: Lowercase Email
Email Email_Step1
0 SwaTHI@GMAIL.Com swathi@gmail.com
🔹 Step 5: After replacing 'gmail' with 'google'
Email_Step1 Email_Clean
0 swathi@gmail.com swathi@google.com
🔹 Step 6: Extracted Domain from Email
Email_Clean Email_Domain
0 swathi@google.com google.com
🔹 Step 7: First Letter of Cleaned Name
Name_Clean First_Letter
0 Swthi S
🔹 Step 8: First 3 Letters of Cleaned Name
Name_Clean Name_Short
0 Swthi Swt
🔹 Step 9: Last 3 Letters of Cleaned Email
Email_Clean Email_End
0 swathi@google.com com
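One use case from the earlier list that the program does not cover is splitting strings. A minimal sketch with a hypothetical Full_Name column:
import pandas as pd
df_names = pd.DataFrame({'Full_Name': ['Ramya Rao', 'Swathi Kumar']})
# Split "First Last" into two new columns
df_names[['First', 'Last']] = df_names['Full_Name'].str.split(' ', n=1, expand=True)
print(df_names)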
Data Wrangling: Join
A join combines rows from two tables based on a common key column (like CustomerID, key, ProductCode, etc.).
Program:
import pandas as pd
# Left table
df1 = pd.DataFrame({
'key': ['A', 'B', 'C', 'D'],
'value1': [1, 2, 3, 4]
})
# Right table
df2 = pd.DataFrame({
'key': ['B', 'C', 'D', 'E'],
'value2': [5, 6, 7, 8]
})
print("df1:")
print(df1)
print("\ndf2:")
print(df2)
Output:
df1:
key value1
0 A 1
1 B 2
2 C 3
3 D 4
df2:
key value2
0 B 5
1 C 6
2 D 7
3 E 8
Inner Join
It keeps only the rows where the key exists in both DataFrames.
inner_join = pd.merge(df1, df2, on='key', how='inner')
print(inner_join)
Output:
key value1 value2
0 B 2 5
1 C 3 6
2 D 4 7
Outer Join
It keeps all rows from both DataFrames, fills missing with NaN.
outer_join = pd.merge(df1, df2, on='key', how='outer')
print(outer_join)
Output:
key value1 value2
0 A 1.0 NaN
1 B 2.0 5.0
2 C 3.0 6.0
3 D 4.0 7.0
4 E NaN 8.0
Left Join
Keeps all rows from df1 (left), adds matching data from df2.
left_join = pd.merge(df1, df2, on='key', how='left')
print(left_join)
Output:
key value1 value2
0 A 1 NaN
1 B 2 5.0
2 C 3 6.0
3 D 4 7.0
Right Join
Keeps all rows from df2 (right), adds matching data from df1.
right_join = pd.merge(df1, df2, on='key', how='right')
print(right_join)
Output:
key value1 value2
0 B 2.0 5
1 C 3.0 6
2 D 4.0 7
3 E NaN 8
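If the key columns have different names in the two tables, merge() can still join them using left_on and right_on instead of on. A small sketch with made-up customer and order tables:
import pandas as pd
customers = pd.DataFrame({'cust_id': ['C1', 'C2'], 'name': ['Ramya', 'Swathi']})
orders = pd.DataFrame({'customer': ['C1', 'C1', 'C2'], 'amount': [250, 100, 400]})
# Join on columns with different names; both key columns remain in the result
merged = pd.merge(customers, orders, left_on='cust_id', right_on='customer', how='inner')
print(merged)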
Merging with Multiple Keys
Merging with multiple keys means combining DataFrames on two or more columns that together define the relationship between rows in both tables.
Program:
import pandas as pd
# Left DataFrame
df1 = pd.DataFrame({
'key1': ['A', 'B', 'C', 'C'],
'key2': [1, 2, 3, 4],
'value1': ['apple', 'banana', 'cherry', 'date']
})
# Right DataFrame
df2 = pd.DataFrame({
'key1': ['A', 'B', 'C', 'C'],
'key2': [1, 2, 3, 5],
'value2': ['red', 'yellow', 'dark red', 'brown']
})
print("df1:")
print(df1)
print("\ndf2:")
print(df2)
merged = pd.merge(df1, df2, on=['key1', 'key2'], how='inner')
print("\nMerged on key1 and key2:")
print(merged)
Output:
df1:
key1 key2 value1
0 A 1 apple
1 B 2 banana
2 C 3 cherry
3 C 4 date
df2:
key1 key2 value2
0 A 1 red
1 B 2 yellow
2 C 3 dark red
3 C 5 brown
Merged on key1 and key2:
key1 key2 value1 value2
0 A 1 apple red
1 B 2 banana yellow
2 C 3 cherry dark red
Merging Multiple Data Formats
When you merge a CSV and an Excel file, you're essentially combining them into a single DataFrame using a shared column, just like joining tables in SQL.
Suppose:
- data.csv contains personal details (name, age)
- data.xlsx contains salary info (city, salary)
You can merge them using a common column like key:
Program:
import pandas as pd
csv_df = pd.read_csv('data.csv') # Load CSV file
excel_df = pd.read_excel('data.xlsx') # Load Excel file
# Join on common column
merged_df = pd.merge(csv_df, excel_df, on='key', how='inner')
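The snippet above assumes data.csv and data.xlsx already exist on disk. A self-contained sketch (with made-up contents; writing and reading .xlsx files also requires the openpyxl package) could look like this:
import pandas as pd
# Create the two sample files first (hypothetical contents)
pd.DataFrame({'key': ['A', 'B', 'C'],
              'name': ['Ramya', 'Swathi', 'Lucky'],
              'age': [24, 28, 23]}).to_csv('data.csv', index=False)
pd.DataFrame({'key': ['A', 'B', 'D'],
              'city': ['Hyderabad', 'Chennai', 'Delhi'],
              'salary': [50000, 60000, 55000]}).to_excel('data.xlsx', index=False)
csv_df = pd.read_csv('data.csv')       # Load CSV file
excel_df = pd.read_excel('data.xlsx')  # Load Excel file
# Only keys A and B appear in both files, so the inner join keeps those two rows
merged_df = pd.merge(csv_df, excel_df, on='key', how='inner')
print(merged_df)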
Hierarchical Indexing
A MultiIndex (hierarchical index) in pandas allows you to have multiple levels of row or column labels in a single DataFrame.
This is useful when dealing with multi-dimensional data like (Product, Month), (Region, Year), etc., while still using a 2D table.
Example:
import pandas as pd
import numpy as np
# Define the row MultiIndex
index = pd.MultiIndex.from_tuples([
('Product A', 'January'),
('Product A', 'February'),
('Product B', 'January'),
('Product B', 'February'),
], names=['Product', 'Month'])
# Define the column MultiIndex
columns = pd.MultiIndex.from_tuples([
('East', 'Sales'),
('East', 'Profit'),
('West', 'Sales'),
('West', 'Profit'),
], names=['Region', 'Metric'])
# Create the DataFrame
data = [[100, 20, 120, 25],
[110, 22, 130, 27],
[90, 18, 100, 20],
[95, 19, 105, 22]]
df = pd.DataFrame(data, index=index, columns=columns)
print(df)
Output:
Region East West
Metric Sales Profit Sales Profit
Product Month
Product A January 100 20 120 25
February 110 22 130 27
Product B January 90 18 100 20
February 95 19 105 22
◘ Inspecting MultiIndex
print(df.index.names) # ['Product', 'Month']
print(df.columns.names) # ['Region', 'Metric']
print(df.index.levels) # Index levels for rows
print(df.columns.levels) # Index levels for columns
1️⃣ Select all rows for Product A
print(df.loc['Product A'])
Output:
Region East West
Metric Sales Profit Sales Profit
Month
January 100 20 120 25
February 110 22 130 27
2️⃣ Select column ('West', 'Profit')
print(df[('West', 'Profit')])
Output:
Product Month Profit
Product A January 25
February 27
Product B January 20
February 22
Name: (West, Profit), dtype: int64
3️⃣ Cross-section for Month = 'January'
print(df.xs('January', level='Month'))
Output:
Region East West
Metric Sales Profit Sales Profit
Product
Product A 100 20 120 25
Product B 90 18 100 20
4️⃣ Cross-section on columns for Metric = 'Sales'
print(df.xs('Sales', level='Metric', axis=1))
Output:
Region East West
Product Month
Product A January 100 120
February 110 130
Product B January 90 100
February 95 105
Grouping and Aggregation by Level
print(df.groupby(level='Product').sum())
Output:
Region East West
Metric Sales Profit Sales Profit
Product
Product A 210 42 250 52
Product B 185 37 205 42
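The same level-based grouping works on the Month level as well, for example to compare January and February across both products. A short sketch, continuing with the df defined above:
# Total per month across both products
print(df.groupby(level='Month').sum())
# Average sales/profit per month instead of totals
print(df.groupby(level='Month').mean())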
Combining and Merging Datasets
We can combine datasets in 2 ways:
1. Merging
merge() is used to join two DataFrames based on a common key column (like SQL joins). It allows combining datasets horizontally (side by side), matching rows using one or more columns.
Program:
import pandas as pd
# Left table
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                    'value1': [1, 2, 3, 4]})
# Right table
df2 = pd.DataFrame({'key': ['B', 'C', 'D', 'E'],
                    'value2': [5, 6, 7, 8]})
print("df1:")
print(df1)
print("\ndf2:")
print(df2)
inner_join = pd.merge(df1, df2, on='key', how='inner')
print(inner_join)
Output:
df1:
  key  value1
0   A       1
1   B       2
2   C       3
3   D       4

df2:
  key  value2
0   B       5
1   C       6
2   D       7
3   E       8

  key  value1  value2
0   B       2       5
1   C       3       6
2   D       4       7
2. Concatenation
concat() is used to stack multiple DataFrames either vertically (row-wise) or horizontally (column-wise). You don't need to join on a key.
Program:
dfA = pd.DataFrame({'month': ['Jan', 'Feb'],
                    'sales': [100, 120]})
dfB = pd.DataFrame({'month': ['Mar', 'Apr'],
                    'sales': [90, 130]})
# Concatenate rows
concat_rows = pd.concat([dfA, dfB], ignore_index=True)
print("Row-wise Concat:\n", concat_rows)
# Concatenate column-wise
concat_columns = pd.concat([dfA, dfB], axis=1)
print("\nColumn-wise Concat:\n", concat_columns)
Output:
Row-wise Concat:
  month  sales
0   Jan    100
1   Feb    120
2   Mar     90
3   Apr    130

Column-wise Concat:
  month  sales month  sales
0   Jan    100   Mar     90
1   Feb    120   Apr    130
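concat() can also tag each block with a label through its keys argument, which produces a hierarchical index like the one described earlier, so rows stay traceable to their source. A small sketch reusing dfA and dfB (the labels are arbitrary):
# Label each source DataFrame so its rows can be identified after stacking
concat_keys = pd.concat([dfA, dfB], keys=['dfA', 'dfB'])
print(concat_keys)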
Reshaping Datasets
This is the process of converting datasets between wide and long formats to suit analysis or visualization needs.
1. Melting
It is the process of converting datasets from wide-form data (separate columns for each variable) into long-form, where each row is one observation.
Syntax:
pd.melt(dataframe, id_vars, value_vars, var_name, value_name)
| Argument | Description |
|---|---|
| dataframe | Your original wide DataFrame |
| id_vars | Columns to keep fixed (e.g., 'Name') |
| value_vars | Columns to melt (e.g., 'Python', 'C', 'Java') |
| var_name | New name for the variable column (e.g., 'Subject') |
| value_name | New name for the value column (e.g., 'Score') |
Program:
import pandas as pd
# Wide format
df = pd.DataFrame({'Name': ['Swathi', 'Surya', 'Swamy', 'Chinnu'],
                   'Python': [90, 85, 88, 95],
                   'C': [80, 75, 89, 92],
                   'Java': [85, 78, 90, 88]})
# Melt to long format
df_melted = pd.melt(df,
                    id_vars=['Name'],
                    value_vars=['Python', 'C', 'Java'],
                    var_name='Subject',
                    value_name='Score')
print(df_melted)
Output:
      Name Subject  Score
0   Swathi  Python     90
1    Surya  Python     85
2    Swamy  Python     88
3   Chinnu  Python     95
4   Swathi       C     80
5    Surya       C     75
6    Swamy       C     89
7   Chinnu       C     92
8   Swathi    Java     85
9    Surya    Java     78
10   Swamy    Java     90
11  Chinnu    Java     88
2. Pivoting
Pivoting reshapes data from long format back to wide format. It groups and arranges data into a table layout with rows and columns.
Syntax:
dataframe.pivot_table(index, columns, values)
| Argument | Description |
|---|---|
| index | Column to use as row labels (e.g., 'Name') |
| columns | Column to turn into new columns (e.g., 'Subject') |
| values | Column with values to fill in the table (e.g., 'Score') |
Program:
# Pivot back to wide format
df_pivoted = df_melted.pivot_table(index='Name',
columns='Subject',
values='Score').reset_index()
print(df_pivoted)
Output:
Subject Name C Java Python
0 Chinnu 92 88 95
1 Surya 75 78 85
2 Swamy 89 90 88
3 Swathi 80 85 90
3. Stacking
It is the process of reorganising datasets to reshape the structure.
df.stack() turns columns into row levels (long format). df.unstack() moves row index levels into columns (wide format).
Program:
# Create wide DataFrame and set index
df_month = pd.DataFrame({'City': ['Hyderabad', 'Bangalore'],
                         'Jan': [100, 150],
                         'Feb': [120, 160]}).set_index('City')
print("Original Wide Format:")
print(df_month)
# Stack
stacked = df_month.stack()
print("\nStacked (Long Format):")
print(stacked)
# Unstack
unstacked = stacked.unstack()
print("\nUnstacked (Back to Wide):")
print(unstacked)
Output:
Original Wide Format:
           Jan  Feb
City
Hyderabad  100  120
Bangalore  150  160

Stacked (Long Format):
City
Hyderabad  Jan    100
           Feb    120
Bangalore  Jan    150
           Feb    160
dtype: int64

Unstacked (Back to Wide):
           Jan  Feb
City
Hyderabad  100  120
Bangalore  150  160
Advantages
- Easy to merge data from different sources → e.g., sales + customers.
- Reshape data for visualization (pivot charts).
- Helps discover insights (patterns, trends).
Disadvantages
- Wrong merge keys → data loss.
- Complex reshaping can confuse beginners.
- Large datasets = high memory usage.
Applications
- Business Intelligence: Merge sales + customer demographics.
- Finance: Consolidate balance sheet + cash flow.
- Healthcare: Combine patient data from multiple systems.
- Market Research: Merge survey + consumer data.