What are pandas?

Pandas is a powerful open-source data manipulation and analysis library for Python. It provides data structures like DataFrame and Series, which are used to work with structured data. It provides fast, flexible, data structures that make it easy to handle and analyze large datasets, similar to working with tables or spreadsheets.

Key Features

1. Data Structures

Series → A one-dimensional array-like object. It’s like a single column from a DataFrame.
DataFrame → This is the main structure in Pandas, similar to a table or spreadsheet. It consists of rows and columns, and you can easily manipulate, filter, and analyze data within it.

2. Data Manipulation

Handling missing data easily.
Data filtering, sorting, grouping, and aggregation.
Merging and joining datasets.
Time-series functionality (date parsing, resampling, etc.).

3. Performance

Built on NumPy, so it’s optimized for performance.
Vectorized operations for fast computation.

4. Integration

Works well with libraries like NumPy, Matplotlib, and Scikit-learn.

5. Indexing and Selection

Selecting rows and columns using labels or indices
Conditional selection using boolean indexing

6. Data handling

You can import/export data from various formats like CSV, Excel, SQL databases, JSON, HTML, etc.

Advantages

1. Efficient Data Handling

Pandas provides fast, flexible, and expressive data structures (DataFrames and Series), which are optimized for performance. These structures allow you to perform complex data manipulation quickly and efficiently.

2. Easy to Use

The syntax is intuitive, making it easy to learn for beginners and efficient for experienced users. The API is highly user-friendly, allowing you to perform data manipulations with minimal code.

3. Handling Missing Data

Pandas provides built-in methods for dealing with missing data (NaN), such as filling, dropping, or forward/backward filling, which is critical for real-world data analysis.

4. Support for Various File Formats

Pandas can read and write data from a variety of formats: CSV, Excel, JSON, HDF5, SQL databases, and more. This makes it a great choice for integrating different data sources.

5. Integration with Other Libraries

Pandas integrates seamlessly with other Python libraries like NumPy (for numerical computations), Matplotlib/Seaborn (for visualization), and Scikit-learn (for machine learning), allowing for a full data science workflow.

Disadvantages

1. Memory Consumption

While Pandas is fast and powerful, it can be memory-intensive, especially when working with large datasets. The DataFrame structure can consume a lot of RAM, which can lead to performance issues on very large datasets (over a few GBs).

2. Learning Curve

While basic operations in Pandas are easy to learn, more advanced features (like multi-indexing, pivoting, or complex aggregation) can be tricky for newcomers to grasp.

3. Inconsistent Performance on Specific Operations

While Pandas is highly optimized, some operations (like string manipulations, or using apply() with custom functions) can be slower than vectorized operations with NumPy or more specialized libraries.

Applications

1. Data Analysis

Exploratory Data Analysis (EDA): Pandas is commonly used for EDA to analyze datasets, calculate statistics (mean, median, mode), and understand the distribution of data.
Data Summarization: It’s used for calculating summary statistics like average, standard deviation, etc., as well as grouping data by categories for aggregation.

2. Data Cleaning

Handling Missing Data: Removing, filling, or interpolating missing values in datasets.
Data Transformation: Renaming columns, changing data types, applying functions across columns/rows.
Dealing with Outliers: Detecting and handling outliers in datasets.

3. Machine Learning Data Preprocessing

Data Preprocessing: Before applying machine learning algorithms, data often needs to be cleaned, transformed, and standardized. Pandas is a key tool in preparing data for machine learning models.
Feature Engineering: Creating new features from existing data (e.g., time-based features, aggregating data) is done with Pandas.

Pandas Data Structures

Pandas provides 2 data structures:

1. Series

A Series is a one-dimensional labeled array that can hold any data type (integers, strings, floats, Python objects, etc.).

Key Features:

Homogeneous data (all elements are of the same type)
Associated labels (index)
Can be thought of like a column in a spreadsheet or SQL table

Series are generally created from:

Arrays
Lists
Dict

From Arrays:

import pandas as pd

import numpy as np

# Create a pandas Series from a NumPy array

arr= np.array([10, 20, 30, 40])

s = pd.Series(arr)

print(s)

Output:

0 10

1 20

2 30

3 40

dtype: int64

Left column is index
Right column is data

From Lists:

import pandas as pd

# Creating a Series from a list

s = pd.Series([10, 20, 30, 40])

print(s)

Output:

0 10

1 20

2 30

3 40

dtype: int64

From Dictionary:

# Create a pandas Series from a dictionary

dict_data = {'a': 100, 'b': 200, 'c': 300}

series_from_dict = pd.Series(dict_data)

print(series_from_dict)

Output:

a 100

b 200

c 300

dtype: int64

2. DataFrames

A DataFrame is a two-dimensional labeled data structure with columns (like a spreadsheet or SQL table).

Key Features:

Heterogeneous data (each column can have a different data type)
Labeled axes (rows and columns)
Flexible indexing and powerful data manipulation

DataFrames in pandas can be created from :

List
List of tuples
Dictionary
Excel Spreadsheet files
csv (common separated values) files

From a list:

import pandas as pd

# List of lists

data = [['Swathi', 'Vizag'], ['Surya', 'Hyderabad'], ['Chinnu', 'Pune']]

df = pd.DataFrame(data, columns=['Name', 'City'])

print(df)

Output:

Name City

0 Swathi Vizag

1 Surya Hyderabad

2 Chinnu Pune

From a list of tuples:

import pandas as pd

# List of tuples

data = [('Swathi', 'Vizag'), ('Surya', 'Hyderabad'), ('Chinnu', 'Pune')]

df = pd.DataFrame(data, columns=['Name', 'City'])

print(df)

Output:

Name City

0 Swathi Vizag

1 Surya Hyderabad

2 Chinnu Pune

From a Dictionary

import pandas as pd

# Creating a DataFrame with your data

data = {

'Name': ['Swathi', 'Surya', 'Chinnu'],

'City': ['Vizag', 'Hyderabad', 'Pune'],

'Age': [28, 32, 26],

'Salary': [60000, 75000, 52000]

}

df = pd.DataFrame(data)

print(df)

Output:

Name City Age Salary

0 Swathi Vizag 28 60000

1 Surya Hyderabad 32 75000

2 Chinnu Pune 26 52000

From an Excel File (.xlsx)

df = pd.read_excel('data.xlsx') # install openpyxl if not already: pip install openpyxl

print(df)

You need to have the Excel file (data.xlsx) in your directory, or provide the full path.

From a CSV File (.csv)

df = pd.read_csv('data.csv')

print(df)

You need to have the CSV file (data.csv) in your directory, or provide the full path.

Essential Functionality

Viewing data
This helps in getting an overview of your DataFrame - like how many rows/columns it has, data types, null values, etc.

Function Description

`df.he ad(n)` Returns first `n` rows (default 5)

`df.tail(n)` Returns last `n` rows

`df.shape` Returns tuple of (rows, columns)

`df.info()` Summary: non-null values, dtypes

`df.describe()` Statistics for numeric columns

`df.columns` List of column names

`df.dtypes` Data types of each column

Program

import pandas as pd

data = {
'Name': ['Dev', 'Swathi', 'Charlie', 'Chinnu', 'Surya'],
'Age': [25, 30, 35, 40, 28],
'Salary': [50000, 60000, 65000, 70000, 52000]
}

df = pd.DataFrame(data)

print("Head:\n", df.head())
print("Shape:", df.shape)
print("Info:")
df.info()
print("Describe:\n", df.describe())

Output:

Head:
Name Age Salary
0 Dev 25 50000
1 Swathi 30 60000
2 Charlie 35 65000
3 Chinnu 40 70000
4 Surya 28 52000

Shape: (5, 3)

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 5 non-null object
1 Age 5 non-null int64
2 Salary 5 non-null int64
dtypes: int64(2), object(1)
memory usage: 248.0+ bytes

Describe:
Age Salary
count 5.000000 5.000000
mean 31.600000 59400.000000
std 6.577297 8202.439620
min 25.000000 50000.000000
25% 28.000000 52000.000000
50% 30.000000 60000.000000
75% 35.000000 65000.000000
max 40.000000 70000.000000

Renaming and Deleting Labels
To rename a label (i.e., a column name or an index label) in a pandas DataFrame, you can use the `rename()` method.
`rename()` does not modify the original DataFrame unless you set `inplace=True`.

Program:

import pandas as pd

data = {
'Name': ['Dev', 'Swathi', 'Charlie', 'Chinnu', 'Surya'],
'Age': [25, 30, 35, 40, 28],
'Salary': [50000, 60000, 65000, 70000, 52000]
}

df = pd.DataFrame(data)

df = df.rename(columns={'Salary': 'Income'}, inplace=True)

Output:

Name Age Income
0 Dev 25 50000
1 Swathi 30 60000
2 Charlie 35 65000
3 Chinnu 40 70000
4 Surya 28 52000

To rename row index labels, use the `index` parameter:

Function	Description
`df.he ad(n)`	Returns first `n` rows (default 5)
`df.tail(n)`	Returns last `n` rows
`df.shape`	Returns tuple of (rows, columns)
`df.info()`	Summary: non-null values, dtypes
`df.describe()`	Statistics for numeric columns
`df.columns`	List of column names
`df.dtypes`	Data types of each column

Program:

df = df.rename(index={0: 'a', 1: 'b'}, inplace=True)

Output:

Name Age Income
a Dev 25 50000
b Swathi 30 60000
2 Charlie 35 65000
3 Chinnu 40 70000
4 Surya 28 52000

To delete labels (i.e., rows or columns) in a pandas DataFrame, you can use the `drop()` method.

Program:

import pandas as pd

data = {
'Name': ['Dev', 'Swathi', 'Charlie', 'Chinnu', 'Surya'],
'Age': [25, 30, 35, 40, 28],
'Salary': [50000, 60000, 65000, 70000, 52000]
}

df = pd.DataFrame(data)

# Delete the 'Salary' column
df = df.drop(columns=['Salary']) # without using inplace

# Delete the row with index 3
df = df.drop(index=3)

print(df)

Output:

Name Age
0 Dev 25
1 Swathi 30
2 Charlie 35
4 Surya 28

Handling Missing Data
Missing values (NaNs) can distort analysis and must be handled — either by removing or filling them.

Function Description

`isnull()` Detects missing values

`dropna()` Removes missing rows/columns

`fillna()` Fills missing values

`interpolate()` Fills values using interpolation

Function	Description
`isnull()`	Detects missing values
`dropna()`	Removes missing rows/columns
`fillna()`	Fills missing values
`interpolate()`	Fills values using interpolation

Program:

import pandas as pd
import numpy as np

data = {
'A': [1, 2, np.nan, 4],
'B': [np.nan, 2, 3, 4]
}

df = pd.DataFrame(data)
print("Original:\n", df)

# Fill missing with 0
df_filled = df.fillna(0)
print("\nFilled:\n", df_filled)

# Drop rows with NaNs
df_dropped = df.dropna()
print("\nDropped:\n", df_dropped)

Output:

Original:
A B
0 1.0 NaN
1 2.0 2.0
2 NaN 3.0
3 4.0 4.0

Filled:
A B
0 1.0 0.0
1 2.0 2.0
2 0.0 3.0
3 4.0 4.0

Dropped:
A B
1 2.0 2.0
3 4.0 4.0

Sorting Data
To sort data in a pandas DataFrame, use the `sort_values(),`sort_index() method.

Function Description

`sort_values()` Sort by column values

`sort_index()` Sort by index

Program:

import pandas as pd

data = {
'Name': ['Dev', 'Swathi', 'Charlie', 'Chinnu', 'Surya'],
'Age': [25, 30, 35, 40, 28],
'Salary': [50000, 60000, 65000, 70000, 52000]
}

df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

# Sort by Salary in descending order
df_sorted = df.sort_values(by='Salary', ascending=False)

# Print sorted DataFrame
print("\nSorted DataFrame by Salary (descending):\n", df_sorted)

Output:

Original DataFrame:
Name Age Salary
0 Dev 25 50000
1 Swathi 30 60000
2 Charlie 35 65000
3 Chinnu 40 70000
4 Surya 28 52000

Sorted DataFrame by Salary (descending):
Name Age Salary
3 Chinnu 40 70000
2 Charlie 35 65000
1 Swathi 30 60000
4 Surya 28 52000
0 Dev 25 50000

You can sort multiple columns. Sort by Age (ascending), then Salary (descending).

Program:
import pandas as pd

data = {
'Name': ['Dev', 'Swathi', 'Charlie', 'Chinnu', 'Surya'],
'Age': [25, 30, 30, 40, 30],
'Salary': [50000, 60000, 55000, 70000, 52000]
}

df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

# Sort by Age (ascending), then Salary (descending)
df_sorted = df.sort_values(by=['Age', 'Salary'], ascending=[True, False])

print("\nSorted DataFrame by Age (asc) and Salary (desc):\n", df_sorted)

Function	Description
`sort_values()`	Sort by column values
`sort_index()`	Sort by index

Output:

Original DataFrame:
Name Age Salary
0 Dev 25 50000
1 Swathi 30 60000
2 Charlie 30 55000
3 Chinnu 40 70000
4 Surya 30 52000

Sorted DataFrame by Age (asc) and Salary (desc):
Name Age Salary
0 Dev 25 50000
1 Swathi 30 60000
2 Charlie 30 55000
4 Surya 30 52000
3 Chinnu 40 70000

The data is first sorted by Age in ascending order.
For rows with the same Age, it then sorts by Salary in descending order.

Applying functions
You can apply functions to rows, columns, or each element using:
`apply()` – for rows/columns on series & DataFrames
`map()` – for Series
`applymap()` – for element-wise operations on DataFrames

Program: apply()

import pandas as pd

df = pd.DataFrame({
'A': [1, 2, 3],
'B': [10, 20, 30]
})

# Apply sum row-wise (axis=1)
row_sums = df.apply(sum, axis=1)
print("Row sums:\n", row_sums)

# Apply max column-wise (axis=0)
col_max = df.apply(max, axis=0)
print("\nColumn max:\n", col_max)

Output:
Row sums:
0 11
1 22
2 33
dtype: int64

Column max:
A 3
B 30
dtype: int64

Program: map()

import pandas as pd

s = pd.Series([1, 2, 3])
squared = s.map(lambda x: x ** 2)
print("Squared Series:\n", squared)

Output:

Squared Series:
0 1
1 4
2 9
dtype: int64

Program: applymap()

import pandas as pd

df = pd.DataFrame({
'A': [1, 2],
'B': [3, 4]
})

new_df = df.applymap(lambda x: x * 10)
print("Element-wise multiplied DataFrame:\n", new_df)

Output:

Element-wise multiplied DataFrame:
A B
0 10 30
1 20 40

Summarizing and Computing Descriptive Statistics

1. Basic Summary: `describe()`

The describe() method gives you summary statistics of your numerical columns by default.

import pandas as pd

data = {

'Name': ['Dev', 'Swathi', 'Charlie', 'Chinnu', 'Surya'],

'Age': [25, 30, 35, 40, 28],

'Salary': [50000, 60000, 65000, 70000, 52000]

}

df = pd.DataFrame(data)

print(df.describe())

Age Salary

count 5.000000 5.000000

mean 31.600000 59400.000000

std 6.577297 8202.439620

min 25.000000 50000.000000

25% 28.000000 52000.000000

50% 30.000000 60000.000000

75% 35.000000 65000.000000

max 40.000000 70000.000000

Statistic	Meaning
`count`	Number of non-null values
`mean`	Average value
`std`	Standard deviation
`min`	Minimum value
`25%`	1st quartile (25th percentile)
`50%`	Median (2nd quartile, 50th percentile)
`75%`	3rd quartile (75th percentile)
`max`	Maximum value

2. Statistics on Specific Columns

# Mean of Salary

print("Mean Salary:", df['Salary'].mean())

# Median of Age

print("Median Age:", df['Age'].median())

# Standard deviation

print("Salary Std Dev:", df['Salary'].std())

Output:

Mean Salary: 59400.0

Median Age: 30.0

Salary Std Dev: 8202.439620425219

Function	Description	Example
`df.mean()`	Mean	`df['Age'].mean()`
`df.median()`	Median	`df['Salary'].median()`
`df.mode()`	Mode	`df['Age'].mode()`
`df.min()`	Minimum	`df['Salary'].min()`
`df.max()`	Maximum	`df['Salary'].max()`
`df.var()`	Variance	`df['Salary'].var()`
`df.std()`	Standard deviation	`df['Salary'].std()`
`df.count()`	Non-null count	`df.count()`
`df.sum()`	Total sum	`df['Salary'].sum()`
`df.skew()`	Skewness	`df.skew()`
`df.kurt()`	Kurtosis	`df.kurt()`

3. groupby()

The groupby() function splits the data into groups based on some criteria (like a column), and then you can apply an aggregate function (like sum(), mean(), count()).

Program:

import pandas as pd

data = {

'Department': ['HR', 'IT', 'HR', 'Finance', 'IT', 'Finance'],

'Name': ['Dev', 'Swathi', 'Charlie', 'Chinnu', 'Surya', 'Kiran'],

'Salary': [50000, 60000, 52000, 70000, 65000, 62000],

'Age': [25, 30, 35, 40, 28, 32]

}

df = pd.DataFrame(data)

print("Original DataFrame:\n", df)

# Group by one column and aggregate

salary_by_dept = df.groupby('Department')['Salary'].sum()

print("\nTotal Salary by Department:\n", salary_by_dept)

# Multiple Aggregations per Group

salary_stats = df.groupby('Department')['Salary'].agg(['count', 'mean', 'max', 'min'])

print("\nSalary Statistics by Department:\n", salary_stats)

# Group by Multiple Columns

grouped = df.groupby(['Department', 'Age'])['Salary'].mean()

print("\nSalary by Department and Age:\n", grouped)

# Apply Different Aggregations to Multiple Columns

summary = df.groupby('Department').agg({

'Salary': ['mean', 'max'],

'Age': 'median'

})

print("\nCustom Aggregation:\n", summary)

Output:

Total Salary by Department:

Department

Finance 132000

HR 102000

IT 125000

Name: Salary, dtype: int64

Salary Statistics by Department:

count mean max min

Department

Finance 2 66000.0 70000 62000

HR 2 51000.0 52000 50000

IT 2 62500.0 65000 60000

Salary by Department and Age:

Department Age

Finance 32 62000

40 70000

HR 35 50000

35 52000

IT 28 65000

30 60000

Name: Salary, dtype: int64

Custom Aggregation:

Salary Age

mean max median

Department

Finance 66000.0 70000 36

HR 51000.0 52000 30

IT 62500.0 65000 29

Data Loading, Storage, and File Formats

Reading and Writing Data in Text Format

The Pandas library in Python makes it extremely easy to handle structured text data such as CSV, TSV, and TXT files. Whether you are saving DataFrames to files or reading large datasets efficiently, Pandas provides powerful and user-friendly functions to manage text-based data.

1. Writing Data to a Text File

The to_csv() function in Pandas allows you to write a DataFrame to a text file with a chosen delimiter.


import pandas as pd

# Create a simple DataFrame
data = {
    "Name": ["Swati", "Rahul", "Ananya"],
    "Age": [25, 30, 22],
    "City": ["Hyderabad", "Delhi", "Mumbai"]
}
df = pd.DataFrame(data)

# Save to text file (CSV format with tab separator)
df.to_csv("data.txt", sep="\t", index=False)
print("Data written to data.txt")

In the code above:

The sep="\t" argument sets the delimiter to a tab character. You can also use "," for CSV files, " " for space-separated files, or any other custom delimiter.
The index=False argument prevents Pandas from writing the default row numbers to the file, resulting in a cleaner output.

The resulting file (data.txt) will contain:


Name	Age	City
Swati	25	Hyderabad
Rahul	30	Delhi
Ananya	22	Mumbai

2. Reading Data from a Text File

To read the data back into a DataFrame, use the read_csv() function with the same separator that was used while writing the file.


df_read = pd.read_csv("data.txt", sep="\t")
print(df_read)

This reads the contents of the tab-separated file into a DataFrame and prints it.

3. Writing Only Specific Columns

You can also write only selected columns from your DataFrame to a text file. For example:


df.to_csv("names.txt", columns=["Name", "City"], sep=" ", index=False)

This code saves only the "Name" and "City" columns using a space as the separator. The output will look like:


Name City
Swati Hyderabad
Rahul Delhi
Ananya Mumbai

This is useful when you don’t need to save the entire dataset.

4. Reading Large Text Files in Chunks

When working with very large text files, loading the entire file at once may not be memory-efficient. Pandas allows you to read large files in chunks using the chunksize parameter.


for chunk in pd.read_csv("data.txt", sep="\t", chunksize=2):
    print(chunk)

This code reads the file in chunks of 2 rows at a time and prints each chunk. This method helps reduce memory usage when dealing with large datasets.

Output:


    Name  Age      City
0  Swati   25  Hyderabad
1  Rahul   30      Delhi

     Name  Age   City
2  Ananya   22  Mumbai

Each chunk is a smaller DataFrame that can be processed independently.

5. Exporting Data to Other Text Formats

You can export data to different text-based formats simply by changing the delimiter in the to_csv() function.

Writing to a CSV File (Comma-Separated)


df.to_csv("data.csv", index=False)

This saves the file in the common CSV format where columns are separated by commas.

Output:


Name,Age,City
Swati,25,Hyderabad
Rahul,30,Delhi
Ananya,22,Mumbai

Writing to a TSV File (Tab-Separated)


df.to_csv("data.tsv", sep="\t", index=False)

This saves the file using tabs to separate columns.

Output:


Name	Age	City
Swati	25	Hyderabad
Rahul	30	Delhi
Ananya	22	Mumbai

Writing to a File with a Custom Delimiter (e.g., Pipe `|`)


df.to_csv("data_pipe.txt", sep="|", index=False)

This will create a file where columns are separated by the pipe (|) character.

Output:


Name|Age|City
Swati|25|Hyderabad
Rahul|30|Delhi
Ananya|22|Mumbai

6. Reading and Writing JSON Data

In addition to working with CSV and text files, Pandas also provides full support for JSON (JavaScript Object Notation) — a popular format for APIs and web data.

Writing Data to a JSON File

You can easily convert a DataFrame into a JSON file using the to_json() method.


df.to_json("data.json", orient="records", lines=True)

orient="records": Each row is converted into a separate JSON object (dictionary).
lines=True: Each JSON object is written on a new line (known as JSON Lines format), which is efficient for large files and commonly used in data pipelines.

Output in data.json:


{"Name":"Swati","Age":25,"City":"Hyderabad"}
{"Name":"Rahul","Age":30,"City":"Delhi"}
{"Name":"Ananya","Age":22,"City":"Mumbai"}

You can also write the entire file as a single JSON object (array of records):


df.to_json("data_array.json", orient="records", lines=False)

This will produce:


[
  {"Name":"Swati","Age":25,"City":"Hyderabad"},
  {"Name":"Rahul","Age":30,"City":"Delhi"},
  {"Name":"Ananya","Age":22,"City":"Mumbai"}
]

Reading Data from a JSON File

To read a JSON file into a DataFrame, use pd.read_json():


# For standard JSON array
df_json = pd.read_json("data_array.json")
print(df_json)

If the JSON file uses the JSON Lines format, include the lines=True argument:


df_jsonl = pd.read_json("data.json", lines=True)
print(df_jsonl)

7. Reading and Writing TSV Files (Tab-Separated Values)
TSV (Tab-Separated Values) files are very similar to CSV files, with the key difference being that they use a tab character (\t) instead of a comma to separate fields.
This format is especially useful in domains such as bioinformatics, natural language processing, and spreadsheet exports where the data may already contain commas, making the CSV format less reliable.
To write a DataFrame to a TSV file using Pandas, you can use the to_csv() function with the sep="\t" argument. 
For example:
df.to_csv("data.tsv", sep="\t", index=False)
In this code, sep="\t" tells Pandas to use a tab character as the field separator, and index=False prevents the index column from being written to the file. 
The output in the data.tsv file will look like this:
Name	Age	City
Swati	25	Hyderabad
Rahul	30	Delhi
Ananya	22	Mumbai
This format is clean, easy to read, and more robust than CSV when your data contains commas within fields.
To read a TSV file back into a DataFrame, you can use the read_csv() function with the same sep="\t" argument:
df_tsv = pd.read_csv("data.tsv", sep="\t")
print(df_tsv)
This command correctly interprets the tab-separated values in the file and loads them into a DataFrame with the appropriate column structure.
Binary Formats
Pandas supports efficient binary file formats such as Pickle and HDF5. 
These formats are faster to read and write, preserve data types, and are better suited for handling large datasets in memory or on disk.
Pickle Format
Pickle is Python’s built-in binary serialization format. 
It is used to save and load Python objects — including Pandas DataFrames — in a compact, byte-stream format. 
Pickle files are Python-specific, which means they may not be compatible across different programming languages or platforms.
Save a DataFrame as a Pickle File
To save a DataFrame as a Pickle file, use the to_pickle() method:
df.to_pickle("data.pkl")
This command serializes the DataFrame object and writes it to a file called data.pkl. It saves everything — including column data types, index labels, and metadata — exactly as it is.
Read a Pickle File
To read the Pickle file back into a DataFrame, use the read_pickle() function:
df_loaded = pd.read_pickle("data.pkl")
print(df_loaded)
This deserializes the binary data and restores the original DataFrame structure and content. 
The process is very fast and does not require any additional formatting or parsing.
When to Use Pickle?
When working entirely within a Python environment
When you want a quick and exact snapshot of a DataFrame
Not suitable if the data needs to be shared with other tools or languages
Example:
import pandas as pd

# Create DataFrame
data_pickle = {
    "Name": ["Swathi", "Surya", "Chinnu"],
    "Age": [23, 27, 21],
    "City": ["Hyderabad", "Chennai", "Bangalore"]
}

df_pickle = pd.DataFrame(data_pickle)
print("DataFrame for Pickle:")
print(df_pickle)

# Save df_pickle to a Pickle file
df_pickle.to_pickle("students_info.pkl")
print("DataFrame saved to Pickle as 'students_info.pkl'")

# Load the Pickle file
df_pickle_loaded = pd.read_pickle("students_info.pkl")
print("DataFrame loaded from Pickle:")
print(df_pickle_loaded)


Output:

DataFrame for Pickle:
     Name  Age      City
0  Swathi   23  Hyderabad
1   Surya   27    Chennai
2  Chinnu   21  Bangalore

DataFrame saved to Pickle as 'students_info.pkl

DataFrame loaded from Pickle:
     Name  Age      City
0  Swathi   23  Hyderabad
1   Surya   27    Chennai
2  Chinnu   21  Bangalore


HDF5 Format
HDF5 (Hierarchical Data Format version 5) is a binary file format designed to store and organize large amounts of numerical data. 
It supports hierarchical storage (like folders within folders), compression, and selective data access.
Pandas provides built-in support for HDF5 using the PyTables library, which must be installed (pip install tables).
Save a DataFrame to an HDF5 File
To save a DataFrame to an HDF5 file, use the to_hdf() method:
df.to_hdf("data.h5", key="mydata", mode="w")
"data.h5" is the name of the HDF5 file.
key="mydata" defines the name (or path) under which the DataFrame is stored.
mode="w" specifies that the file should be written from scratch (overwrite mode).
Read from an HDF5 File
To read the HDF5 file and restore the DataFrame, use the read_hdf() function:
df_hdf = pd.read_hdf("data.h5", key="mydata")
print(df_hdf)
This reads the specific dataset stored under the key "mydata" from the HDF5 file and loads it into a DataFrame.
When to Use HDF5?
When dealing with large datasets that don’t fit entirely in memory
When you need random access to parts of the data (e.g., specific rows or columns)
When you want to store multiple datasets in a single file with a hierarchical structure
Suitable for scientific computing, time series, and numerical data storage
Example:
import pandas as pd

# Create DataFrame
data_hdf = {
    "Student": ["Swathi", "Surya", "Chinnu"],
    "Marks": [85, 92, 78],
    "Grade": ["A", "A+", "B"]
}

df_hdf = pd.DataFrame(data_hdf)
print("DataFrame for HDF5:")
print(df_hdf)

# Save df_hdf to an HDF5 file
df_hdf.to_hdf("student_grades.h5", key="grades", mode="w")
print("DataFrame saved to HDF5 as 'student_grades.h5'")

# Load the HDF5 file
df_hdf_loaded = pd.read_hdf("student_grades.h5", key="grades")
print("DataFrame loaded from HDF5:")
print(df_hdf_loaded)

Output:

DataFrame for HDF5:
  Student  Marks Grade
0  Swathi     85     A
1   Surya     92    A+
2  Chinnu     78     B

DataFrame saved to HDF5 as 'student_grades.h5'

DataFrame loaded from HDF5:
  Student  Marks Grade
0  Swathi     85     A
1   Surya     92    A+
2  Chinnu     78     B

Both Pickle and HDF5 are powerful options when working with large or complex datasets in Pandas. 
Choose Pickle for quick, Python-only tasks and HDF5 for scalable, multi-dataset storage in numerical or scientific computing.

Interacting with Web APIs
Modern data workflows often require fetching data from the web — especially through APIs (Application Programming Interfaces).
APIs allow you to access data provided by external services in a structured format like JSON.

Using APIs with Pandas offers several powerful advantages, especially for data analysis and automation:
Real-time data access: APIs allow you to fetch the most up-to-date information directly from a source. Whether it's weather data, stock prices, or social media feeds, you no longer have to rely on outdated or static files.
No manual downloads: Instead of downloading CSV or Excel files manually from websites or dashboards, APIs let you programmatically retrieve data in real-time. This reduces manual effort and improves productivity.
Seamless integration with Pandas: Most APIs return data in JSON format, which can be easily converted into a Pandas DataFrame. This means you can go directly from API to analysis, skipping time-consuming data preparation steps.
Automated workflows: You can set up scheduled scripts or data pipelines that fetch API data daily, hourly, or in real time, feeding directly into dashboards, models, or reports.
Let’s go step-by-step through the process of connecting to a public API, retrieving data, and converting it into a structured format using Pandas.

Step 1: Import the Required Libraries

First, import the necessary Python libraries. 
The requests library is used to send HTTP requests, while pandas is used for data manipulation.
         import requests
         import pandas as pd

Step 2: Send a Request to the API

Next, define the API endpoint you want to access. 
In this example, we are using a public testing API called JSONPlaceholder, which returns fake blog post data.

        url = "https://jsonplaceholder.typicode.com/posts"
        response = requests.get(url)  # Send a GET request to the API

The requests.get() function sends a request to the API URL and stores the response in a variable called response.

Step 3: Check the Response Status Code

        print(response.status_code)

If the output is 200, it means the request was successful and the API returned the data as expected. 
Other codes like 404 or 500 indicate errors such as "Not Found" or "Server Error."

Step 4: Convert the JSON Response into a Python Object

Most APIs return data in JSON format. 
To work with it in Python, you need to convert this JSON into a Python dictionary or list.

        data = response.json()

The response.json() method parses the returned JSON content and converts it into native Python data structures (usually a list of dictionaries).

Step 5: Load the Data into a Pandas DataFrame

Now that the data is in Python format, you can easily load it into a Pandas DataFrame:

        df = pd.DataFrame(data)

This step converts the list of dictionaries (each dictionary representing a record) into a tabular structure that Pandas can work with. 
Each key in the dictionary becomes a column in the DataFrame.

Interacting with Databases
In data science projects, structured data is often stored in databases. 
Analysts and data scientists frequently need to connect to these databases to extract, query, and analyze data using Python.
The pandas library, in combination with SQLAlchemy, allows seamless integration between Python and SQL databases. 
This enables users to run SQL queries and load results directly into Pandas DataFrames.

SQL databases play a crucial role in managing large volumes of structured data. 
They are used across industries such as finance, healthcare, e-commerce, and telecommunications for the following reasons:
SQL databases provide efficient data storage and retrieval mechanisms.
They allow querying large datasets quickly and precisely.
Analysts can work with both historical and real-time data for better decision-making.
Databases ensure data integrity, security, and scalability — which is essential for enterprise-level applications.
To work with SQL databases in Pandas, you’ll need to install the following libraries:
        pip install pandas sqlalchemy

Step 1: Connect to an SQL Database

You can connect to a database using SQLAlchemy's create_engine() function. 

from sqlalchemy import create_engine
# Creating a connection to a SQLite database
engine = create_engine('sqlite:///my_database.db')

For other databases, your connection string would change. For example:
MySQL: 'mysql+pymysql://user:password@localhost/dbname'
PostgreSQL: 'postgresql://user:password@localhost/dbname'
Step 2: Query and Load Data into Pandas

Once the connection is established, you can use SQL queries to fetch data from the database and load it into a Pandas DataFrame for analysis.

import pandas as pd

# Define a SQL query to fetch data
sql_query = 'SELECT * FROM your_table_name'

# Execute the query and store results in a DataFrame
df = pd.read_sql(sql_query, con=engine)

print("DataFrame from SQL Database:")
print(df.head())

Step 3: Executing SQL Commands from Python

You can also create new tables, insert data, or modify the schema directly using SQL statements.
For example, to create a new table:

create_table_sql = '''
CREATE TABLE IF NOT EXISTS new_table (
    id INTEGER PRIMARY KEY,
    name TEXT,
    age INTEGER
)
'''

# Execute the command using a database connection
with engine.connect() as connection:
    connection.execute(create_table_sql)

This code will create a new table named new_table with three columns: id, name, and age.

Example:

import pandas as pd
from sqlalchemy import create_engine, text

# Step 1: Create SQLite database connection using SQLAlchemy
engine = create_engine('sqlite:///people.db')

# Step 2: Create a new table (if not exists)
create_table_sql = '''
CREATE TABLE IF NOT EXISTS people (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT,
    age INTEGER,
    city TEXT
)
'''
with engine.connect() as connection:
    connection.execute(text(create_table_sql))

# Step 3: Insert sample data (Swathi, Surya, Chinnu)
insert_data_sql = '''
INSERT INTO people (name, age, city)
VALUES 
    ('Swathi', 28, 'Hyderabad'),
    ('Surya', 25, 'Chennai'),
    ('Chinnu', 22, 'Bangalore')
'''
with engine.connect() as connection:
    connection.execute(text(insert_data_sql))

# Step 4: Query the data using Pandas
df = pd.read_sql('SELECT * FROM people', con=engine)

# Step 5: Display the DataFrame
print("Data loaded from SQL database into Pandas:")
print(df)

Output:

 Data loaded from SQL database into Pandas:
   id    name  age       city
0   1  Swathi   28  Hyderabad
1   2   Surya   25    Chennai
2   3  Chinnu   22  Bangalore

Introduction to Pandas and Data Loading

What are pandas?

Pandas Data Structures

Essential Functionality

Program:df = df.rename(index={0: 'a', 1: 'b'}, inplace=True)Output: Name Age Incomea Dev 25 50000b Swathi 30 600002 Charlie 35 650003 Chinnu 40 700004 Surya 28 52000To delete labels (i.e., rows or columns) in a pandas DataFrame, you can use the drop() method.

Handling Missing DataMissing values (NaNs) can distort analysis and must be handled — either by removing or filling them. Function Description isnull() Detects missing values dropna() Removes missing rows/columns fillna() Fills missing values interpolate() Fills values using interpolation

Applying functionsYou can apply functions to rows, columns, or each element using:apply() – for rows/columns on series & DataFramesmap() – for Seriesapplymap() – for element-wise operations on DataFrames

Output:Squared Series:0 11 42 9dtype: int64

Program: applymap()import pandas as pddf = pd.DataFrame({ 'A': [1, 2], 'B': [3, 4]})new_df = df.applymap(lambda x: x * 10)print("Element-wise multiplied DataFrame:\n", new_df)Output:Element-wise multiplied DataFrame: A B0 10 301 20 40

Summarizing and Computing Descriptive Statistics

1. Basic Summary: describe()

2. Statistics on Specific Columns

3. groupby()

Data Loading, Storage, and File Formats

Reading and Writing Data in Text Format

1. Writing Data to a Text File

2. Reading Data from a Text File

3. Writing Only Specific Columns

4. Reading Large Text Files in Chunks

5. Exporting Data to Other Text Formats

Writing to a CSV File (Comma-Separated)

Writing to a TSV File (Tab-Separated)

Writing to a File with a Custom Delimiter (e.g., Pipe |)

6. Reading and Writing JSON Data

Writing Data to a JSON File

Reading Data from a JSON File

7. Reading and Writing TSV Files (Tab-Separated Values)

Binary Formats

Pickle Format

Save a DataFrame as a Pickle File

Read a Pickle File

When to Use Pickle?

Example:

HDF5 Format

Save a DataFrame to an HDF5 File

Read from an HDF5 File

When to Use HDF5?

Example:

Interacting with Web APIs

Interacting with Databases

Comments

Post a Comment

Popular posts from this blog

Introduction to Data Analysis & Python

Fundamentals of Python Programming & Numpy

Program:

df = df.rename(index={0: 'a', 1: 'b'}, inplace=True)

Output:

Name Age Income
a Dev 25 50000
b Swathi 30 60000
2 Charlie 35 65000
3 Chinnu 40 70000
4 Surya 28 52000

To delete labels (i.e., rows or columns) in a pandas DataFrame, you can use the `drop()` method.

Applying functions
You can apply functions to rows, columns, or each element using:
`apply()` – for rows/columns on series & DataFrames
`map()` – for Series
`applymap()` – for element-wise operations on DataFrames

Output:

Squared Series:
0 1
1 4
2 9
dtype: int64

Program: applymap()

import pandas as pd

df = pd.DataFrame({
'A': [1, 2],
'B': [3, 4]
})

new_df = df.applymap(lambda x: x * 10)
print("Element-wise multiplied DataFrame:\n", new_df)

Output:

Element-wise multiplied DataFrame:
A B
0 10 30
1 20 40

1. Basic Summary: `describe()`

Writing to a File with a Custom Delimiter (e.g., Pipe `|`)