Introduction to Pandas and Data Loading

 

What is Pandas?

Pandas is a powerful open-source data manipulation and analysis library for Python. Its core data structures, the Series and the DataFrame, are fast and flexible, making it easy to handle and analyze structured data, much like working with tables or spreadsheets.

Key Features

1. Data Structures
  • Series → A one-dimensional array-like object. It’s like a single column from a DataFrame.
  • DataFrame → This is the main structure in Pandas, similar to a table or spreadsheet. It consists of rows and columns, and you can easily manipulate, filter, and analyze data within it.

2. Data Manipulation
  • Handling missing data easily.
  • Data filtering, sorting, grouping, and aggregation.
  • Merging and joining datasets.
  • Time-series functionality (date parsing, resampling, etc.).
3. Performance
  • Built on NumPy, so it’s optimized for performance.
  • Vectorized operations for fast computation.
4. Integration
  • Works well with libraries like NumPy, Matplotlib, and Scikit-learn.
5. Indexing and Selection
  • Selecting rows and columns using labels or indices
  • Conditional selection using boolean indexing
6. Data handling
  • You can import/export data from various formats like CSV, Excel, SQL databases, JSON, HTML, etc.
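As a quick sketch of features 5 and 6 above, label-based selection and boolean indexing look like this (illustrative sample data, not from any real dataset):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Swathi', 'Surya', 'Chinnu'],
    'Age': [28, 32, 26]
})

# Label-based selection of a single column
print(df['Name'])

# Boolean indexing: keep only rows where Age is greater than 27
print(df[df['Age'] > 27])
```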

Advantages
1. Efficient Data Handling
  • Pandas provides fast, flexible, and expressive data structures (DataFrames and Series), which are optimized for performance. These structures allow you to perform complex data manipulation quickly and efficiently.
2. Easy to Use
  • The syntax is intuitive, making it easy to learn for beginners and efficient for experienced users. The API is highly user-friendly, allowing you to perform data manipulations with minimal code.
3. Handling Missing Data
  • Pandas provides built-in methods for dealing with missing data (NaN), such as filling, dropping, or forward/backward filling, which is critical for real-world data analysis.
4. Support for Various File Formats
  • Pandas can read and write data from a variety of formats: CSV, Excel, JSON, HDF5, SQL databases, and more. This makes it a great choice for integrating different data sources.
5. Integration with Other Libraries
  • Pandas integrates seamlessly with other Python libraries like NumPy (for numerical computations), Matplotlib/Seaborn (for visualization), and Scikit-learn (for machine learning), allowing for a full data science workflow.

Disadvantages

1. Memory Consumption
  • While Pandas is fast and powerful, it can be memory-intensive, especially when working with large datasets. The DataFrame structure can consume a lot of RAM, which can lead to performance issues on very large datasets (over a few GBs).
2. Learning Curve
  • While basic operations in Pandas are easy to learn, more advanced features (like multi-indexing, pivoting, or complex aggregation) can be tricky for newcomers to grasp.
3. Inconsistent Performance on Specific Operations
  • While Pandas is highly optimized, some operations (like string manipulations, or using apply() with custom functions) can be slower than vectorized operations with NumPy or more specialized libraries.
Applications

1. Data Analysis
  • Exploratory Data Analysis (EDA): Pandas is commonly used for EDA to analyze datasets, calculate statistics (mean, median, mode), and understand the distribution of data.
  • Data Summarization: It’s used for calculating summary statistics like average, standard deviation, etc., as well as grouping data by categories for aggregation.
2. Data Cleaning
  • Handling Missing Data: Removing, filling, or interpolating missing values in datasets.
  • Data Transformation: Renaming columns, changing data types, applying functions across columns/rows.
  • Dealing with Outliers: Detecting and handling outliers in datasets.
3. Machine Learning Data Preprocessing
  • Data Preprocessing: Before applying machine learning algorithms, data often needs to be cleaned, transformed, and standardized. Pandas is a key tool in preparing data for machine learning models.
  • Feature Engineering: Creating new features from existing data (e.g., time-based features, aggregating data) is done with Pandas.

Pandas Data Structures

Pandas provides two primary data structures:

1. Series

A Series is a one-dimensional labeled array that can hold any data type (integers, strings, floats, Python objects, etc.).

Key Features:
  • Homogeneous data (all elements are of the same type)
  • Associated labels (index)
  • Can be thought of like a column in a spreadsheet or SQL table
Series are generally created from:
  1. Arrays
  2. Lists
  3. Dictionaries
From Arrays:

import pandas as pd
import numpy as np

# Create a pandas Series from a NumPy array
arr = np.array([10, 20, 30, 40])
s = pd.Series(arr)
print(s)

Output:

0    10
1    20
2    30
3    40
dtype: int64
  • Left column is index
  • Right column is data
From Lists:

import pandas as pd

# Creating a Series from a list
s = pd.Series([10, 20, 30, 40])
print(s)

Output:

0    10
1    20
2    30
3    40
dtype: int64

From Dictionary:

# Create a pandas Series from a dictionary
dict_data = {'a': 100, 'b': 200, 'c': 300}
series_from_dict = pd.Series(dict_data)

print(series_from_dict)

Output:

a    100
b    200
c    300
dtype: int64
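A Series can also carry explicit labels instead of the default 0..n-1 index, by passing the index parameter (a small illustrative sketch):

```python
import pandas as pd

# Explicit labels instead of the default integer index
s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
print(s)

# Elements can be accessed by label or by position
print(s['y'])     # label-based access
print(s.iloc[0])  # position-based access
```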

2. DataFrames

A DataFrame is a two-dimensional labeled data structure with columns (like a spreadsheet or SQL table).

Key Features:
  • Heterogeneous data (each column can have a different data type)
  • Labeled axes (rows and columns)
  • Flexible indexing and powerful data manipulation
DataFrames in pandas can be created from:
  1. Lists
  2. Lists of tuples
  3. Dictionaries
  4. Excel spreadsheet files
  5. CSV (comma-separated values) files
From a list:

import pandas as pd

# List of lists
data = [['Swathi', 'Vizag'], ['Surya', 'Hyderabad'], ['Chinnu', 'Pune']]
df = pd.DataFrame(data, columns=['Name', 'City'])
print(df)

Output:

     Name       City
0  Swathi      Vizag
1   Surya  Hyderabad
2  Chinnu       Pune


From a list of tuples:

import pandas as pd
# List of tuples
data = [('Swathi', 'Vizag'), ('Surya', 'Hyderabad'), ('Chinnu', 'Pune')]
df = pd.DataFrame(data, columns=['Name', 'City'])
print(df)

Output:

     Name       City
0  Swathi      Vizag
1   Surya  Hyderabad
2  Chinnu       Pune

From a Dictionary

import pandas as pd

# Creating a DataFrame with your data
data = {
    'Name': ['Swathi', 'Surya', 'Chinnu'],
    'City': ['Vizag', 'Hyderabad', 'Pune'],
    'Age': [28, 32, 26],
    'Salary': [60000, 75000, 52000]
}

df = pd.DataFrame(data)
print(df)

Output:

     Name       City  Age  Salary
0  Swathi      Vizag   28   60000
1   Surya  Hyderabad   32   75000
2  Chinnu       Pune   26   52000

From an Excel File (.xlsx)

df = pd.read_excel('data.xlsx')  # install openpyxl if not already: pip install openpyxl
print(df)

You need to have the Excel file (data.xlsx) in your directory, or provide the full path.

From a CSV File (.csv)

df = pd.read_csv('data.csv')
print(df)

You need to have the CSV file (data.csv) in your directory, or provide the full path.

Essential Functionality

  • Viewing data
This helps in getting an overview of your DataFrame - like how many rows/columns it has, data types, null values, etc.

Function   Description
df.head(n)  Returns first n rows (default 5)
df.tail(n)  Returns last n rows
df.shape  Returns tuple of (rows, columns)
df.info()  Summary: non-null values, dtypes
df.describe()  Statistics for numeric columns
df.columns  List of column names
df.dtypes  Data types of each column

Program

import pandas as pd

data = {
    'Name': ['Dev', 'Swathi', 'Charlie', 'Chinnu', 'Surya'],
    'Age': [25, 30, 35, 40, 28],
    'Salary': [50000, 60000, 65000, 70000, 52000]
}

df = pd.DataFrame(data)

print("Head:\n", df.head())
print("Shape:", df.shape)
print("Info:")
df.info()
print("Describe:\n", df.describe())

Output:

Head:
      Name  Age  Salary
0      Dev   25   50000
1   Swathi   30   60000
2  Charlie   35   65000
3   Chinnu   40   70000
4    Surya   28   52000

Shape: (5, 3)

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    5 non-null      object
 1   Age     5 non-null      int64 
 2   Salary  5 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 248.0+ bytes

Describe:
              Age        Salary
count   5.000000      5.000000
mean   31.600000  59400.000000
std     5.941380   8473.488066
min    25.000000  50000.000000
25%    28.000000  52000.000000
50%    30.000000  60000.000000
75%    35.000000  65000.000000
max    40.000000  70000.000000

  • Renaming and Deleting Labels
To rename a label (i.e., a column name or an index label) in a pandas DataFrame, you can use the rename() method. 
rename() does not modify the original DataFrame unless you set inplace=True.

Program:

import pandas as pd

data = {
    'Name': ['Dev', 'Swathi', 'Charlie', 'Chinnu', 'Surya'],
    'Age': [25, 30, 35, 40, 28],
    'Salary': [50000, 60000, 65000, 70000, 52000]
}

df = pd.DataFrame(data)

df = df.rename(columns={'Salary': 'Income'})
print(df)

Output:

      Name  Age  Income
0      Dev   25   50000
1   Swathi   30   60000
2  Charlie   35   65000
3   Chinnu   40   70000
4    Surya   28   52000

To rename row index labels, use the index parameter:

Program:

df = df.rename(index={0: 'a', 1: 'b'})
print(df)

Output:

      Name  Age  Income
a      Dev   25   50000
b   Swathi   30   60000
2  Charlie   35   65000
3   Chinnu   40   70000
4    Surya   28   52000

To delete labels (i.e., rows or columns) in a pandas DataFrame, you can use the drop() method.

Program:

import pandas as pd

data = {
    'Name': ['Dev', 'Swathi', 'Charlie', 'Chinnu', 'Surya'],
    'Age': [25, 30, 35, 40, 28],
    'Salary': [50000, 60000, 65000, 70000, 52000]
}

df = pd.DataFrame(data)

# Delete the 'Salary' column
df = df.drop(columns=['Salary']) # without using inplace

# Delete the row with index 3
df = df.drop(index=3)

print(df)

Output:

      Name  Age
0      Dev   25
1   Swathi   30
2  Charlie   35
4    Surya   28

  • Handling Missing Data
Missing values (NaNs) can distort analysis and must be handled — either by removing or filling them.

Function Description
isnull() Detects missing values
dropna() Removes missing rows/columns
fillna() Fills missing values
interpolate() Fills values using interpolation

Program:

import pandas as pd
import numpy as np

data = {
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, 2, 3, 4]
}

df = pd.DataFrame(data)
print("Original:\n", df)

# Fill missing with 0
df_filled = df.fillna(0)
print("\nFilled:\n", df_filled)

# Drop rows with NaNs
df_dropped = df.dropna()
print("\nDropped:\n", df_dropped)

Output:

Original:
      A    B
0  1.0  NaN
1  2.0  2.0
2  NaN  3.0
3  4.0  4.0

Filled:
      A    B
0  1.0  0.0
1  2.0  2.0
2  0.0  3.0
3  4.0  4.0

Dropped:
      A    B
1  2.0  2.0
3  4.0  4.0
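The isnull() and interpolate() functions from the table can be sketched on the same data. Note that linear interpolation needs values on both sides of a gap, so the leading NaN in column B is left untouched by default:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, 2, 3, 4]
})

# Count missing values per column
print(df.isnull().sum())

# Linear interpolation: the NaN in A becomes 3.0 (midway between 2 and 4);
# the leading NaN in B stays NaN because it has no earlier neighbour
print(df.interpolate())
```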

  • Sorting Data
To sort data in a pandas DataFrame, use the sort_values() or sort_index() methods.

Function Description
sort_values() Sort by column values
sort_index() Sort by index
 
Program:

import pandas as pd

data = {
    'Name': ['Dev', 'Swathi', 'Charlie', 'Chinnu', 'Surya'],
    'Age': [25, 30, 35, 40, 28],
    'Salary': [50000, 60000, 65000, 70000, 52000]
}

df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

# Sort by Salary in descending order
df_sorted = df.sort_values(by='Salary', ascending=False)

# Print sorted DataFrame
print("\nSorted DataFrame by Salary (descending):\n", df_sorted)

Output:

Original DataFrame:
      Name  Age  Salary
0      Dev   25   50000
1   Swathi   30   60000
2  Charlie   35   65000
3   Chinnu   40   70000
4    Surya   28   52000

Sorted DataFrame by Salary (descending):
      Name  Age  Salary
3   Chinnu   40   70000
2  Charlie   35   65000
1   Swathi   30   60000
4    Surya   28   52000
0      Dev   25   50000

You can also sort by multiple columns, for example by Age (ascending) and then Salary (descending).

Program:

import pandas as pd

data = {
    'Name': ['Dev', 'Swathi', 'Charlie', 'Chinnu', 'Surya'],
    'Age': [25, 30, 30, 40, 30],
    'Salary': [50000, 60000, 55000, 70000, 52000]
}

df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

# Sort by Age (ascending), then Salary (descending)
df_sorted = df.sort_values(by=['Age', 'Salary'], ascending=[True, False])

print("\nSorted DataFrame by Age (asc) and Salary (desc):\n", df_sorted)

Output:

Original DataFrame:
      Name  Age  Salary
0      Dev   25   50000
1   Swathi   30   60000
2  Charlie   30   55000
3   Chinnu   40   70000
4    Surya   30   52000

Sorted DataFrame by Age (asc) and Salary (desc):
      Name  Age  Salary
0      Dev   25   50000
1   Swathi   30   60000
2  Charlie   30   55000
4    Surya   30   52000
3   Chinnu   40   70000

The data is first sorted by Age in ascending order.
For rows with the same Age, it then sorts by Salary in descending order.
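sort_index(), the other sorting method mentioned above, orders rows by their index labels rather than by column values (a minimal sketch):

```python
import pandas as pd

# A DataFrame whose index labels are out of order
df = pd.DataFrame({'Value': [10, 20, 30]}, index=['c', 'a', 'b'])

# Sort rows by index label (ascending by default)
print(df.sort_index())

# Sort in descending label order
print(df.sort_index(ascending=False))
```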

  • Applying functions
You can apply functions to rows, columns, or each element using:
  • apply() – applies a function along rows or columns of a DataFrame (or to a Series)
  • map() – element-wise operations on a Series
  • applymap() – element-wise operations on a DataFrame (in pandas 2.1+, applymap() is deprecated in favour of DataFrame.map())

Program: apply()

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [10, 20, 30]
})

# Apply sum row-wise (axis=1)
row_sums = df.apply(sum, axis=1)
print("Row sums:\n", row_sums)

# Apply max column-wise (axis=0)
col_max = df.apply(max, axis=0)
print("\nColumn max:\n", col_max)

Output:
Row sums:
0    11
1    22
2    33
dtype: int64

Column max:
A     3
B    30
dtype: int64

Program: map()

import pandas as pd

s = pd.Series([1, 2, 3])
squared = s.map(lambda x: x ** 2)
print("Squared Series:\n", squared)

Output:

Squared Series:
0     1
1     4
2     9
dtype: int64

Program: applymap()

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2],
    'B': [3, 4]
})

new_df = df.applymap(lambda x: x * 10)
print("Element-wise multiplied DataFrame:\n", new_df)

Output:

Element-wise multiplied DataFrame:
    A   B
0  10  30
1  20  40

Summarizing and Computing Descriptive Statistics

1. Basic Summary: describe()

The describe() method gives you summary statistics of your numerical columns by default.

import pandas as pd

data = {
    'Name': ['Dev', 'Swathi', 'Charlie', 'Chinnu', 'Surya'],
    'Age': [25, 30, 35, 40, 28],
    'Salary': [50000, 60000, 65000, 70000, 52000]
}

df = pd.DataFrame(data)

print(df.describe())

              Age        Salary
count   5.000000      5.000000
mean   31.600000  59400.000000
std     5.941380   8473.488066
min    25.000000  50000.000000
25%    28.000000  52000.000000
50%    30.000000  60000.000000
75%    35.000000  65000.000000
max    40.000000  70000.000000

Statistic Meaning
count Number of non-null values
mean Average value
std Standard deviation
min Minimum value
25% 1st quartile (25th percentile)
50% Median (2nd quartile, 50th percentile)
75% 3rd quartile (75th percentile)
max Maximum value

2. Statistics on Specific Columns

# Mean of Salary
print("Mean Salary:", df['Salary'].mean())

# Median of Age
print("Median Age:", df['Age'].median())

# Standard deviation (rounded for display)
print("Salary Std Dev:", round(df['Salary'].std(), 2))

Output:

Mean Salary: 59400.0
Median Age: 30.0
Salary Std Dev: 8473.49

Function Description Example
df.mean() Mean df['Age'].mean()
df.median() Median df['Salary'].median()
df.mode() Mode df['Age'].mode()
df.min() Minimum df['Salary'].min()
df.max() Maximum df['Salary'].max()
df.var() Variance df['Salary'].var()
df.std() Standard deviation df['Salary'].std()
df.count() Non-null count df.count()
df.sum() Total sum df['Salary'].sum()
df.skew() Skewness df.skew()
df.kurt() Kurtosis df.kurt()
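A few of these functions applied to a reduced version of the DataFrame used above (a quick sketch):

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 28],
    'Salary': [50000, 60000, 65000, 70000, 52000]
})

print("Total salary:", df['Salary'].sum())   # 297000
print("Minimum age:", df['Age'].min())       # 25
print("Non-null counts:\n", df.count())
```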

3. groupby()

The groupby() function splits the data into groups based on some criteria (like a column), and then you can apply an aggregate function (like sum(), mean(), count()).

Program:
import pandas as pd

data = {
    'Department': ['HR', 'IT', 'HR', 'Finance', 'IT', 'Finance'],
    'Name': ['Dev', 'Swathi', 'Charlie', 'Chinnu', 'Surya', 'Kiran'],
    'Salary': [50000, 60000, 52000, 70000, 65000, 62000],
    'Age': [25, 30, 35, 40, 28, 32]
}

df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

# Group by one column and aggregate
salary_by_dept = df.groupby('Department')['Salary'].sum()
print("\nTotal Salary by Department:\n", salary_by_dept)

# Multiple Aggregations per Group
salary_stats = df.groupby('Department')['Salary'].agg(['count', 'mean', 'max', 'min'])
print("\nSalary Statistics by Department:\n", salary_stats)

# Group by Multiple Columns
grouped = df.groupby(['Department', 'Age'])['Salary'].mean()
print("\nSalary by Department and Age:\n", grouped)

# Apply Different Aggregations to Multiple Columns
summary = df.groupby('Department').agg({
    'Salary': ['mean', 'max'],
    'Age': 'median'
})

print("\nCustom Aggregation:\n", summary)


Output:

Total Salary by Department:
Department
Finance    132000
HR         102000
IT         125000
Name: Salary, dtype: int64

Salary Statistics by Department:
            count     mean    max    min
Department
Finance         2  66000.0  70000  62000
HR              2  51000.0  52000  50000
IT              2  62500.0  65000  60000

Salary by Department and Age:
Department  Age
Finance     32    62000.0
            40    70000.0
HR          25    50000.0
            35    52000.0
IT          28    65000.0
            30    60000.0
Name: Salary, dtype: float64

Custom Aggregation:
             Salary            Age
               mean    max median
Department
Finance     66000.0  70000   36.0
HR          51000.0  52000   30.0
IT          62500.0  65000   29.0

Data Loading, Storage, and File Formats

Reading and Writing Data in Text Format

The Pandas library in Python makes it extremely easy to handle structured text data such as CSV, TSV, and TXT files. Whether you are saving DataFrames to files or reading large datasets efficiently, Pandas provides powerful and user-friendly functions to manage text-based data. 

1. Writing Data to a Text File

The to_csv() function in Pandas allows you to write a DataFrame to a text file with a chosen delimiter.
import pandas as pd

# Create a simple DataFrame
data = {
    "Name": ["Swati", "Rahul", "Ananya"],
    "Age": [25, 30, 22],
    "City": ["Hyderabad", "Delhi", "Mumbai"]
}
df = pd.DataFrame(data)

# Save to a text file (tab-separated)
df.to_csv("data.txt", sep="\t", index=False)
print("Data written to data.txt")

In the code above:

  • The sep="\t" argument sets the delimiter to a tab character. You can also use "," for CSV files, " " for space-separated files, or any other custom delimiter.

  • The index=False argument prevents Pandas from writing the default row numbers to the file, resulting in a cleaner output.

The resulting file (data.txt) will contain:

Name    Age     City
Swati   25      Hyderabad
Rahul   30      Delhi
Ananya  22      Mumbai

2. Reading Data from a Text File

To read the data back into a DataFrame, use the read_csv() function with the same separator that was used while writing the file.

df_read = pd.read_csv("data.txt", sep="\t")
print(df_read)

This reads the contents of the tab-separated file into a DataFrame and prints it.

3. Writing Only Specific Columns

You can also write only selected columns from your DataFrame to a text file. For example:

df.to_csv("names.txt", columns=["Name", "City"], sep=" ", index=False)

This code saves only the "Name" and "City" columns using a space as the separator. The output will look like:

Name City
Swati Hyderabad
Rahul Delhi
Ananya Mumbai

This is useful when you don’t need to save the entire dataset.

4. Reading Large Text Files in Chunks

When working with very large text files, loading the entire file at once may not be memory-efficient. Pandas allows you to read large files in chunks using the chunksize parameter.

for chunk in pd.read_csv("data.txt", sep="\t", chunksize=2):
    print(chunk)

This code reads the file in chunks of 2 rows at a time and prints each chunk. This method helps reduce memory usage when dealing with large datasets.

Output:
    Name  Age       City
0  Swati   25  Hyderabad
1  Rahul   30      Delhi
     Name  Age    City
2  Ananya   22  Mumbai

Each chunk is a smaller DataFrame that can be processed independently.
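Because each chunk is an ordinary DataFrame, you can aggregate across chunks without ever holding the whole file in memory. The sketch below first writes the same small sample file so it is self-contained, then totals the Age column chunk by chunk:

```python
import pandas as pd

# Write a small sample file so the example is self-contained
pd.DataFrame({
    "Name": ["Swati", "Rahul", "Ananya"],
    "Age": [25, 30, 22],
    "City": ["Hyderabad", "Delhi", "Mumbai"]
}).to_csv("data.txt", sep="\t", index=False)

# Accumulate a running total, two rows at a time
total_age = 0
for chunk in pd.read_csv("data.txt", sep="\t", chunksize=2):
    total_age += chunk["Age"].sum()

print("Total age across all chunks:", total_age)  # 25 + 30 + 22 = 77
```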


5. Exporting Data to Other Text Formats

You can export data to different text-based formats simply by changing the delimiter in the to_csv() function.

Writing to a CSV File (Comma-Separated)

df.to_csv("data.csv", index=False)

This saves the file in the common CSV format where columns are separated by commas.

Output:

Name,Age,City
Swati,25,Hyderabad
Rahul,30,Delhi
Ananya,22,Mumbai

Writing to a TSV File (Tab-Separated)

df.to_csv("data.tsv", sep="\t", index=False)

This saves the file using tabs to separate columns.

Output:

Name    Age     City
Swati   25      Hyderabad
Rahul   30      Delhi
Ananya  22      Mumbai

Writing to a File with a Custom Delimiter (e.g., Pipe |)

df.to_csv("data_pipe.txt", sep="|", index=False)

This will create a file where columns are separated by the pipe (|) character.

Output:

Name|Age|City
Swati|25|Hyderabad
Rahul|30|Delhi
Ananya|22|Mumbai

6. Reading and Writing JSON Data

In addition to working with CSV and text files, Pandas also provides full support for JSON (JavaScript Object Notation) — a popular format for APIs and web data.

Writing Data to a JSON File

You can easily convert a DataFrame into a JSON file using the to_json() method.

df.to_json("data.json", orient="records", lines=True)
  • orient="records": Each row is converted into a separate JSON object (dictionary).
  • lines=True: Each JSON object is written on a new line (known as JSON Lines format), which is efficient for large files and commonly used in data pipelines.

Output in data.json:

{"Name":"Swati","Age":25,"City":"Hyderabad"} {"Name":"Rahul","Age":30,"City":"Delhi"} {"Name":"Ananya","Age":22,"City":"Mumbai"}

You can also write the entire file as a single JSON object (array of records):

df.to_json("data_array.json", orient="records", lines=False)

This will produce:

[ {"Name":"Swati","Age":25,"City":"Hyderabad"}, {"Name":"Rahul","Age":30,"City":"Delhi"}, {"Name":"Ananya","Age":22,"City":"Mumbai"} ]

Reading Data from a JSON File

To read a JSON file into a DataFrame, use pd.read_json():

# For a standard JSON array
df_json = pd.read_json("data_array.json")
print(df_json)

If the JSON file uses the JSON Lines format, include the lines=True argument:

df_jsonl = pd.read_json("data.json", lines=True)
print(df_jsonl)

7. Reading and Writing TSV Files (Tab-Separated Values)

TSV (Tab-Separated Values) files are very similar to CSV files, with the key difference being that they use a tab character (\t) instead of a comma to separate fields.
This format is especially useful in domains such as bioinformatics, natural language processing, and spreadsheet exports where the data may already contain commas, making the CSV format less reliable.
To write a DataFrame to a TSV file using Pandas, you can use the to_csv() function with the sep="\t" argument.
For example:
df.to_csv("data.tsv", sep="\t", index=False)
In this code, sep="\t" tells Pandas to use a tab character as the field separator, and index=False prevents the index column from being written to the file.
The output in the data.tsv file will look like this:
Name    Age     City
Swati   25      Hyderabad
Rahul   30      Delhi
Ananya  22      Mumbai
This format is clean, easy to read, and more robust than CSV when your data contains commas within fields.
To read a TSV file back into a DataFrame, you can use the read_csv() function with the same sep="\t" argument:
df_tsv = pd.read_csv("data.tsv", sep="\t")
print(df_tsv)

This command correctly interprets the tab-separated values in the file and loads them into a DataFrame with the appropriate column structure.

Binary Formats

Pandas supports efficient binary file formats such as Pickle and HDF5.
These formats are faster to read and write, preserve data types, and are better suited for handling large datasets in memory or on disk.

Pickle Format

Pickle is Python’s built-in binary serialization format.
It is used to save and load Python objects — including Pandas DataFrames — in a compact, byte-stream format.
Pickle files are Python-specific, which means they may not be compatible across different programming languages or platforms.

Save a DataFrame as a Pickle File

To save a DataFrame as a Pickle file, use the to_pickle() method:
df.to_pickle("data.pkl")
This command serializes the DataFrame object and writes it to a file called data.pkl. It saves everything — including column data types, index labels, and metadata — exactly as it is.

Read a Pickle File

To read the Pickle file back into a DataFrame, use the read_pickle() function:
df_loaded = pd.read_pickle("data.pkl")
print(df_loaded)
This deserializes the binary data and restores the original DataFrame structure and content.
The process is very fast and does not require any additional formatting or parsing.

When to Use Pickle?

  • When working entirely within a Python environment
  • When you want a quick and exact snapshot of a DataFrame
  • Not suitable if the data needs to be shared with other tools or languages

Example:

import pandas as pd

# Create DataFrame
data_pickle = {
    "Name": ["Swathi", "Surya", "Chinnu"],
    "Age": [23, 27, 21],
    "City": ["Hyderabad", "Chennai", "Bangalore"]
}
df_pickle = pd.DataFrame(data_pickle)
print("DataFrame for Pickle:")
print(df_pickle)

# Save df_pickle to a Pickle file
df_pickle.to_pickle("students_info.pkl")
print("DataFrame saved to Pickle as 'students_info.pkl'")

# Load the Pickle file
df_pickle_loaded = pd.read_pickle("students_info.pkl")
print("DataFrame loaded from Pickle:")
print(df_pickle_loaded)


Output:

DataFrame for Pickle: Name Age City 0 Swathi 23 Hyderabad 1 Surya 27 Chennai 2 Chinnu 21 Bangalore

DataFrame saved to Pickle as 'students_info.pkl

DataFrame loaded from Pickle: Name Age City 0 Swathi 23 Hyderabad 1 Surya 27 Chennai 2 Chinnu 21 Bangalore


HDF5 Format

HDF5 (Hierarchical Data Format version 5) is a binary file format designed to store and organize large amounts of numerical data.
It supports hierarchical storage (like folders within folders), compression, and selective data access.
Pandas provides built-in support for HDF5 using the PyTables library, which must be installed (pip install tables).

Save a DataFrame to an HDF5 File

To save a DataFrame to an HDF5 file, use the to_hdf() method:
df.to_hdf("data.h5", key="mydata", mode="w")
  • "data.h5" is the name of the HDF5 file.
  • key="mydata" defines the name (or path) under which the DataFrame is stored.
  • mode="w" specifies that the file should be written from scratch (overwrite mode).

Read from an HDF5 File

To read the HDF5 file and restore the DataFrame, use the read_hdf() function:
df_hdf = pd.read_hdf("data.h5", key="mydata")
print(df_hdf)

This reads the specific dataset stored under the key "mydata" from the HDF5 file and loads it into a DataFrame.

When to Use HDF5?

  • When dealing with large datasets that don’t fit entirely in memory
  • When you need random access to parts of the data (e.g., specific rows or columns)
  • When you want to store multiple datasets in a single file with a hierarchical structure
  • Suitable for scientific computing, time series, and numerical data storage

Example:

import pandas as pd

# Create DataFrame
data_hdf = {
    "Student": ["Swathi", "Surya", "Chinnu"],
    "Marks": [85, 92, 78],
    "Grade": ["A", "A+", "B"]
}
df_hdf = pd.DataFrame(data_hdf)
print("DataFrame for HDF5:")
print(df_hdf)

# Save df_hdf to an HDF5 file
df_hdf.to_hdf("student_grades.h5", key="grades", mode="w")
print("DataFrame saved to HDF5 as 'student_grades.h5'")

# Load the HDF5 file
df_hdf_loaded = pd.read_hdf("student_grades.h5", key="grades")
print("DataFrame loaded from HDF5:")
print(df_hdf_loaded)

Output:

DataFrame for HDF5: Student Marks Grade 0 Swathi 85 A 1 Surya 92 A+ 2 Chinnu 78 B

DataFrame saved to HDF5 as 'student_grades.h5'

DataFrame loaded from HDF5: Student Marks Grade 0 Swathi 85 A 1 Surya 92 A+ 2 Chinnu 78 B

Both Pickle and HDF5 are powerful options when working with large or complex datasets in Pandas.
Choose Pickle for quick, Python-only tasks and HDF5 for scalable, multi-dataset storage in numerical or scientific computing.

Interacting with Web APIs

Modern data workflows often require fetching data from the web — especially through APIs (Application Programming Interfaces).
APIs allow you to access data provided by external services in a structured format like JSON.

Using APIs with Pandas offers several powerful advantages, especially for data analysis and automation:
  1. Real-time data access: APIs allow you to fetch the most up-to-date information directly from a source. Whether it's weather data, stock prices, or social media feeds, you no longer have to rely on outdated or static files.
  2. No manual downloads: Instead of downloading CSV or Excel files manually from websites or dashboards, APIs let you programmatically retrieve data in real-time. This reduces manual effort and improves productivity.
  3. Seamless integration with Pandas: Most APIs return data in JSON format, which can be easily converted into a Pandas DataFrame. This means you can go directly from API to analysis, skipping time-consuming data preparation steps.
  4. Automated workflows: You can set up scheduled scripts or data pipelines that fetch API data daily, hourly, or in real time, feeding directly into dashboards, models, or reports.
Let’s go step-by-step through the process of connecting to a public API, retrieving data, and converting it into a structured format using Pandas.

Step 1: Import the Required Libraries

First, import the necessary Python libraries.
The requests library is used to send HTTP requests, while pandas is used for data manipulation.
        import requests
        import pandas as pd

Step 2: Send a Request to the API

Next, define the API endpoint you want to access.
In this example, we are using a public testing API called JSONPlaceholder, which returns fake blog post data.

        url = "https://jsonplaceholder.typicode.com/posts"
        response = requests.get(url)  # Send a GET request to the API

The requests.get() function sends a request to the API URL and stores the response in a variable called response.

Step 3: Check the Response Status Code

        print(response.status_code)

If the output is 200, it means the request was successful and the API returned the data as expected.
Other codes like 404 or 500 indicate errors such as "Not Found" or "Server Error."

Step 4: Convert the JSON Response into a Python Object

Most APIs return data in JSON format.
To work with it in Python, you need to convert this JSON into a Python dictionary or list.

        data = response.json()

The response.json() method parses the returned JSON content and converts it into native Python data structures (usually a list of dictionaries).

Step 5: Load the Data into a Pandas DataFrame

Now that the data is in Python format, you can easily load it into a Pandas DataFrame:

        df = pd.DataFrame(data)

This step converts the list of dictionaries (each dictionary representing a record) into a tabular structure that Pandas can work with.
Each key in the dictionary becomes a column in the DataFrame.
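Putting the steps together: because a live request depends on network access, this sketch parses a hard-coded JSON string shaped like the JSONPlaceholder response (the records and their values here are made up) instead of calling requests.get():

```python
import json
import pandas as pd

# A hard-coded stand-in for the text returned by the API
raw_json = '''
[
  {"userId": 1, "id": 1, "title": "first post", "body": "hello"},
  {"userId": 1, "id": 2, "title": "second post", "body": "world"}
]
'''

# Equivalent of data = response.json()
data = json.loads(raw_json)

# Each dictionary becomes a row; each key becomes a column
df = pd.DataFrame(data)
print(df)
```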

Interacting with Databases

In data science projects, structured data is often stored in databases.
Analysts and data scientists frequently need to connect to these databases to extract, query, and analyze data using Python.
The pandas library, in combination with SQLAlchemy, allows seamless integration between Python and SQL databases.
This enables users to run SQL queries and load results directly into Pandas DataFrames.

SQL databases play a crucial role in managing large volumes of structured data.
They are used across industries such as finance, healthcare, e-commerce, and telecommunications for the following reasons:
  • SQL databases provide efficient data storage and retrieval mechanisms.
  • They allow querying large datasets quickly and precisely.
  • Analysts can work with both historical and real-time data for better decision-making.
  • Databases ensure data integrity, security, and scalability — which is essential for enterprise-level applications.
To work with SQL databases in Pandas, you’ll need to install the following libraries:
        pip install pandas sqlalchemy

Step 1: Connect to an SQL Database

You can connect to a database using SQLAlchemy's create_engine() function.

from sqlalchemy import create_engine

# Creating a connection to a SQLite database
engine = create_engine('sqlite:///my_database.db')

For other databases, your connection string would change. For example:
  • MySQL: 'mysql+pymysql://user:password@localhost/dbname'
  • PostgreSQL: 'postgresql://user:password@localhost/dbname'
Step 2: Query and Load Data into Pandas
Once the connection is established, you can use SQL queries to fetch data from the database and load it into a Pandas DataFrame for analysis.

import pandas as pd

# Define a SQL query to fetch data
sql_query = 'SELECT * FROM your_table_name'

# Execute the query and store results in a DataFrame
df = pd.read_sql(sql_query, con=engine)
print("DataFrame from SQL Database:")
print(df.head())

Step 3: Executing SQL Commands from Python

You can also create new tables, insert data, or modify the schema directly using SQL statements.
For example, to create a new table:

from sqlalchemy import text

create_table_sql = '''
CREATE TABLE IF NOT EXISTS new_table (
    id INTEGER PRIMARY KEY,
    name TEXT,
    age INTEGER
)
'''

# Execute the command using a database connection
# (SQLAlchemy 2.x requires raw SQL strings to be wrapped in text();
# engine.begin() opens a transaction and commits it automatically)
with engine.begin() as connection:
    connection.execute(text(create_table_sql))

This code will create a new table named new_table with three columns: id, name, and age.

Example:

import pandas as pd
from sqlalchemy import create_engine, text

# Step 1: Create SQLite database connection using SQLAlchemy
engine = create_engine('sqlite:///people.db')

# Step 2: Create a new table (if it does not exist)
create_table_sql = '''
CREATE TABLE IF NOT EXISTS people (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT,
    age INTEGER,
    city TEXT
)
'''

# Step 3: Insert sample data (Swathi, Surya, Chinnu)
insert_data_sql = '''
INSERT INTO people (name, age, city) VALUES
    ('Swathi', 28, 'Hyderabad'),
    ('Surya', 25, 'Chennai'),
    ('Chinnu', 22, 'Bangalore')
'''

# engine.begin() opens a transaction and commits it on exit
with engine.begin() as connection:
    connection.execute(text(create_table_sql))
    connection.execute(text(insert_data_sql))

# Step 4: Query the data using Pandas
df = pd.read_sql('SELECT * FROM people', con=engine)

# Step 5: Display the DataFrame
print("Data loaded from SQL database into Pandas:")
print(df)

Output:

Data loaded from SQL database into Pandas:
   id    name  age       city
0   1  Swathi   28  Hyderabad
1   2   Surya   25    Chennai
2   3  Chinnu   22  Bangalore





