Introduction to Pandas and Data Loading
What is Pandas?
- Series → A one-dimensional array-like object. It’s like a single column from a DataFrame.
- DataFrame → This is the main structure in Pandas, similar to a table or spreadsheet. It consists of rows and columns, and you can easily manipulate, filter, and analyze data within it.
- Handling missing data easily.
- Data filtering, sorting, grouping, and aggregation.
- Merging and joining datasets.
- Time-series functionality (date parsing, resampling, etc.).
- Built on NumPy, so it’s optimized for performance.
- Vectorized operations for fast computation.
- Works well with libraries like NumPy, Matplotlib, and Scikit-learn.
- Selecting rows and columns using labels or indices
- Conditional selection using boolean indexing
- You can import/export data from various formats like CSV, Excel, SQL databases, JSON, HTML, etc.
- Pandas provides fast, flexible, and expressive data structures (DataFrames and Series), which are optimized for performance. These structures allow you to perform complex data manipulation quickly and efficiently.
- The syntax is intuitive, making it easy to learn for beginners and efficient for experienced users. The API is highly user-friendly, allowing you to perform data manipulations with minimal code.
- Pandas provides built-in methods for dealing with missing data (NaN), such as filling, dropping, or forward/backward filling, which is critical for real-world data analysis.
- Pandas can read and write data from a variety of formats: CSV, Excel, JSON, HDF5, SQL databases, and more. This makes it a great choice for integrating different data sources.
- Pandas integrates seamlessly with other Python libraries like NumPy (for numerical computations), Matplotlib/Seaborn (for visualization), and Scikit-learn (for machine learning), allowing for a full data science workflow.
- While Pandas is fast and powerful, it can be memory-intensive, especially when working with large datasets. The DataFrame structure can consume a lot of RAM, which can lead to performance issues on very large datasets (over a few GBs).
- While basic operations in Pandas are easy to learn, more advanced features (like multi-indexing, pivoting, or complex aggregation) can be tricky for newcomers to grasp.
- While Pandas is highly optimized, some operations (like string manipulations, or using apply() with custom functions) can be slower than vectorized operations with NumPy or more specialized libraries.
- Exploratory Data Analysis (EDA): Pandas is commonly used for EDA to analyze datasets, calculate statistics (mean, median, mode), and understand the distribution of data.
- Data Summarization: It’s used for calculating summary statistics like average, standard deviation, etc., as well as grouping data by categories for aggregation.
- Handling Missing Data: Removing, filling, or interpolating missing values in datasets.
- Data Transformation: Renaming columns, changing data types, applying functions across columns/rows.
- Dealing with Outliers: Detecting and handling outliers in datasets.
- Data Preprocessing: Before applying machine learning algorithms, data often needs to be cleaned, transformed, and standardized. Pandas is a key tool in preparing data for machine learning models.
- Feature Engineering: Creating new features from existing data (e.g., time-based features, aggregating data) is done with Pandas.
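The list above mentions selecting rows and columns by label or index, and conditional selection with boolean indexing. A minimal sketch of both follows; the sample data and the index labels "a", "b", "c" are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame(
    {"Name": ["Dev", "Swathi", "Charlie"], "Age": [25, 30, 35]},
    index=["a", "b", "c"],
)

# Label-based selection: row "b", column "Age".
age_b = df.loc["b", "Age"]

# Position-based selection: first row, first column.
first_name = df.iloc[0, 0]

# Boolean indexing: keep only the rows where Age is greater than 28.
older = df[df["Age"] > 28]

print(age_b, first_name)
print(older)
```

Here loc works with labels, iloc works with integer positions, and the comparison df["Age"] > 28 produces a True/False mask that filters the rows.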
Pandas Data Structures
- Homogeneous data (all elements are of the same type)
- Associated labels (index)
- Can be thought of like a column in a spreadsheet or SQL table
- Arrays
- Lists
- Dict
- Left column is index
- Right column is data
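As a rough illustration of the points above, a Series can be built from a NumPy array, a list, or a dict. The values and labels below are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# From a NumPy array: pandas assigns a default integer index (0, 1, 2).
s_from_array = pd.Series(np.array([10, 20, 30]))

# From a list, with custom index labels.
s_from_list = pd.Series([10, 20, 30], index=["a", "b", "c"])

# From a dict: the keys become the index labels.
s_from_dict = pd.Series({"x": 1.5, "y": 2.5})

# When printed, the left column is the index and the right column is the data.
print(s_from_list)
```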
- Heterogeneous data (each column can have a different data type)
- Labeled axes (rows and columns)
- Flexible indexing and powerful data manipulation
- List
- List of tuples
- Dictionary
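A short sketch of the three in-memory sources listed above (list, list of tuples, dictionary). The column names and values are illustrative assumptions.

```python
import pandas as pd

# From a list of lists: each inner list is one row.
df_from_list = pd.DataFrame([["Dev", 25], ["Swathi", 30]], columns=["Name", "Age"])

# From a list of tuples: each tuple is one row.
df_from_tuples = pd.DataFrame([("Dev", 25), ("Swathi", 30)], columns=["Name", "Age"])

# From a dictionary: each key is a column name, each value is that column's data.
df_from_dict = pd.DataFrame({"Name": ["Dev", "Swathi"], "Age": [25, 30]})

print(df_from_dict)
```

All three calls should produce the same two-row table; the dictionary form is usually the most readable when you already think in columns.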
- Excel Spreadsheet files
- CSV (comma-separated values) files
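A DataFrame can also be loaded straight from a file. The sketch below writes a tiny CSV file into a temporary folder and reads it back with pd.read_csv(); the file contents are made up, and the temporary folder is only there so the example runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a small hypothetical CSV file to read.
    csv_path = Path(tmp_dir) / "data.csv"
    csv_path.write_text("Name,Age\nDev,25\nSwathi,30\n")

    # Read the CSV file into a DataFrame.
    df = pd.read_csv(csv_path)

# For Excel files, pd.read_excel("data.xlsx") is the counterpart, but it
# needs an Excel engine such as openpyxl to be installed.
print(df)
```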
To read an Excel file (.xlsx), make sure the file (data.xlsx) is in your directory, or provide the full path. Likewise, for a CSV file, make sure the file (data.csv) is in your directory, or provide the full path.
Essential Functionality
- Viewing data
This helps in getting an overview of your DataFrame - like how many rows/columns it has, data types, null values, etc.
| Function | Description |
|---|---|
| df.head(n) | Returns first n rows (default 5) |
| df.tail(n) | Returns last n rows |
| df.shape | Returns tuple of (rows, columns) |
| df.info() | Summary: non-null values, dtypes |
| df.describe() | Statistics for numeric columns |
| df.columns | List of column names |
| df.dtypes | Data types of each column |
Program:
import pandas as pd
data = {
    'Name': ['Dev', 'Swathi', 'Charlie', 'Chinnu', 'Surya'],
    'Age': [25, 30, 35, 40, 28],
    'Salary': [50000, 60000, 65000, 70000, 52000]
}
df = pd.DataFrame(data)
print("Head:\n", df.head())
print("Shape:", df.shape)
print("Info:")
df.info()
print("Describe:\n", df.describe())
Output:
Head:
      Name  Age  Salary
0      Dev   25   50000
1   Swathi   30   60000
2  Charlie   35   65000
3   Chinnu   40   70000
4    Surya   28   52000
Shape: (5, 3)
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    5 non-null      object
 1   Age     5 non-null      int64
 2   Salary  5 non-null      int64
dtypes: int64(2), object(1)
memory usage: 248.0+ bytes
Describe:
             Age        Salary
count   5.000000      5.000000
mean   31.600000  59400.000000
std     5.941380   8473.488066
min    25.000000  50000.000000
25%    28.000000  52000.000000
50%    30.000000  60000.000000
75%    35.000000  65000.000000
max    40.000000  70000.000000
- Renaming and Deleting Labels
To rename a label (i.e., a column name or an index label) in a pandas DataFrame, you can use the rename() method. rename() does not modify the original DataFrame unless you set inplace=True.
Program:
import pandas as pd
data = { 'Name': ['Dev', 'Swathi', 'Charlie', 'Chinnu', 'Surya'], 'Age': [25, 30, 35, 40, 28], 'Salary': [50000, 60000, 65000, 70000, 52000]}
df = pd.DataFrame(data)
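# rename() returns a new DataFrame, so assign the result.
# Do not combine assignment with inplace=True: in that case rename() returns None.
df = df.rename(columns={'Salary': 'Income'})
print(df)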
Output:
      Name  Age  Income
0      Dev   25   50000
1   Swathi   30   60000
2  Charlie   35   65000
3   Chinnu   40   70000
4    Surya   28   52000
To rename row index labels, use the index parameter:
Program:
# Assign the returned DataFrame instead of using inplace=True (which returns None).
df = df.rename(index={0: 'a', 1: 'b'})
print(df)
Output:
      Name  Age  Income
a      Dev   25   50000
b   Swathi   30   60000
2  Charlie   35   65000
3   Chinnu   40   70000
4    Surya   28   52000
To delete labels (i.e., rows or columns) in a pandas DataFrame, you can use the drop() method.
Program:
import pandas as pd
data = { 'Name': ['Dev', 'Swathi', 'Charlie', 'Chinnu', 'Surya'], 'Age': [25, 30, 35, 40, 28], 'Salary': [50000, 60000, 65000, 70000, 52000]}
df = pd.DataFrame(data)
# Delete the 'Salary' column (without using inplace)
df = df.drop(columns=['Salary'])
# Delete the row with index 3
df = df.drop(index=3)
print(df)
Output:
      Name  Age
0      Dev   25
1   Swathi   30
2  Charlie   35
4    Surya   28
- Handling Missing Data
Missing values (NaNs) can distort analysis and must be handled — either by removing or filling them.
| Function | Description |
|---|---|
| isnull() | Detects missing values |
| dropna() | Removes missing rows/columns |
| fillna() | Fills missing values |
| interpolate() | Fills values using interpolation |
Program:
import pandas as pd
import numpy as np
data = {
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, 2, 3, 4]
}
df = pd.DataFrame(data)
print("Original:\n", df)
# Fill missing with 0
df_filled = df.fillna(0)
print("\nFilled:\n", df_filled)
# Drop rows with NaNs
df_dropped = df.dropna()
print("\nDropped:\n", df_dropped)
Output:
Original:
     A    B
0  1.0  NaN
1  2.0  2.0
2  NaN  3.0
3  4.0  4.0
Filled:
     A    B
0  1.0  0.0
1  2.0  2.0
2  0.0  3.0
3  4.0  4.0
Dropped:
     A    B
1  2.0  2.0
3  4.0  4.0
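The program above covers fillna() and dropna(). A small sketch of the other two functions from the table, isnull() and interpolate(), could look like this. It reuses the same sample values and relies on the default behavior of interpolate(), which is linear and fills forward only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan, 4], "B": [np.nan, 2, 3, 4]})

# isnull() marks each missing cell as True; summing counts them per column.
missing_counts = df.isnull().sum()
print("Missing per column:\n", missing_counts)

# Linear interpolation fills NaNs that sit between known values.
# The leading NaN in column B has no earlier value to interpolate from,
# so with the default settings it stays NaN.
df_interp = df.interpolate()
print("\nInterpolated:\n", df_interp)
```

Interpolation tends to suit ordered numeric data (for example time series), while fillna() with a constant is the simpler choice when the order of rows carries no meaning.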
- Sorting Data
To sort data in a pandas DataFrame, use the sort_values() or sort_index() method.
| Function | Description |
|---|---|
| sort_values() | Sort by column values |
| sort_index() | Sort by index |
Program:
import pandas as pd
data = { 'Name': ['Dev', 'Swathi', 'Charlie', 'Chinnu', 'Surya'], 'Age': [25, 30, 35, 40, 28], 'Salary': [50000, 60000, 65000, 70000, 52000]}
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)
# Sort by Salary in descending order
df_sorted = df.sort_values(by='Salary', ascending=False)
# Print sorted DataFrame
print("\nSorted DataFrame by Salary (descending):\n", df_sorted)
Output:
Original DataFrame:
      Name  Age  Salary
0      Dev   25   50000
1   Swathi   30   60000
2  Charlie   35   65000
3   Chinnu   40   70000
4    Surya   28   52000
Sorted DataFrame by Salary (descending):
      Name  Age  Salary
3   Chinnu   40   70000
2  Charlie   35   65000
1   Swathi   30   60000
4    Surya   28   52000
0      Dev   25   50000
You can sort multiple columns. Sort by Age (ascending), then Salary (descending).
Program:
import pandas as pd
data = { 'Name': ['Dev', 'Swathi', 'Charlie', 'Chinnu', 'Surya'], 'Age': [25, 30, 30, 40, 30], 'Salary': [50000, 60000, 55000, 70000, 52000]}
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)
# Sort by Age (ascending), then Salary (descending)
df_sorted = df.sort_values(by=['Age', 'Salary'], ascending=[True, False])
print("\nSorted DataFrame by Age (asc) and Salary (desc):\n", df_sorted)
Output:
Original DataFrame:
      Name  Age  Salary
0      Dev   25   50000
1   Swathi   30   60000
2  Charlie   30   55000
3   Chinnu   40   70000
4    Surya   30   52000
Sorted DataFrame by Age (asc) and Salary (desc):
      Name  Age  Salary
0      Dev   25   50000
1   Swathi   30   60000
2  Charlie   30   55000
4    Surya   30   52000
3   Chinnu   40   70000
The data is first sorted by Age in ascending order. For rows with the same Age, it then sorts by Salary in descending order.
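Both programs above use sort_values(). For completeness, here is a minimal sketch of sort_index(), using a made-up DataFrame whose index is deliberately out of order.

```python
import pandas as pd

# Hypothetical data with a shuffled index.
df = pd.DataFrame(
    {"Name": ["Dev", "Swathi", "Charlie"], "Age": [25, 30, 35]},
    index=[2, 0, 1],
)

# Sort rows by index label, ascending (the default).
by_index = df.sort_index()

# Sort rows by index label, descending.
by_index_desc = df.sort_index(ascending=False)

print(by_index)
print(by_index_desc)
```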
- Applying functions
You can apply functions to rows, columns, or each element using:
- apply() – for rows/columns of a DataFrame, or for the values of a Series
- map() – for element-wise operations on a Series
- applymap() – for element-wise operations on DataFrames (deprecated since pandas 2.1 in favor of DataFrame.map())
Program: apply()
import pandas as pd
df = pd.DataFrame({ 'A': [1, 2, 3], 'B': [10, 20, 30]})
# Apply sum row-wise (axis=1)
row_sums = df.apply(sum, axis=1)
print("Row sums:\n", row_sums)
# Apply max column-wise (axis=0)
col_max = df.apply(max, axis=0)
print("\nColumn max:\n", col_max)
Output:
Row sums:
0    11
1    22
2    33
dtype: int64
Column max:
A     3
B    30
dtype: int64
Program: map()
import pandas as pd
s = pd.Series([1, 2, 3])
squared = s.map(lambda x: x ** 2)
print("Squared Series:\n", squared)
Output:
Squared Series:
0    1
1    4
2    9
dtype: int64
Program: applymap()
import pandas as pd
df = pd.DataFrame({ 'A': [1, 2], 'B': [3, 4]})
new_df = df.applymap(lambda x: x * 10)
print("Element-wise multiplied DataFrame:\n", new_df)
Output:
Element-wise multiplied DataFrame:
    A   B
0  10  30
1  20  40
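On pandas 2.1 and later, applymap() emits a deprecation warning because it was renamed to DataFrame.map(). A hedged sketch that works on either side of that change is shown below; the hasattr() check is just one possible way to handle it, not an official recipe.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# DataFrame.map exists from pandas 2.1 onward; older versions only have applymap.
elementwise = df.map if hasattr(df, "map") else df.applymap
scaled = elementwise(lambda x: x * 10)

# For simple arithmetic, a vectorized expression gives the same values
# and is usually faster than calling a Python function per element.
vectorized = df * 10

print(scaled)
```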
Summarizing and Computing Descriptive Statistics
1. Basic Summary: describe()
The describe() method gives you summary statistics of your numerical columns by default.
| Statistic | Meaning |
|---|---|
| count | Number of non-null values |
| mean | Average value |
| std | Standard deviation |
| min | Minimum value |
| 25% | 1st quartile (25th percentile) |
| 50% | Median (2nd quartile, 50th percentile) |
| 75% | 3rd quartile (75th percentile) |
| max | Maximum value |
2. Statistics on Specific Columns
| Function | Description | Example |
|---|---|---|
| df.mean() | Mean | df['Age'].mean() |
| df.median() | Median | df['Salary'].median() |
| df.mode() | Mode | df['Age'].mode() |
| df.min() | Minimum | df['Salary'].min() |
| df.max() | Maximum | df['Salary'].max() |
| df.var() | Variance | df['Salary'].var() |
| df.std() | Standard deviation | df['Salary'].std() |
| df.count() | Non-null count | df.count() |
| df.sum() | Total sum | df['Salary'].sum() |
| df.skew() | Skewness | df.skew() |
| df.kurt() | Kurtosis | df.kurt() |
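A few of the functions from the table, applied to the Age and Salary values used earlier in this post. This is only a sketch; the variable names are chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 28],
    "Salary": [50000, 60000, 65000, 70000, 52000],
})

mean_age = df["Age"].mean()              # average age
median_salary = df["Salary"].median()    # middle salary value
salary_range = df["Salary"].max() - df["Salary"].min()
total_salary = df["Salary"].sum()

print("Mean age:", mean_age)
print("Median salary:", median_salary)
print("Salary range:", salary_range)
print("Total salary:", total_salary)
```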
3. groupby()
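As a quick illustration of the split, apply, and combine pattern behind groupby(), the sketch below groups salaries by department. The Department column and its values are hypothetical and do not come from the earlier examples.

```python
import pandas as pd

# Hypothetical data: the Department column is made up for this example.
df = pd.DataFrame({
    "Department": ["HR", "IT", "IT", "HR", "IT"],
    "Salary": [50000, 60000, 65000, 70000, 52000],
})

# Split by Department, then compute the mean Salary of each group.
mean_by_dept = df.groupby("Department")["Salary"].mean()

# Several aggregates at once with agg().
summary = df.groupby("Department")["Salary"].agg(["sum", "count"])

print(mean_by_dept)
print(summary)
```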
The groupby() function splits the data into groups based on some criteria (like a column), and then you can apply an aggregate function (like sum(), mean(), count()).
Data Loading, Storage, and File Formats
Reading and Writing Data in Text Format
1. Writing Data to a Text File
The to_csv() function in Pandas allows you to write a DataFrame to a text file with a chosen delimiter.
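A minimal sketch of writing a tab-separated text file with to_csv(). The sample rows and the Age column are assumptions for illustration, and the file is written into a temporary folder so the example can run anywhere; in your own script you would pass a path such as "data.txt" directly.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data; the values are made up.
df = pd.DataFrame({
    "Name": ["Dev", "Swathi"],
    "Age": [25, 30],
    "City": ["Chennai", "Hyderabad"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    txt_path = Path(tmp_dir) / "data.txt"

    # Write the DataFrame as tab-separated text, without the row index.
    df.to_csv(txt_path, sep="\t", index=False)

    # Read the raw text back just to show what was written.
    written = txt_path.read_text()

print(written)
```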
In the code above:
- The sep="\t" argument sets the delimiter to a tab character. You can also use "," for CSV files, " " for space-separated files, or any other custom delimiter.
- The index=False argument prevents Pandas from writing the default row numbers to the file, resulting in a cleaner output.
The resulting file (data.txt) will contain:
2. Reading Data from a Text File
To read the data back into a DataFrame, use the read_csv() function with the same separator that was used while writing the file.
This reads the contents of the tab-separated file into a DataFrame and prints it.
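Reading a tab-separated file back could look like the sketch below. To keep it self-contained, an in-memory string stands in for the file; with a real file you would pass its path (for example "data.txt") as the first argument instead. The rows are made up.

```python
import io

import pandas as pd

# Stand-in for the contents of a tab-separated file such as data.txt.
tsv_text = "Name\tAge\tCity\nDev\t25\tChennai\nSwathi\t30\tHyderabad\n"

# Use the same separator that was used when the file was written.
df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(df)
```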
3. Writing Only Specific Columns
You can also write only selected columns from your DataFrame to a text file. For example:
This code saves only the "Name" and "City" columns using a space as the separator. The output will look like:
This is useful when you don’t need to save the entire dataset.
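One way to write only selected columns is the columns parameter of to_csv(). In this sketch no file path is given, so to_csv() returns the text instead of writing a file; the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Name": ["Dev", "Swathi"],
    "Age": [25, 30],
    "City": ["Chennai", "Hyderabad"],
})

# Keep only Name and City, separated by a single space.
# With no path argument, to_csv() returns the result as a string.
text = df.to_csv(sep=" ", index=False, columns=["Name", "City"])

print(text)
```

Passing a file name as the first argument (for example "selected.txt", a made-up name) would write the same text to disk.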
4. Reading Large Text Files in Chunks
When working with very large text files, loading the entire file at once may not be memory-efficient. Pandas allows you to read large files in chunks using the chunksize parameter.
This code reads the file in chunks of 2 rows at a time and prints each chunk. This method helps reduce memory usage when dealing with large datasets.
Each chunk is a smaller DataFrame that can be processed independently.
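The chunked reading described above can be sketched as follows. An in-memory string with five made-up rows stands in for a large file, and chunksize=2 matches the chunk size mentioned in the text.

```python
import io

import pandas as pd

# Stand-in for a large file; in practice you would pass a file path.
csv_text = "Name,Age\nDev,25\nSwathi,30\nCharlie,35\nChinnu,40\nSurya,28\n"

chunk_sizes = []
total_age = 0

# With chunksize set, read_csv() yields DataFrames of up to 2 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    chunk_sizes.append(len(chunk))
    total_age += chunk["Age"].sum()  # process each chunk independently

print("Rows per chunk:", chunk_sizes)
print("Total age:", total_age)
```

Only one chunk is held in memory at a time, so running totals like the one above can be computed without loading the whole file.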
5. Exporting Data to Other Text Formats
You can export data to different text-based formats simply by changing the delimiter in the to_csv() function.
Writing to a CSV File (Comma-Separated)
This saves the file in the common CSV format where columns are separated by commas.
Output:
Writing to a TSV File (Tab-Separated)
This saves the file using tabs to separate columns.
Output:
Writing to a File with a Custom Delimiter (e.g., Pipe |)
This will create a file where columns are separated by the pipe (|) character.
Output:
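The three delimiter variants above differ only in the sep argument. A compact sketch with made-up rows follows; the results are returned as strings here, and passing a file name as the first argument would write them to disk instead.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"Name": ["Dev", "Swathi"], "City": ["Chennai", "Hyderabad"]})

csv_text = df.to_csv(index=False)             # comma is the default separator
tsv_text = df.to_csv(sep="\t", index=False)   # tab-separated
pipe_text = df.to_csv(sep="|", index=False)   # pipe-separated

print(csv_text)
print(tsv_text)
print(pipe_text)
```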
6. Reading and Writing JSON Data
In addition to working with CSV and text files, Pandas also provides full support for JSON (JavaScript Object Notation) — a popular format for APIs and web data.
Writing Data to a JSON File
You can easily convert a DataFrame into a JSON file using the to_json() method.
- orient="records": Each row is converted into a separate JSON object (dictionary).
- lines=True: Each JSON object is written on a new line (known as JSON Lines format), which is efficient for large files and commonly used in data pipelines.
Output in data.json:
You can also write the entire file as a single JSON object (array of records):
This will produce:
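Both JSON layouts can be sketched with to_json(). The rows are made up, and the results are returned as strings rather than written to data.json so the example stays self-contained.

```python
import json

import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"Name": ["Dev", "Swathi"], "Age": [25, 30]})

# JSON Lines: one JSON object per line.
json_lines = df.to_json(orient="records", lines=True)

# A single JSON array of records.
json_array = df.to_json(orient="records")

print(json_lines)
print(json_array)

# Both layouts carry the same records.
records_from_lines = [json.loads(line) for line in json_lines.splitlines() if line]
records_from_array = json.loads(json_array)
```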
Reading Data from a JSON File
To read a JSON file into a DataFrame, use pd.read_json():
If the JSON file uses the JSON Lines format, include the lines=True argument:
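A sketch of both reading modes, using in-memory strings in place of files; with a real file you would pass its path (for example "data.json") instead. The records are hypothetical.

```python
import io

import pandas as pd

# Stand-ins for a JSON array file and a JSON Lines file.
array_text = '[{"Name":"Dev","Age":25},{"Name":"Swathi","Age":30}]'
lines_text = '{"Name":"Dev","Age":25}\n{"Name":"Swathi","Age":30}\n'

# A regular JSON array of records.
df_array = pd.read_json(io.StringIO(array_text))

# JSON Lines needs lines=True so each line is parsed as one record.
df_lines = pd.read_json(io.StringIO(lines_text), lines=True)

print(df_array)
print(df_lines)
```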