In this tutorial, we dive into Modin, a powerful drop-in replacement for Pandas that leverages parallel computing to speed up data workflows significantly. By importing modin.pandas as pd, we transform our pandas code into a distributed computation powerhouse. Our goal here is to understand how Modin performs across real-world data operations, such as groupby, joins, cleaning, and time series analysis, all while running on Google Colab. We benchmark each task against the standard Pandas library to see how much faster and more memory-efficient Modin can be.
!pip install "modin[ray]" -q
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import time
import os
from typing import Dict, Any
import modin.pandas as mpd
import ray
ray.init(ignore_reinit_error=True, num_cpus=2)
print(f"Ray initialized with {ray.cluster_resources()}")
We begin by installing Modin with the Ray backend, which enables parallelized pandas operations seamlessly in Google Colab. We suppress unnecessary warnings to keep the output clean and clear. Then, we import all essential libraries and initialize Ray with 2 CPUs, preparing the environment for distributed DataFrame processing.
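As an optional sanity check (a minimal sketch, assuming your installed Modin version exposes Engine and NPartitions in modin.config), we can confirm which execution engine Modin picked up and how many partitions it will split each DataFrame into:

# Optional: inspect Modin's runtime configuration.
# Engine.get() reports the active backend ("Ray" here); NPartitions.get()
# shows how many partitions Modin will use per DataFrame.
from modin.config import Engine, NPartitions

print(f"Modin engine: {Engine.get()}")
print(f"Default partitions: {NPartitions.get()}")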
def benchmark_operation(pandas_func, modin_func, data, operation_name: str) -> Dict[str, Any]:
    """Compare pandas vs modin performance"""
    start_time = time.time()
    pandas_result = pandas_func(data['pandas'])
    pandas_time = time.time() - start_time

    start_time = time.time()
    modin_result = modin_func(data['modin'])
    modin_time = time.time() - start_time

    speedup = pandas_time / modin_time if modin_time > 0 else float('inf')

    print(f"\n{operation_name}:")
    print(f"  Pandas: {pandas_time:.3f}s")
    print(f"  Modin:  {modin_time:.3f}s")
    print(f"  Speedup: {speedup:.2f}x")

    return {
        'operation': operation_name,
        'pandas_time': pandas_time,
        'modin_time': modin_time,
        'speedup': speedup
    }
We define a benchmark_operation function to compare the execution time of a specific task using both pandas and Modin. By running each operation and recording its duration, we calculate the speedup Modin offers. This provides us with a clear and measurable way to evaluate performance gains for each operation we test.
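For instance, the helper can be invoked like this (a minimal smoke test on a small throwaway frame, just to illustrate the call signature; the real benchmarks below use the full dataset):

# Hypothetical smoke test of benchmark_operation on a tiny frame.
tiny = {
    'pandas': pd.DataFrame({'x': np.arange(1000), 'g': np.arange(1000) % 5}),
    'modin':  mpd.DataFrame({'x': np.arange(1000), 'g': np.arange(1000) % 5}),
}
_ = benchmark_operation(
    lambda df: df.groupby('g')['x'].sum(),
    lambda df: df.groupby('g')['x'].sum(),
    tiny,
    "Tiny GroupBy (smoke test)"
)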
def create_large_dataset(rows: int = 1_000_000):
    """Generate synthetic dataset for testing"""
    np.random.seed(42)

    data = {
        'customer_id': np.random.randint(1, 50000, rows),
        'transaction_amount': np.random.exponential(50, rows),
        'category': np.random.choice(['Electronics', 'Clothing', 'Food', 'Books', 'Sports'], rows),
        'region': np.random.choice(['North', 'South', 'East', 'West'], rows),
        'date': pd.date_range('2020-01-01', periods=rows, freq='H'),
        'is_weekend': np.random.choice([True, False], rows, p=[0.3, 0.7]),
        'rating': np.random.uniform(1, 5, rows),
        'quantity': np.random.poisson(3, rows) + 1,
        'discount_rate': np.random.beta(2, 5, rows),
        'age_group': np.random.choice(['18-25', '26-35', '36-45', '46-55', '55+'], rows)
    }

    pandas_df = pd.DataFrame(data)
    modin_df = mpd.DataFrame(data)

    print(f"Dataset created: {rows:,} rows × {len(data)} columns")
    print(f"Memory usage: {pandas_df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

    return {'pandas': pandas_df, 'modin': modin_df}
dataset = create_large_dataset(500_000)
print("n" + "="*60)
print("ADVANCED MODIN OPERATIONS BENCHMARK")
print("="*60)
We define a create_large_dataset function to generate a rich synthetic dataset with 500,000 rows that mimics real-world transactional data, including customer information, purchase patterns, and timestamps. We create both pandas and Modin versions of this dataset so we can benchmark them side by side. After generating the data, we display its dimensions and memory footprint, setting the stage for advanced Modin operations.
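As a quick check before benchmarking (a minimal sketch; both frames were built from the same dictionary, so shapes and column order should match), we can confirm the pandas and Modin copies line up:

# Both frames should report identical shapes and column names.
print(dataset['pandas'].shape, dataset['modin'].shape)
print(list(dataset['pandas'].columns) == list(dataset['modin'].columns))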
def complex_groupby(df):
    return df.groupby(['category', 'region']).agg({
        'transaction_amount': ['sum', 'mean', 'std', 'count'],
        'rating': ['mean', 'min', 'max'],
        'quantity': 'sum'
    }).round(2)

groupby_results = benchmark_operation(
    complex_groupby, complex_groupby, dataset, "Complex GroupBy Aggregation"
)
We define a complex_groupby function to perform multi-level groupby operations on the dataset by grouping it by category and region. We then aggregate multiple columns using functions like sum, mean, standard deviation, and count. Finally, we benchmark this operation on both pandas and Modin to measure how much faster Modin executes such heavy groupby aggregations.
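Because the multi-aggregation above produces a MultiIndex over the columns, a short post-processing step (a minimal sketch, not part of the benchmark itself) can flatten the result into plain column names for easier downstream use:

# Flatten the ('column', 'aggfunc') MultiIndex into names like 'transaction_amount_sum'.
flat = complex_groupby(dataset['pandas'])
flat.columns = ['_'.join(col).strip('_') for col in flat.columns.to_flat_index()]
print(flat.head())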
def advanced_cleaning(df):
    df_clean = df.copy()

    # Remove outliers using the IQR method
    Q1 = df_clean['transaction_amount'].quantile(0.25)
    Q3 = df_clean['transaction_amount'].quantile(0.75)
    IQR = Q3 - Q1
    df_clean = df_clean[
        (df_clean['transaction_amount'] >= Q1 - 1.5 * IQR) &
        (df_clean['transaction_amount'] <= Q3 + 1.5 * IQR)
    ]

    # Feature engineering
    df_clean['transaction_score'] = (
        df_clean['transaction_amount'] * df_clean['rating'] * df_clean['quantity']
    )
    df_clean['is_high_value'] = df_clean['transaction_amount'] > df_clean['transaction_amount'].median()

    return df_clean

cleaning_results = benchmark_operation(
    advanced_cleaning, advanced_cleaning, dataset, "Advanced Data Cleaning"
)
We define the advanced_cleaning function to simulate a real-world data preprocessing pipeline. First, we remove outliers using the IQR method to ensure cleaner insights. Then, we perform feature engineering by creating a new metric called transaction_score and labeling high-value transactions. Finally, we benchmark this cleaning logic using both pandas and Modin to see how they handle complex transformations on large datasets.
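The same IQR filter can also be written more compactly with Series.between (a minimal sketch of an equivalent formulation, shown on the pandas copy and not used in the benchmark above):

# Equivalent outlier filter using Series.between (inclusive on both bounds by default).
amt = dataset['pandas']['transaction_amount']
q1, q3 = amt.quantile(0.25), amt.quantile(0.75)
iqr = q3 - q1
mask = amt.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(f"Rows kept: {mask.sum():,} of {len(mask):,}")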
def time_series_analysis(df):
    df_ts = df.copy()
    df_ts = df_ts.set_index('date')

    # Daily aggregations computed separately, then combined into one frame
    daily_sum = df_ts.groupby(df_ts.index.date)['transaction_amount'].sum()
    daily_mean = df_ts.groupby(df_ts.index.date)['transaction_amount'].mean()
    daily_count = df_ts.groupby(df_ts.index.date)['transaction_amount'].count()
    daily_rating = df_ts.groupby(df_ts.index.date)['rating'].mean()

    # type(df) keeps the result in the same library (pandas or Modin) as the input
    daily_stats = type(df)({
        'transaction_sum': daily_sum,
        'transaction_mean': daily_mean,
        'transaction_count': daily_count,
        'rating_mean': daily_rating
    })

    daily_stats['rolling_mean_7d'] = daily_stats['transaction_sum'].rolling(window=7).mean()

    return daily_stats

ts_results = benchmark_operation(
    time_series_analysis, time_series_analysis, dataset, "Time Series Analysis"
)
We define the time_series_analysis function to explore daily trends by aggregating transaction data over time. We set the date column as the index, compute daily aggregations like sum, mean, count, and average rating, and compile them into a new DataFrame. To capture longer-term patterns, we also add a 7-day rolling average. Finally, we benchmark this time series pipeline with both pandas and Modin to compare their efficiency on temporal data.
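An alternative formulation (a minimal sketch, assuming you prefer pandas' resample API over grouping on index.date; results should match up to the index type) condenses the daily aggregation into a single call:

# Resample-based variant: one pass produces the same daily statistics.
df_ts = dataset['pandas'].set_index('date')
daily = df_ts.resample('D').agg({
    'transaction_amount': ['sum', 'mean', 'count'],
    'rating': 'mean'
})
daily.columns = ['transaction_sum', 'transaction_mean', 'transaction_count', 'rating_mean']
daily['rolling_mean_7d'] = daily['transaction_sum'].rolling(window=7).mean()
print(daily.head())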
def create_lookup_data():
    """Create lookup tables for joins"""
    categories_data = {
        'category': ['Electronics', 'Clothing', 'Food', 'Books', 'Sports'],
        'commission_rate': [0.15, 0.20, 0.10, 0.12, 0.18],
        'target_audience': ['Tech Enthusiasts', 'Fashion Forward', 'Food Lovers', 'Readers', 'Athletes']
    }

    regions_data = {
        'region': ['North', 'South', 'East', 'West'],
        'tax_rate': [0.08, 0.06, 0.09, 0.07],
        'shipping_cost': [5.99, 4.99, 6.99, 5.49]
    }

    return {
        'pandas': {
            'categories': pd.DataFrame(categories_data),
            'regions': pd.DataFrame(regions_data)
        },
        'modin': {
            'categories': mpd.DataFrame(categories_data),
            'regions': mpd.DataFrame(regions_data)
        }
    }

lookup_data = create_lookup_data()
We define the create_lookup_data function to generate two reference tables: one for product categories and another for regions, each containing related metadata such as commission rates, tax rates, and shipping costs. We prepare these lookup tables in both pandas and Modin formats so we can later use them in join operations and benchmark their performance across both libraries.
def advanced_joins(df, lookup):
    result = df.merge(lookup['categories'], on='category', how='left')
    result = result.merge(lookup['regions'], on='region', how='left')

    result['commission_amount'] = result['transaction_amount'] * result['commission_rate']
    result['tax_amount'] = result['transaction_amount'] * result['tax_rate']
    result['total_cost'] = result['transaction_amount'] + result['tax_amount'] + result['shipping_cost']

    return result

join_results = benchmark_operation(
    lambda df: advanced_joins(df, lookup_data['pandas']),
    lambda df: advanced_joins(df, lookup_data['modin']),
    dataset,
    "Advanced Joins & Calculations"
)
We define the advanced_joins function to enrich our main dataset by merging it with category and region lookup tables. After performing the joins, we calculate additional fields, such as commission_amount, tax_amount, and total_cost, to simulate real-world financial calculations. Finally, we benchmark this entire join and computation pipeline using both pandas and Modin to evaluate how well Modin handles complex multi-step operations.
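To guard against accidental fan-out or missing keys in joins like these, pandas' merge accepts a validate argument, and the output can be checked for nulls (a minimal sketch shown on the pandas copy; the same calls work on the Modin frame):

# Each transaction row should match exactly one category row ("many_to_one"),
# and a left join should leave no missing commission rates if the lookup is complete.
checked = dataset['pandas'].merge(
    lookup_data['pandas']['categories'], on='category', how='left', validate='many_to_one'
)
print("Unmatched categories:", checked['commission_rate'].isna().sum())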
print("n" + "="*60)
print("MEMORY EFFICIENCY COMPARISON")
print("="*60)
def get_memory_usage(df, title):
"""Get reminiscence utilization of dataframe"""
if hasattr(df, '_to_pandas'):
memory_mb = df.memory_usage(deep=True).sum() / 1024**2
else:
memory_mb = df.memory_usage(deep=True).sum() / 1024**2
print(f"{title} reminiscence utilization: {memory_mb:.1f} MB")
return memory_mb
pandas_memory = get_memory_usage(dataset['pandas'], "Pandas")
modin_memory = get_memory_usage(dataset['modin'], "Modin")
We now shift focus to memory usage and print a section header to highlight this comparison. In the get_memory_usage function, we calculate the memory footprint of both Pandas and Modin DataFrames using their memory_usage methods. We check for the _to_pandas attribute to distinguish Modin DataFrames from pandas ones. This helps us assess how efficiently Modin handles memory compared to pandas, especially with large datasets.
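When a step is only needed in plain pandas, for example to hand the data to a library that does not understand Modin frames, the Modin DataFrame can be materialized back into pandas. A minimal sketch, assuming the private _to_pandas method that the hasattr check above relies on is available in your Modin version:

# Convert the Modin frame back to a plain pandas DataFrame when needed.
plain_df = dataset['modin']._to_pandas()
print(type(plain_df))  # expected: <class 'pandas.core.frame.DataFrame'>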
print("n" + "="*60)
print("PERFORMANCE SUMMARY")
print("="*60)
outcomes = [groupby_results, cleaning_results, ts_results, join_results]
avg_speedup = sum(r['speedup'] for r in outcomes) / len(outcomes)
print(f"nAverage Speedup: {avg_speedup:.2f}x")
print(f"Finest Operation: {max(outcomes, key=lambda x: x['speedup'])['operation']} "
f"({max(outcomes, key=lambda x: x['speedup'])['speedup']:.2f}x)")
print("nDetailed Outcomes:")
for lead to outcomes:
print(f" {end result['operation']}: {end result['speedup']:.2f}x speedup")
print("n" + "="*60)
print("MODIN BEST PRACTICES")
print("="*60)
best_practices = [
"1. Use 'import modin.pandas as pd' to replace pandas completely",
"2. Modin works best with operations on large datasets (>100MB)",
"3. Ray backend is most stable; Dask for distributed clusters",
"4. Some pandas functions may fall back to pandas automatically",
"5. Use .to_pandas() to convert Modin DataFrame to pandas when needed",
"6. Profile your specific workload - speedup varies by operation type",
"7. Modin excels at: groupby, join, apply, and large data I/O operations"
]
for tip in best_practices:
print(tip)
ray.shutdown()
print("n✅ Tutorial accomplished efficiently!")
print("🚀 Modin is now able to scale your pandas workflows!")
We conclude our tutorial by summarizing the performance benchmarks across all tested operations, calculating the average speedup that Modin achieved over pandas. We also highlight the best-performing operation, providing a clear view of where Modin excels most. Then, we share a set of best practices for using Modin effectively, including tips on compatibility, performance profiling, and conversion between pandas and Modin. Finally, we shut down Ray.
In conclusion, we have seen firsthand how Modin can supercharge our pandas workflows with minimal changes to our code. Whether it is complex aggregations, time series analysis, or memory-intensive joins, Modin delivers scalable performance for everyday tasks, particularly on platforms like Google Colab. With the power of Ray under the hood and near-complete pandas API compatibility, Modin makes it easy to work with larger datasets.