Python Performance Optimization Case Study | 100× Faster Data Processing
Key takeaways
Python data processing speedup: profiling, NumPy, Cython, and multiprocessing in practice.
Introduction
“Python is too slow for production” is a common myth. With the right techniques, it can be fast enough. Here we cut a batch job from about 10 hours to minutes—roughly 100× in wall time for our workload.
What you will learn
- Using cProfile in anger
- Replacing hot loops with NumPy
- When Cython helps
- Multiprocessing caveats
Table of contents
- Problem: painfully slow processing
- Baseline
- Profiling with cProfile
- Bottleneck 1: nested loops
- Optimization 1: NumPy
- Bottleneck 2: string building
- Optimization 2: Cython (optional)
- Optimization 3: multiprocessing
- Final results
- Closing thoughts
1. Problem
Sketch
```python
import csv

def process_file(filename):
    with open(filename) as f:
        reader = csv.DictReader(f)
        results = []
        for row in reader:
            result = calculate(row)
            results.append(result)
        return results

def calculate(row):
    total = 0
    for i in range(1000):
        for j in range(100):
            total += float(row['value']) * i * j
    return total
```

```shell
$ time python process_data.py input.csv
real    10h 23m 45s
```
2. Baseline
```python
import time

start = time.time()
results = process_file('sample_1000.csv')
elapsed = time.time() - start
print(f"1000 rows: {elapsed:.2f}s")

# Extrapolate linearly from 1,000 rows to the full 1M-row dataset
estimated_hours = (elapsed * 1_000_000 / 1000) / 3600
print(f"Estimated for 1M rows: {estimated_hours:.1f} hours")
```
3. Profiling
```shell
$ python -m cProfile -o profile.stats process_data.py sample_1000.csv
```

```python
import pstats

stats = pstats.Stats('profile.stats')
stats.sort_stats('cumulative')
stats.print_stats(10)
```
Finding: calculate dominates (~94% in the example profile).
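The profiler can also be driven from inside a script, which is handy for quick iteration. A minimal sketch, where `slow` is a stand-in for a hot function like `calculate`:

```python
import cProfile
import io
import pstats

def slow(n):
    # Stand-in for a hot function such as calculate()
    total = 0
    for i in range(n):
        total += i * i
    return total

pr = cProfile.Profile()
pr.enable()
slow(100_000)
pr.disable()

buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats('cumulative').print_stats(5)
print(buf.getvalue())  # top 5 entries; 'slow' should dominate
```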
4. Bottleneck 1
Nested loops × every row → huge operation counts.
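The scale is easy to quantify with back-of-the-envelope arithmetic, assuming the 1M-row workload from the intro:

```python
# Inner multiply-adds per row: the two nested loops from calculate()
per_row = 1000 * 100            # 100,000 operations per row
rows = 1_000_000                # assumed dataset size
total_ops = per_row * rows
print(f"{total_ops:,} interpreted operations")  # 100,000,000,000
```

A hundred billion interpreter-level operations is why the job takes hours, not because Python itself is slow at any single one of them.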
5. Optimization 1: NumPy
```python
import numpy as np
import pandas as pd

def process_file_numpy(filename):
    df = pd.read_csv(filename)
    values = df['value'].values

    # The nested loops multiply every value by the same constant,
    # sum(i * j) over all i and j -- so compute it once.
    i_range = np.arange(1000)
    j_range = np.arange(100)
    ij_product = np.outer(i_range, j_range).sum()

    results = values * ij_product
    return results
```
Why faster: vectorized C loops, contiguous memory, less Python overhead.
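Vectorization only helps if it computes the same answer; a quick sanity check against the original loop, assuming the loop bounds above:

```python
import numpy as np

def calculate_loop(value):
    # Original per-row nested loops, for comparison
    total = 0.0
    for i in range(1000):
        for j in range(100):
            total += value * i * j
    return total

# The constant the loops effectively compute: sum(i) * sum(j)
ij_product = np.outer(np.arange(1000), np.arange(100)).sum()

values = np.array([1.0, 2.5, -3.0])
vectorized = values * ij_product
looped = np.array([calculate_loop(v) for v in values])
assert np.allclose(vectorized, looped)
```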
6. Bottleneck 2: string concat
Avoid building output with repeated concatenation:

```python
output = ""
for r in results:
    output += f"{r}\n"  # may reallocate and copy the growing string each pass
```

Prefer a single join:

```python
return "\n".join(str(r) for r in results)
```
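A minimal check that the two approaches produce the same text, modulo the trailing newline:

```python
results = [1.5, 2.0, 3.25]

# Repeated concatenation (the slow pattern)
concat = ""
for r in results:
    concat += f"{r}\n"

# Single linear join (the fast pattern)
joined = "\n".join(str(r) for r in results)

assert concat == joined + "\n"  # join leaves off the trailing newline
```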
7. Optimization 2: Cython
```python
# calculate.pyx
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def calculate_cython(double value):
    cdef long i, j
    cdef double total = 0.0
    for i in range(1000):
        for j in range(100):
            total += value * i * j
    return total
```
Build with cythonize, import from Python—large speedup on pure numeric hot loops.
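A minimal build-script sketch, assuming Cython and setuptools are installed and the file is named `calculate.pyx` as above:

```python
# setup.py -- minimal sketch; build with: python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("calculate.pyx"))
```

After building, `from calculate import calculate_cython` works like any other import.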
8. Optimization 3: multiprocessing
```python
from multiprocessing import Pool

import numpy as np
import pandas as pd

# Each worker applies the vectorized formula to its slice of the data.
IJ_PRODUCT = np.outer(np.arange(1000), np.arange(100)).sum()

def process_chunk(chunk):
    return chunk * IJ_PRODUCT

def process_file_parallel(filename, num_workers=4):
    df = pd.read_csv(filename)
    values = df['value'].values
    chunks = np.array_split(values, num_workers)
    with Pool(num_workers) as pool:
        results = pool.map(process_chunk, chunks)
    return np.concatenate(results)
```
Note: overhead can hurt small inputs; measure.
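One detail worth knowing: `np.array_split`, unlike `np.split`, tolerates chunk counts that don't divide the data evenly, so the split step above is safe for any row count. A small illustration:

```python
import numpy as np

values = np.arange(10)
chunks = np.array_split(values, 4)   # 10 rows into 4 uneven chunks
print([c.tolist() for c in chunks])  # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```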
9. Results
| Stage | Approach | Time (illustrative) |
|---|---|---|
| 0 | Pure Python | ~10h for 1M rows |
| 1 | NumPy vectorized | minutes |
| 2 | String join fixes | small extra win |
| 3 | Multiprocessing | may help or hurt |
Your exact factors depend on data size, cores, and I/O.
10. Extras
- PyPy for pure Python (not always faster with NumPy)
- Numba `@jit` for numeric kernels
- `mmap` for huge files
Closing thoughts
- Measure with a profiler
- Reduce complexity before micro-optimizing
- NumPy for numeric hot loops
- Parallelize only when profiling says CPU-bound and worth the overhead
- Cython/Numba for remaining hotspots
Python is not inherently slow—accidentally quadratic Python is.
FAQ
Q1. pandas vs NumPy?
pandas for tables; NumPy for raw arrays—often used together.
Q2. threads vs processes?
CPU-bound: multiprocessing. I/O-bound: threads/async often win.
Q3. Cython vs Numba?
Cython: flexible, build step. Numba: great for NumPy-like numeric code.
Related posts
- Python performance
- NumPy vectorization
- Python multiprocessing
Keywords
Python, performance, profiling, cProfile, NumPy, vectorization, Cython, Numba, multiprocessing, data processing, case study