Python Performance Optimization Case Study | 100× Faster Data Processing

Key takeaways

Python data processing speedup: profiling, NumPy, Cython, and multiprocessing in practice.

Introduction

“Python is too slow for production” is a common myth. With the right techniques, it can be fast enough. Here we cut a batch job from about 10 hours to minutes—roughly 100× in wall time for our workload.

What you will learn

  • Using cProfile in anger
  • Replacing hot loops with NumPy
  • When Cython helps
  • Multiprocessing caveats

Table of contents

  1. Problem: painfully slow processing
  2. Baseline
  3. Profiling with cProfile
  4. Bottleneck 1: nested loops
  5. Optimization 1: NumPy
  6. Bottleneck 2: string building
  7. Optimization 2: Cython (optional)
  8. Optimization 3: multiprocessing
  9. Final results
  10. Extras
  11. Closing thoughts

1. Problem

Sketch

import csv

def process_file(filename):
    with open(filename) as f:
        reader = csv.DictReader(f)
        results = []

        for row in reader:
            result = calculate(row)
            results.append(result)

        return results

def calculate(row):
    # 1000 × 100 = 100,000 interpreted-loop iterations per row,
    # with a float() conversion repeated inside the innermost loop.
    total = 0
    for i in range(1000):
        for j in range(100):
            total += float(row['value']) * i * j
    return total

$ time python process_data.py input.csv

real    10h 23m 45s

2. Baseline

import time

start = time.time()
results = process_file('sample_1000.csv')
elapsed = time.time() - start
print(f"1000 rows: {elapsed:.2f}s")

# Linear extrapolation from 1,000 rows to the full 1M-row input
estimated_hours = (elapsed * 1_000_000 / 1000) / 3600
print(f"Estimated for 1M rows: {estimated_hours:.1f} hours")

3. Profiling

$ python -m cProfile -o profile.stats process_data.py sample_1000.csv

import pstats
stats = pstats.Stats('profile.stats')
stats.sort_stats('cumulative')
stats.print_stats(10)

Finding: calculate dominates (~94% in the example profile).


4. Bottleneck 1

Nested loops × every row → huge operation counts.
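To make the cost concrete, a quick back-of-the-envelope count for the full 1M-row input from the intro:

```python
# Work done by the pure-Python version.
rows = 1_000_000           # full input size from the intro
ops_per_row = 1000 * 100   # nested-loop iterations per row
total_ops = rows * ops_per_row
print(f"{total_ops:,} multiply-adds")  # 100,000,000,000 multiply-adds
```

Roughly 10^11 interpreted operations, each paying Python's per-bytecode overhead, which is why this dominates the profile.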


5. Optimization 1: NumPy

import numpy as np
import pandas as pd

def process_file_numpy(filename):
    df = pd.read_csv(filename)
    values = df['value'].values
    
    i_range = np.arange(1000)
    j_range = np.arange(100)
    ij_product = np.outer(i_range, j_range).sum()
    
    results = values * ij_product
    
    return results

Why faster: vectorized C loops, contiguous memory, less Python overhead.
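The rewrite is also mathematically valid: the double sum factors, since Σᵢ Σⱼ value·i·j = value · (Σ i) · (Σ j). A quick self-check (the test value 3.5 is arbitrary):

```python
import numpy as np

def calculate(value):
    # The original per-row double loop, lifted out for comparison.
    total = 0.0
    for i in range(1000):
        for j in range(100):
            total += value * i * j
    return total

# sum over (i, j) of value*i*j equals value * (sum of i) * (sum of j)
ij_product = np.outer(np.arange(1000), np.arange(100)).sum()
assert np.isclose(calculate(3.5), 3.5 * ij_product)
```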


6. Bottleneck 2: string concat

Avoid:

output = ""
for r in results:
    output += f"{r}\n"

Prefer:

return "\n".join(str(r) for r in results)
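To see the gap on your own machine, a rough timeit sketch (sizes are arbitrary). Note that CPython sometimes optimizes in-place string concatenation, so measure rather than assume:

```python
import timeit

results = list(range(10_000))

def build_concat():
    output = ""
    for r in results:
        output += f"{r}\n"
    return output

def build_join():
    # join plus one trailing newline, to match the concat output exactly
    return "\n".join(str(r) for r in results) + "\n"

assert build_concat() == build_join()  # identical output either way
print("concat:", timeit.timeit(build_concat, number=50))
print("join:  ", timeit.timeit(build_join, number=50))
```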

7. Optimization 2: Cython

# calculate.pyx
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def calculate_cython(double value):
    cdef long i, j
    cdef double total = 0.0
    
    for i in range(1000):
        for j in range(100):
            total += value * i * j
    
    return total

Build with cythonize, import from Python—large speedup on pure numeric hot loops.
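One minimal way to wire up the build (a sketch; assumes the extension above is saved as calculate.pyx and Cython is installed):

```python
# setup.py -- build in place with: python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("calculate.pyx", language_level=3))
```

After building, `from calculate import calculate_cython` works like any other import.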


8. Optimization 3: multiprocessing

from multiprocessing import Pool
import numpy as np
import pandas as pd

def process_chunk(chunk):
    # The vectorized per-chunk work from section 5.
    ij_product = np.outer(np.arange(1000), np.arange(100)).sum()
    return chunk * ij_product

def process_file_parallel(filename, num_workers=4):
    df = pd.read_csv(filename)
    values = df['value'].values

    chunks = np.array_split(values, num_workers)

    with Pool(num_workers) as pool:
        results = pool.map(process_chunk, chunks)

    return np.concatenate(results)

Note: overhead can hurt small inputs; measure.
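One caveat worth spelling out: with the spawn start method (the default on Windows, and on macOS since Python 3.8), worker functions must live at module top level and the entry point needs a __main__ guard. A self-contained sketch, where process_chunk is a trivial stand-in for the real per-chunk work:

```python
from multiprocessing import Pool

import numpy as np

def process_chunk(chunk):
    # Stand-in for the real vectorized per-chunk computation.
    return chunk * 2.0

def run_parallel(values, num_workers=2):
    chunks = np.array_split(values, num_workers)
    with Pool(num_workers) as pool:
        return np.concatenate(pool.map(process_chunk, chunks))

if __name__ == "__main__":
    # Guard is required: spawn re-imports this module in each worker.
    print(run_parallel(np.arange(8, dtype=float)))
```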


9. Results

Stage   Approach            Time (illustrative)
0       Pure Python         ~10h for 1M rows
1       NumPy vectorized    minutes
2       String join fixes   small extra win
3       Multiprocessing     may help or hurt

Your exact factors depend on data size, cores, and I/O.


10. Extras

  • PyPy for pure Python (not always faster with NumPy)
  • Numba @jit for numeric kernels
  • mmap for huge files
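On the mmap point, a minimal stdlib sketch that counts lines without materializing the whole file as Python objects (the demo file here is created just for illustration):

```python
import mmap
import tempfile

def count_lines(path):
    """Count newlines via a memory-mapped view; the OS pages data in on demand."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            count = 0
            pos = mm.find(b"\n")
            while pos != -1:
                count += 1
                pos = mm.find(b"\n", pos + 1)
            return count

# Tiny demo file; with multi-GB inputs this avoids loading everything at once.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"value\n1.0\n2.0\n")
    demo_path = tmp.name

print(count_lines(demo_path))  # 3
```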

Closing thoughts

  1. Measure with a profiler
  2. Reduce complexity before micro-optimizing
  3. NumPy for numeric hot loops
  4. Parallelize only when profiling says CPU-bound and worth the overhead
  5. Cython/Numba for remaining hotspots

Python is not inherently slow—accidentally quadratic Python is.


FAQ

Q1. pandas vs NumPy?

pandas for tables; NumPy for raw arrays—often used together.

Q2. threads vs processes?

CPU-bound: multiprocessing. I/O-bound: threads/async often win.

Q3. Cython vs Numba?

Cython: flexible, build step. Numba: great for NumPy-like numeric code.



Keywords

Python, performance, profiling, cProfile, NumPy, vectorization, Cython, Numba, multiprocessing, data processing, case study