Python Performance Optimization Case Study | 100× Faster Data Processing

Key takeaways

Python data processing speedup: profiling, NumPy, Cython, and multiprocessing in practice.

Introduction

“Python is too slow for production” is a common myth. With the right techniques, it can be fast enough. Here we cut a batch job from about 10 hours to minutes—roughly 100× in wall time for our workload.

What you will learn

  • Using cProfile in anger
  • Replacing hot loops with NumPy
  • When Cython helps
  • Multiprocessing caveats

Table of contents

  1. Problem: painfully slow processing
  2. Baseline
  3. Profiling with cProfile
  4. Bottleneck 1: nested loops
  5. Optimization 1: NumPy
  6. Bottleneck 2: string building
  7. Optimization 2: Cython (optional)
  8. Optimization 3: multiprocessing
  9. Final results
  10. Extras
  11. Closing thoughts

1. Problem

Sketch

import csv

def process_file(filename):
    with open(filename) as f:
        reader = csv.DictReader(f)
        results = []

        for row in reader:
            result = calculate(row)
            results.append(result)

        return results

def calculate(row):
    # 1000 × 100 = 100,000 interpreted-loop iterations per row,
    # with a float() conversion repeated inside the innermost loop.
    total = 0
    for i in range(1000):
        for j in range(100):
            total += float(row['value']) * i * j
    return total

$ time python process_data.py input.csv

real    10h 23m 45s

2. Baseline

import time

start = time.time()
results = process_file('sample_1000.csv')
elapsed = time.time() - start
print(f"1000 rows: {elapsed:.2f}s")

# Linear extrapolation from 1,000 rows to the full 1M-row input
estimated_hours = (elapsed * 1_000_000 / 1000) / 3600
print(f"Estimated for 1M rows: {estimated_hours:.1f} hours")

3. Profiling

$ python -m cProfile -o profile.stats process_data.py sample_1000.csv

import pstats
stats = pstats.Stats('profile.stats')
stats.sort_stats('cumulative')
stats.print_stats(10)

Finding: calculate dominates (~94% in the example profile).


4. Bottleneck 1

Nested loops × every row → huge operation counts.
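To make the cost concrete, a quick back-of-the-envelope count for the full 1M-row input from the intro:

```python
# Work done by the pure-Python version.
rows = 1_000_000           # full input size from the intro
ops_per_row = 1000 * 100   # nested-loop iterations per row
total_ops = rows * ops_per_row
print(f"{total_ops:,} multiply-adds")  # 100,000,000,000 multiply-adds
```

Roughly 10^11 interpreted operations, each paying Python's per-bytecode overhead, which is why this dominates the profile.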


5. Optimization 1: NumPy

import numpy as np
import pandas as pd

def process_file_numpy(filename):
    df = pd.read_csv(filename)
    values = df['value'].values
    
    i_range = np.arange(1000)
    j_range = np.arange(100)
    ij_product = np.outer(i_range, j_range).sum()
    
    results = values * ij_product
    
    return results

Why faster: vectorized C loops, contiguous memory, less Python overhead.
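The rewrite is also mathematically valid: the double sum factors, since Σᵢ Σⱼ value·i·j = value · (Σ i) · (Σ j). A quick self-check (the test value 3.5 is arbitrary):

```python
import numpy as np

def calculate(value):
    # The original per-row double loop, lifted out for comparison.
    total = 0.0
    for i in range(1000):
        for j in range(100):
            total += value * i * j
    return total

# sum over (i, j) of value*i*j equals value * (sum of i) * (sum of j)
ij_product = np.outer(np.arange(1000), np.arange(100)).sum()
assert np.isclose(calculate(3.5), 3.5 * ij_product)
```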


6. Bottleneck 2: string concat

Avoid:

output = ""
for r in results:
    output += f"{r}\n"

Prefer:

return "\n".join(str(r) for r in results)
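To see the gap on your own machine, a rough timeit sketch (sizes are arbitrary). Note that CPython sometimes optimizes in-place string concatenation, so measure rather than assume:

```python
import timeit

results = list(range(10_000))

def build_concat():
    output = ""
    for r in results:
        output += f"{r}\n"
    return output

def build_join():
    # join plus one trailing newline, to match the concat output exactly
    return "\n".join(str(r) for r in results) + "\n"

assert build_concat() == build_join()  # identical output either way
print("concat:", timeit.timeit(build_concat, number=50))
print("join:  ", timeit.timeit(build_join, number=50))
```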

7. Optimization 2: Cython

# calculate.pyx
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def calculate_cython(double value):
    cdef long i, j
    cdef double total = 0.0
    
    for i in range(1000):
        for j in range(100):
            total += value * i * j
    
    return total

Build with cythonize, import from Python—large speedup on pure numeric hot loops.
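One minimal way to wire up the build (a sketch; assumes the extension above is saved as calculate.pyx and Cython is installed):

```python
# setup.py -- build in place with: python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("calculate.pyx", language_level=3))
```

After building, `from calculate import calculate_cython` works like any other import.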


8. Optimization 3: multiprocessing

from multiprocessing import Pool
import numpy as np
import pandas as pd

def process_chunk(chunk):
    # The vectorized per-chunk work from section 5.
    ij_product = np.outer(np.arange(1000), np.arange(100)).sum()
    return chunk * ij_product

def process_file_parallel(filename, num_workers=4):
    df = pd.read_csv(filename)
    values = df['value'].values

    chunks = np.array_split(values, num_workers)

    with Pool(num_workers) as pool:
        results = pool.map(process_chunk, chunks)

    return np.concatenate(results)

Note: overhead can hurt small inputs; measure.
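One caveat worth spelling out: with the spawn start method (the default on Windows, and on macOS since Python 3.8), worker functions must live at module top level and the entry point needs a __main__ guard. A self-contained sketch, where process_chunk is a trivial stand-in for the real per-chunk work:

```python
from multiprocessing import Pool

import numpy as np

def process_chunk(chunk):
    # Stand-in for the real vectorized per-chunk computation.
    return chunk * 2.0

def run_parallel(values, num_workers=2):
    chunks = np.array_split(values, num_workers)
    with Pool(num_workers) as pool:
        return np.concatenate(pool.map(process_chunk, chunks))

if __name__ == "__main__":
    # Guard is required: spawn re-imports this module in each worker.
    print(run_parallel(np.arange(8, dtype=float)))
```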


9. Results

Stage   Approach            Time (illustrative)
0       Pure Python         ~10h for 1M rows
1       NumPy vectorized    minutes
2       String join fixes   small extra win
3       Multiprocessing     may help or hurt

Your exact factors depend on data size, cores, and I/O.


10. Extras

  • PyPy for pure Python (not always faster with NumPy)
  • Numba @jit for numeric kernels
  • mmap for huge files
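On the mmap point, a minimal stdlib sketch that counts lines without materializing the whole file as Python objects (the demo file here is created just for illustration):

```python
import mmap
import tempfile

def count_lines(path):
    """Count newlines via a memory-mapped view; the OS pages data in on demand."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            count = 0
            pos = mm.find(b"\n")
            while pos != -1:
                count += 1
                pos = mm.find(b"\n", pos + 1)
            return count

# Tiny demo file; with multi-GB inputs this avoids loading everything at once.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"value\n1.0\n2.0\n")
    demo_path = tmp.name

print(count_lines(demo_path))  # 3
```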

Closing thoughts

  1. Measure with a profiler
  2. Reduce complexity before micro-optimizing
  3. NumPy for numeric hot loops
  4. Parallelize only when profiling says CPU-bound and worth the overhead
  5. Cython/Numba for remaining hotspots

Python is not inherently slow—accidentally quadratic Python is.


FAQ

Q1. pandas vs NumPy?

pandas for tables; NumPy for raw arrays—often used together.

Q2. threads vs processes?

CPU-bound: multiprocessing. I/O-bound: threads/async often win.

Q3. Cython vs Numba?

Cython: flexible, build step. Numba: great for NumPy-like numeric code.



Keywords

Python, performance, profiling, cProfile, NumPy, vectorization, Cython, Numba, multiprocessing, data processing, case study