Python prioritizes developer productivity over raw speed. When performance matters, measure first, then optimize the right bottlenecks.

Rule #1: Profile Before Optimizing

Never guess where your code is slow. Use profilers:

cProfile — Function-Level Profiling

  import cProfile
import pstats

def slow_function():
    total = 0
    for i in range(1_000_000):
        total += i ** 2
    return total

cProfile.run('slow_function()', 'profile.stats')

stats = pstats.Stats('profile.stats')
stats.sort_stats('cumulative').print_stats(10)
  
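
To profile a single call from inside a program rather than a whole script, one option is to drive `cProfile.Profile` directly and capture the report as text. The sketch below is illustrative only: `profile_call` is a hypothetical helper name, and the loop is shortened so the example runs quickly.

```python
import cProfile
import io
import pstats


def slow_function():
    total = 0
    for i in range(100_000):
        total += i ** 2
    return total


def profile_call(func, top=5):
    """Run func under cProfile and return (result, report text).

    Hypothetical helper for illustration; not part of the standard library.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func()
    finally:
        profiler.disable()
    # Write the report into a string instead of printing to stdout.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.strip_dirs().sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


result, report = profile_call(slow_function)
print(report)
```

Capturing the report as a string makes it easy to log or store alongside other diagnostics.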

Run from CLI:

  python -m cProfile -s cumulative your_script.py
  

timeit — Micro-Benchmarks

  import timeit

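# Wrap both in sum() so the generator is actually consumed;
# timing a bare generator expression only measures creating the object.
time_list = timeit.timeit(
    "sum([x**2 for x in range(1000)])",
    number=10000
)
time_gen = timeit.timeit(
    "sum(x**2 for x in range(1000))",
    number=10000
)
print(f"List comp: {time_list:.4f}s, Generator: {time_gen:.4f}s")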
  
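
Single timings are noisy, so a common pattern is to repeat the measurement and keep the fastest run. The sketch below uses `timeit.repeat` with callables; `best_time`, `sum_list`, and `sum_generator` are hypothetical names chosen for this example.

```python
import timeit


def sum_list(n=1000):
    # Builds the full list first, then sums it.
    return sum([x ** 2 for x in range(n)])


def sum_generator(n=1000):
    # Feeds values to sum() one at a time.
    return sum(x ** 2 for x in range(n))


def best_time(func, repeat=5, number=200):
    """Return the fastest per-call time in seconds across several runs."""
    runs = timeit.repeat(func, repeat=repeat, number=number)
    return min(runs) / number


list_time = best_time(sum_list)
gen_time = best_time(sum_generator)
print(f"List: {list_time * 1e6:.1f} us, generator: {gen_time * 1e6:.1f} us")
```

Taking the minimum rather than the mean filters out runs that were slowed by other activity on the machine.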

Algorithmic Optimization

The biggest wins come from better algorithms, not faster loops:

  # O(n²) — slow for large inputs
def has_duplicate_slow(items):
    for i, a in enumerate(items):
        for b in items[i+1:]:
            if a == b:
                return True
    return False

# O(n) — use a set
def has_duplicate_fast(items):
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False
  
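
The set-based version needs hashable items. For items that are orderable but not hashable (lists, for example), sorting first gives an O(n log n) check that is still far better than the nested loop. This is a minimal sketch; `has_duplicate_sorted` is a hypothetical name, and it assumes all items can be compared with each other.

```python
def has_duplicate_sorted(items):
    """O(n log n) duplicate check for orderable items that may be unhashable."""
    ordered = sorted(items)
    # After sorting, equal items sit next to each other.
    return any(a == b for a, b in zip(ordered, ordered[1:]))


print(has_duplicate_sorted([[1, 2], [3, 4], [1, 2]]))  # True
print(has_duplicate_sorted([[1, 2], [3, 4]]))          # False
```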

Built-in Optimizations

Use Built-in Functions and Libraries

In CPython, built-ins are implemented in C and are usually much faster than an equivalent hand-written loop:

  # Slow
total = 0
for x in data:
    total += x

# Fast
total = sum(data)
  

NumPy and pandas run their core operations in compiled code, and itertools is implemented in C in CPython. Use them for numerical and iteration-heavy work.
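
As a small illustration using only the standard library, the sketch below replaces several manual loops with built-ins, `itertools.chain`, and `collections.Counter`. The `rows` data is made up for the example.

```python
from collections import Counter
from itertools import chain

rows = [[3, 1, 4], [1, 5], [9, 2, 6]]

# Flatten nested lists without writing a nested loop.
flat = list(chain.from_iterable(rows))

# Aggregate with built-ins instead of manual accumulator loops.
total = sum(flat)
largest = max(flat)
has_big_value = any(x > 8 for x in flat)

# Count occurrences in a single pass.
counts = Counter(flat)
most_common_value, occurrences = counts.most_common(1)[0]

print(flat)                            # [3, 1, 4, 1, 5, 9, 2, 6]
print(total, largest, has_big_value)   # 31 9 True
print(most_common_value, occurrences)  # 1 2
```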

List Comprehensions vs Loops

List comprehensions are generally faster than equivalent for loops, mainly because they avoid looking up and calling append on every iteration:

  # Prefer
squares = [x**2 for x in range(1000)]

# Over
squares = []
for x in range(1000):
    squares.append(x**2)
  

Generators for Large Data

Generators yield one item at a time, so memory use stays roughly constant instead of growing with an entire list:

  def read_large_file(path):
    with open(path) as f:
        for line in f:
            yield line.strip()
  
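
Generators also compose into lazy pipelines, where each stage pulls one item at a time from the previous one. The sketch below uses an in-memory list of lines as a stand-in for a file; `numbered_lines` and `non_empty` are hypothetical helper names.

```python
import sys


def numbered_lines(lines):
    """Yield (line_number, stripped_line) pairs lazily."""
    for number, line in enumerate(lines, start=1):
        yield number, line.strip()


def non_empty(pairs):
    """Skip pairs whose line is blank."""
    for number, line in pairs:
        if line:
            yield number, line


# Any iterable of lines works here, including an open file object.
sample = ["alpha\n", "\n", "beta\n", "   \n", "gamma\n"]
result = list(non_empty(numbered_lines(sample)))
print(result)  # [(1, 'alpha'), (3, 'beta'), (5, 'gamma')]

# A generator object stays small no matter how many items it will yield.
as_list = [x * 2 for x in range(100_000)]
as_gen = (x * 2 for x in range(100_000))
print(sys.getsizeof(as_list), sys.getsizeof(as_gen))
```

Note that `sys.getsizeof` reports only the container itself, not the objects it references, so the real gap for the list is larger than the printed numbers suggest.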

Caching with functools.lru_cache

Memoize expensive pure functions:

  from functools import lru_cache

@lru_cache(maxsize=None)
def fibonacci(n):
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

print(fibonacci(100))  # instant
  
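
With `maxsize=None` the cache grows without limit, so for long-running programs a bounded cache may be the safer choice. The sketch below shows a bounded cache together with `cache_info()` and `cache_clear()`; `slow_square` is a hypothetical stand-in for an expensive pure function.

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def slow_square(n):
    # Stand-in for an expensive pure computation (hypothetical example).
    return n * n


for value in [2, 3, 2, 2, 3]:
    slow_square(value)

info = slow_square.cache_info()
print(info)  # CacheInfo(hits=3, misses=2, maxsize=128, currsize=2)

# Drop all cached results, for example after underlying data changes.
slow_square.cache_clear()
print(slow_square.cache_info().currsize)  # 0
```

Arguments must be hashable, and caching is only safe when the function returns the same result for the same inputs.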

__slots__ for Memory

Declaring __slots__ removes the per-instance __dict__, which reduces memory per instance when creating millions of objects:

  class Point:
    __slots__ = ('x', 'y')

    def __init__(self, x, y):
        self.x = x
        self.y = y
  
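
The trade-off is that slotted instances cannot gain new attributes at runtime. The sketch below contrasts a regular class with a slotted one and uses `tracemalloc` for a rough memory comparison; `PlainPoint`, `SlottedPoint`, and `allocated_bytes` are hypothetical names, and the exact byte counts depend on the Python version.

```python
import tracemalloc


class PlainPoint:
    def __init__(self, x, y):
        self.x = x
        self.y = y


class SlottedPoint:
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


def allocated_bytes(cls, count=50_000):
    """Return bytes still allocated after creating count instances of cls."""
    tracemalloc.start()
    points = [cls(i, i) for i in range(count)]
    current, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current


plain = PlainPoint(1, 2)
slotted = SlottedPoint(1, 2)

print(hasattr(plain, "__dict__"))    # True
print(hasattr(slotted, "__dict__"))  # False

plain.z = 3  # allowed: stored in the instance __dict__
try:
    slotted.z = 3  # rejected: 'z' is not listed in __slots__
except AttributeError as exc:
    print(f"AttributeError: {exc}")

plain_bytes = allocated_bytes(PlainPoint)
slotted_bytes = allocated_bytes(SlottedPoint)
print(f"plain: {plain_bytes:,} bytes, slotted: {slotted_bytes:,} bytes")
```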

When to Reach for C/Rust Extensions

If profiling shows a specific hot loop that can’t be vectorized, consider one of these:

  • Cython: compile Python-like code to C
  • PyO3 / maturin: write Rust extensions
  • Numba: JIT-compile numerical functions, for example:

  from numba import jit

@jit(nopython=True)
def fast_sum(arr):
    total = 0.0
    for x in arr:
        total += x
    return total
  

Optimization Checklist

  1. Measure with cProfile or py-spy
  2. Fix algorithms — O(n²) → O(n log n) beats micro-optimizations
  3. Use the right data structure — set/dict for lookups, deque for queues
  4. Leverage libraries — NumPy, pandas, orjson
  5. Cache repeated pure computations
  6. Parallelize CPU work with multiprocessing
  7. Only then consider C extensions
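
To illustrate item 3, here is a minimal sketch of a breadth-first traversal that uses a `set` for membership checks and a `deque` as the queue. The `graph` data and the `bfs_order` name are made up for this example. Popping from the left of a list with `pop(0)` shifts every remaining element, while `deque.popleft()` runs in constant time.

```python
from collections import deque

# Small hypothetical graph stored as an adjacency mapping.
graph = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
}


def bfs_order(graph, start):
    """Return nodes in breadth-first order starting from start."""
    visited = {start}        # set: O(1) average membership checks
    queue = deque([start])   # deque: O(1) pops from the left
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order


print(bfs_order(graph, "a"))  # ['a', 'b', 'c', 'd']
```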

Premature optimization wastes time. Profile-driven optimization delivers real results.