A few years ago I used weave to speed up some parts of my code. Back then I was annoyed by its rather sparse documentation. Things seem to have changed since, and Cython (wiki) now looks like a good candidate.

Cython lets you write C-like, statically typed code alongside ordinary Python and compiles it to a C extension module. In some cases, this more fine-grained control over how the program is executed can give you a significant speed increase. Compared with pure Python loops, speed-ups of several hundred times are possible (roughly 600x in the benchmark below); compared with vectorised NumPy code the gain is far more modest. See for example this (dated) overview. With Cython you can get near-C speeds while retaining most of the flexibility of Python code, which you can see below.

I stumbled across a nice post about speed optimisation with Cython. I encourage you to read the post yourself, but here are the implementations and the impressive results for pure Python, NumPy and Cython:

Pure Python

from numpy import zeros

dx = 0.1
dy = 0.1
dx2 = dx*dx
dy2 = dy*dy

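def py_update(u):
    # One in-place sweep over the interior points of the grid; the
    # boundary rows and columns are left untouched.
    nx, ny = u.shape
    for i in range(1, nx-1):
        for j in range(1, ny-1):
            u[i,j] = ((u[i+1, j] + u[i-1, j]) * dy2 +
                      (u[i, j+1] + u[i, j-1]) * dx2) / (2*(dx2+dy2))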

def calc(N, Niter=100, func=py_update, args=()):
    u = zeros([N, N])
    u[0] = 1
    for i in range(Niter):
        func(u,*args)
    return u

NumPy implementation

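def num_update(u):
    # The right-hand side is evaluated in full before the assignment, so every
    # point is computed from the values of the previous sweep.
    u[1:-1,1:-1] = ((u[2:,1:-1]+u[:-2,1:-1])*dy2 +
                    (u[1:-1,2:] + u[1:-1,:-2])*dx2) / (2*(dx2+dy2))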
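One thing worth knowing when comparing the two versions: they are not step-for-step identical. The Python loop reuses values it has already updated in the same sweep, while the NumPy expression works from the old array only. The sketch below repeats both functions from the post so it runs on its own; the helper `make_grid`, the 8x8 grid and the 500 sweeps are my own choices for illustration. It shows that the results differ after one sweep but settle on the same solution.

```python
import numpy as np

dx = dy = 0.1
dx2, dy2 = dx * dx, dy * dy

def py_update(u):
    # In-place loop: points updated earlier in the sweep are reused immediately.
    nx, ny = u.shape
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            u[i, j] = ((u[i + 1, j] + u[i - 1, j]) * dy2 +
                       (u[i, j + 1] + u[i, j - 1]) * dx2) / (2 * (dx2 + dy2))

def num_update(u):
    # Whole-array expression: the right-hand side only sees the old values.
    u[1:-1, 1:-1] = ((u[2:, 1:-1] + u[:-2, 1:-1]) * dy2 +
                     (u[1:-1, 2:] + u[1:-1, :-2]) * dx2) / (2 * (dx2 + dy2))

def make_grid(n):
    # Hypothetical helper: same initial condition as calc() in the post.
    u = np.zeros((n, n))
    u[0] = 1
    return u

a, b = make_grid(8), make_grid(8)
py_update(a)
num_update(b)
same_after_one_sweep = np.allclose(a, b)

# Keep iterating: both schemes approach the same steady state.
for _ in range(500):
    py_update(a)
    num_update(b)
same_after_many_sweeps = np.allclose(a, b, atol=1e-8)

print("identical after one sweep:  ", same_after_one_sweep)
print("identical after many sweeps:", same_after_many_sweeps)
```

So for a fixed number of iterations the timings compare slightly different computations, which is fine for a speed benchmark but matters if you compare the arrays themselves.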

Cython code

This version is called 'Faster Cython' in the linked blog post:

#cython: boundscheck=False
#cython: wraparound=False
cimport numpy as np

def cy_update(np.ndarray[double, ndim=2] u, double dx2, double dy2):
    cdef unsigned int i, j
    for i in range(1, u.shape[0]-1):
        for j in range(1, u.shape[1]-1):
            u[i,j] = ((u[i+1, j] + u[i-1, j]) * dy2 +
                      (u[i, j+1] + u[i, j-1]) * dx2) / (2*(dx2+dy2))

which, saved as _laplace.pyx, is compiled and imported into the Python program with

import pyximport
import numpy as np
pyximport.install(setup_args={'include_dirs':[np.get_include()]})
from _laplace import cy_update as cy_update2

Performance results

Method         Time (sec)  Time relative to NumPy
Pure Python    560         250
NumPy          2.24        1
Cython         1.28        0.57
Weave          1.02        0.45
Faster Cython  0.94        0.42

(sorry for the crappy 'table' but it's impossible to search anything on dotclear formatting because all the documentation is in French. Which of course is the best language in the world, times a thousand. In fact, Chuck Norris probably spoke French...</rant>)

So NumPy is already quite fast (about 250 times faster than the pure Python loop here), but Cython squeezes a further factor of roughly 2.4 out of your CPU (0.94 s versus 2.24 s), while you still write Python-like code and don't have to manage the memory of the NumPy arrays yourself.
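If you want to check the Python-versus-NumPy part of the table on your own machine, something like the following should do. It is only a rough sketch: `time_update` is a helper I made up for this purpose, and the grid size and iteration count are deliberately small so that the pure Python version finishes quickly. They are not the settings used for the numbers above, so expect different absolute times and ratios.

```python
import time
import numpy as np

dx2 = dy2 = 0.1 * 0.1

def py_update(u):
    nx, ny = u.shape
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            u[i, j] = ((u[i + 1, j] + u[i - 1, j]) * dy2 +
                       (u[i, j + 1] + u[i, j - 1]) * dx2) / (2 * (dx2 + dy2))

def num_update(u):
    u[1:-1, 1:-1] = ((u[2:, 1:-1] + u[:-2, 1:-1]) * dy2 +
                     (u[1:-1, 2:] + u[1:-1, :-2]) * dx2) / (2 * (dx2 + dy2))

def calc(N, Niter=100, func=py_update, args=()):
    u = np.zeros([N, N])
    u[0] = 1
    for _ in range(Niter):
        func(u, *args)
    return u

def time_update(func, N=50, Niter=20):
    # Hypothetical helper: wall-clock time of one calc() run, in seconds.
    start = time.perf_counter()
    calc(N, Niter=Niter, func=func)
    return time.perf_counter() - start

timings = {
    "Pure Python": time_update(py_update),
    "NumPy": time_update(num_update),
}

# Report times relative to NumPy, as in the table.
baseline = timings["NumPy"]
for name, seconds in timings.items():
    print(f"{name:<12} {seconds:10.4f} s  {seconds / baseline:8.1f}")
```

A single run like this is noisy; for anything more serious, repeat the measurement a few times and take the best result.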

References