Profiling and optimizing your Python code | Toucan Toco


Profiling and optimizing your Python code

This article assumes you are already familiar with Python.

As soon as your code stops being trivial and runs on real-world data volumes, you will run into performance issues. You’ve heard that Python is slow, so it must be the culprit!

Not so fast!

Sure, Python is not compiled to optimized native code (yet) and therefore won’t rival C, Fortran or assembly in tightly nested loops. But most Python performance issues can be alleviated by improving the algorithm or using the right tool for the job. Read on to learn how.


Finding out why your program is slow by looking at its source code is a waste of time. It’s extremely hard to guess where a complex program is spending its time by just looking at the code.

You are the profiler

Even trivial code like the example below might be a puzzle:

"""Sorting a large, randomly generated string and writing it to disk"""
import random

def write_sorted_letters(nb_letters=10**7):
    random_string = ''
    for i in range(nb_letters):
        random_string += random.choice('abcdefghijklmnopqrstuvwxyz')
    sorted_string = sorted(random_string)

    with open("sorted_text.txt", "w") as sorted_text:
        for character in sorted_string:


The obvious bottleneck has to be the disk access, right? Well…

Let’s not guess and instead use a program to know exactly what’s going on in your program at runtime. That kind of program is called a profiler, and it will save you time on trial and error optimization. A profiler is a handy tool that will automatically time and count each execution of each piece of your code. Your first reflex must always be to profile your code before trying to guess which part to optimize.

Whole program profiling

Any IDE worth its salt will have an integrated profiler ready to launch at the click of a button.

In the command line however, you must run your whole program inside the profiler, like this:

python -m cProfile -s tottime

This will run your program unchanged, and then display the profiling’s result as the table below:

         40000054 function calls in 11.362 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 10000000    4.137    0.000    5.166    0.000
        1    3.442    3.442   11.337   11.337
        1    1.649    1.649    1.649    1.649 {sorted}
 10000000    0.960    0.000    0.960    0.000 {method 'write' of 'file' objects}
 10000000    0.547    0.000    0.547    0.000 {method 'random' of '_random.Random' objects}
 10000000    0.482    0.000    0.482    0.000 {len}
        1    0.121    0.121    0.121    0.121 {range}
        1    0.021    0.021   11.362   11.362<module>)

These results are sorted by total time usage for each element (That’s where -s tottime comes from).

Which means that you really want to start optimizing those big time wasters at the top first!

So here, looking at the tottime column, it looks like the choice function from random module takes almost a third of our total run time.

This is clearly where we must focus our optimization efforts.

But before we jump into optimizing, let’s dive a little bit deeper into profiling.

Targeted profiling

Note that the command above will profile your whole program. If you want to be more precise, you could surround the pieces of code you’d like to profile between

import cProfile
cp = cProfile.Profile()



The resulting table will look similar to the one above, but won’t contain unneeded profiling information from other part of your code.

Since it’s hard to tell where your program spends its time, you should start by profiling your program entirely before narrowing down the profiling to a single section.

More info on the cProfile and Profile modules here.

Line profiling

Sometimes the profiler may lump together methods calls from different parts of your code, or just won’t be precise enough.

In that case, you may want to profile your program line by line, so you’ll know precisely where to focus your optimization efforts.

To do that, first, install the line_profiler tool by running the following command:

pip install line_profiler

Then, modify your code and decorate each function that you want to profile with @profile. Let’s decorate our function so it will look like this:

def write_sorted_letters(nb_letters=10**7):

You can then run your program through the line profiler like this:

kernprof -l -v

The result should look like this:

Total time: 21.4412 s
File: ./
Function: write_sorted_letters at line 5

Line #      Hits         Time    Per Hit   % Time  Line Contents
     5                                             @profile
     6                                             def write_sorted_letters(nb_letters=10**7):
     7         1            1        1.0      0.0      random_string = ''
     8  10000001      3230206        0.3     15.1      for _ in range(nb_letters):
     9  10000000      9352815        0.9     43.6          random_string += random.choice('abcdefghijklmnopqrstuvwxyz')
    10         1      1647254  1647254.0      7.7      sorted_string = sorted(random_string)
    12         1         1334     1334.0      0.0      with open("sorted_text.txt", "w") as sorted_text:
    13  10000001      2899712        0.3     13.5          for character in sorted_string:
    14  10000000      4309926        0.4     20.1              sorted_text.write(character)

First, note that this profiling tool is slowing down our program almost twofold, from 11 seconds to 21.

But there is an upside: we have our program on the right hand side and can see which lines are impacting the performance of our application and how much so.

OK, grandpa profiling is cute, but how about realtime continuous web application profiling?

The integrated profiler is simple enough and can get you where you want in terms of performance for single-thread, local workloads. But a large, threaded, web application is a whole different beast.

Let’s have a look at the awesome Profiling module.

First, install it with sudo pip install profiling, then run your program through it: profiling Don’t forget to remove the @profile decorator that will only work with the line_profiler.

It gives us a detailed tree-like view of the profile at the end of the program’s run:


It’s interactive so you can navigate and fold/unfold each line by using the arrow keys.

There is also a live mode for long running processes such as web servers. You can invoke it like this: profiling live-profile You can interact with it as your program is running to explore your program’s performance profile as it runs.

Profiling resources


Are you losing your time in a loop?

Now that we know where our program spends our precious CPU cycles, we can optimize accordingly.

A note of warning

You should only ever optimize when and where necessary, since the resulting code will often be harder to read and maintain than “slow” code.

Optimizing is trading maintainability for performance.

Numpy to the rescue

So, it looks like this random.choice function is slowing us down…

Let’s replace it with a similar function from the famous numpy library. The arguments to this function are slightly different from random.choice, so the code now looks like this:

"""Sorting a large, randomly generated string and writing it to disk"""
from numpy import random

def write_sorted_letters(nb_letters=10**7):
    letters = tuple('abcdefghijklmnopqrstuvwxyz')

    random_letters = random.choice(letters, nb_letters)

    sorted_string = random_letters.tostring()

    with open("sorted_text.txt", "w") as sorted_text:
        for character in sorted_string:


Numpy is a set of powerful and fast numerical functions that are compiled to native code. It’s also making use of parallel processing when possible.

You can install it by a single command: pip install numpy

So let’s run our new program through profiling to see the improvement:

         10011861 function calls (10011740 primitive calls) in 3.357 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 10000000    1.272    0.000    1.272    0.000 {method 'write' of 'file' objects}
        1    1.268    1.268    3.321    3.321
        1    0.657    0.657    0.657    0.657 {method 'sort' of 'numpy.ndarray' objects}
        1    0.120    0.120    0.120    0.120 {method 'choice' of 'mtrand.RandomState' objects}
        4    0.009    0.002    0.047    0.012<module>)
        1    0.003    0.003    0.003    0.003 {method 'tostring' of 'numpy.ndarray' objects}

Nice! Our new program is almost four times faster than the previous version (3.3 seconds vs 11.362 seconds).

If we look at the total time column again, we now see that the write operation is slowing us down the most.

Let’s fix that by replacing:

with open("sorted_text.txt", "w") as sorted_text:
    for character in sorted_string:


with open("sorted_text.txt", "w") as sorted_text:

This avoids writing to disk character by character, and instead write the whole string at once, making use of the disk cache and buffers to speed up file writing.

Let’s simply time our new program to see the end result:

time python

Which gives us:

real 0m0.874s
user 0m0.852s
sys  0m0.280s

Now our program that used to take more that 11 seconds is running in less than a second! Ain’t that wonderful?

Other performance tricks

Keep in mind these latency numbers in a computer:

 Latency Comparison Numbers
L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns                      14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000   ns        3 us
Send 1K bytes over 1 Gbps network       10,000   ns       10 us
Read 4K randomly from SSD*             150,000   ns      150 us          ~1GB/sec SSD
Read 1 MB sequentially from memory     250,000   ns      250 us
Round trip within same datacenter      500,000   ns      500 us
Read 1 MB sequentially from SSD*     1,000,000   ns    1,000 us    1 ms  ~1GB/sec SSD, 4X memory
Disk seek                           10,000,000   ns   10,000 us   10 ms  20x datacenter roundtrip
Read 1 MB sequentially from disk    20,000,000   ns   20,000 us   20 ms  80x memory, 20X SSD
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms

Taken from Latency Numbers Every Programmer Should Know

Optimizing resources