
Concurrent and Parallel Programming in Python with asyncio, threading, and multiprocessing

This page is mostly based on a short presentation with my friend Caspar. You can get the slides here.

Basics: Concurrency vs. Parallelism

When you read about concurrency, multithreading, or parallel programming, two terms come up that you need to tell apart: concurrency and parallelism.


A concurrent execution is not guaranteed to be parallel. Concurrency only means that you have two or more threads (i.e. sequences of operations) whose operations can be interleaved. Interleaved means that the operation sequences are merged into a single sequence, in any combination that keeps the order within each thread.

For example, you have thread A with operations [A1, A2, A3] and thread B with operations [B1, B2, B3]. In a concurrent execution with a single execution unit (i.e. CPU core), you might run A1 and A2, then B1, then A3 and finish with B2 and B3. However, A1 and B1 (and every other combination of operations) will never be executed at the same time, only one after another.
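To make the idea of interleaving more tangible, the following sketch enumerates every schedule a single executor could choose for the two threads above. The helper name `interleavings` is made up for this illustration; it only models the ordering of operations and does not start any real threads.

```python
from itertools import combinations

def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's internal order."""
    total = len(a) + len(b)
    # Choose which positions of the merged sequence are taken by a's operations
    for positions in combinations(range(total), len(a)):
        taken = set(positions)
        it_a, it_b = iter(a), iter(b)
        yield [next(it_a) if i in taken else next(it_b) for i in range(total)]

thread_a = ["A1", "A2", "A3"]
thread_b = ["B1", "B2", "B3"]
schedules = list(interleavings(thread_a, thread_b))

print(len(schedules))  # 20 possible schedules
print(["A1", "A2", "B1", "A3", "B2", "B3"] in schedules)  # True: the schedule from the text
```

Every one of these schedules is a valid concurrent execution, which is why concurrent code must not rely on one particular ordering.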

The main benefit of concurrency is that multiple threads can each make some progress within a given time span; you don't have to wait for one thread to finish before a second one can advance. For example, you can react to user input in the GUI while doing a lengthy background calculation. This makes your application more responsive. However, you might not see a large performance increase for CPU-intensive tasks, because there is still only a single executor (i.e. CPU core) for your threads. If you need more performance, you have to use parallelism.


When you have a parallel execution, then it will be concurrent as well. The main difference is that two or more operations can be executed at the exact same time, for example on two CPU cores. This has large benefits for the performance of your program: it finishes more quickly, or it has more throughput in the same time.

Examples of Concurrent and/or Parallel Software

  • When your browser runs JavaScript from a website, it's executed concurrently on a single executor (with an event loop). However, internal components of the JavaScript engine like IO are parallel. If you need to do more CPU processing in your own code, you can use Web Workers that don't block the main event loop.
  • The scientific computing package NumPy for Python can do calculations in parallel.

Basics: Python Implementations

Python is an interpreted language. There are multiple interpreters available for Python, like:

  • CPython ("official" reference implementation)
  • MicroPython (for microcontrollers)
  • Jython (Java implementation on JVM)
  • Stackless Python

If you have a standard Python installation, you most likely have CPython installed. It's important to know that CPython compiles user scripts to bytecode before executing them.
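As a small illustration of the bytecode step, the standard library's `dis` module can display what CPython compiled a function to. The function `add` is just a placeholder for this sketch, and the exact instruction names differ between CPython versions.

```python
import dis

def add(a, b):
    return a + b

# The compiled bytecode is attached to the function as a code object
print(type(add.__code__.co_code))  # <class 'bytes'>

# Human-readable listing; the exact instructions differ between CPython versions
dis.dis(add)

# The same information as a list of instruction names
instructions = [ins.opname for ins in dis.get_instructions(add)]
print(instructions)
```

These bytecode instructions are the units that the interpreter executes, which becomes relevant for the GIL discussed next.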

A major difference between implementations is whether they contain the Global Interpreter Lock (GIL) or not.

The Global Interpreter Lock (GIL)

In CPython, the GIL ensures that only one Python thread can run bytecode at the same time. This ensures exclusive access to interpreter internals for the current thread, because accessing the internal data structures is not thread safe.

In the following diagram, you can see two Python threads. Thread 1 takes the GIL first and blocks Thread 2 until:

  • A timeout of 5ms is reached
  • Thread 1 does a syscall (like blocking IO/Network operations or calling time.sleep())
  • Thread 1 calls a special library function from NumPy, SciPy, zlib, ...
    • Note: Some functions of these libraries are implemented in C in such a way that they don't require the GIL while doing CPU-intensive work.
flowchart LR
    classDef PythonInterpreter stroke:green,stroke-width:2px,stroke-dasharray: 5 5
    subgraph Int1[Python Interpreter]
        direction BT
        GIL("🔒 GIL")
        T1(Thread 1) -- 1. --> GIL
        T2(Thread 2) -- 2. ---x GIL
        linkStyle 1 stroke:red,stroke-dasharray: 3 3;
    end
     class Int1 PythonInterpreter

In most cases, this is not a problem for performance, as blocking waits for IO completion have more impact. However, if you implement a CPU-intensive task purely in Python (e.g. image processing or calculations without external libraries), you might run into a bottleneck.
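The 5 ms timeout mentioned above corresponds to what CPython calls the switch interval, which can be inspected with `sys.getswitchinterval()`. The sketch below reads it and temporarily changes it; the 10 ms value is an arbitrary choice for demonstration, not a recommendation.

```python
import sys

# Interval after which CPython asks the running thread to release the GIL
interval = sys.getswitchinterval()
print(f"switch interval: {interval * 1000:.1f} ms")

# The value can be tuned, e.g. to 10 ms; restore the previous value afterwards
sys.setswitchinterval(0.010)
changed = sys.getswitchinterval()
sys.setswitchinterval(interval)
print(f"temporarily changed to: {changed * 1000:.1f} ms")
```

On a default CPython installation the first line typically reports 5.0 ms.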

Showcase of Libraries

Threading (thread based)

The threading library is a simple library to create threads that run concurrently. These threads are kernel-level threads, not user-level threads. As explained above, parallelism is limited due to the GIL.
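One way to observe that these are kernel-level threads is to compare the thread IDs assigned by the operating system. This is a minimal sketch, assuming Python 3.8 or newer on a platform that supports `threading.get_native_id()`; the `worker` function and the barrier are only there for the demonstration.

```python
import threading

native_ids = []
barrier = threading.Barrier(2)  # keeps both workers alive at the same time

def worker():
    native_ids.append(threading.get_native_id())  # thread ID assigned by the OS
    barrier.wait()

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("main thread:", threading.get_native_id())
print("workers:    ", native_ids)
```

Each worker reports its own OS-level ID, different from the main thread's, which shows that the kernel schedules them as separate threads.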

flowchart LR
    classDef PythonInterpreter stroke:green,stroke-width:2px,stroke-dasharray: 5 5
    subgraph Int1[Python Interpreter]
        direction BT
        GIL("🔒 GIL")
        T1(Thread 1) -- 1. --> GIL
        T2(Thread 2) -- 2. --> GIL
    end
     class Int1 PythonInterpreter

Using the threading library is straightforward, as you can see in the following example:

from threading import Thread

def my_func(line: str):
    print(f"Output: {line}")

t1 = Thread(target=my_func, args=("test",))
t2 = Thread(target=my_func, args=("test2",))

# Start both threads; they now run concurrently with the main thread
t1.start()
t2.start()

# ... do something else

# Wait until Thread 1 and 2 are finished
t1.join()
t2.join()

You can also inherit from Thread if you prefer an object-oriented style:

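# Alternative: Create a subclass of Thread
from threading import Thread

class MyThread(Thread):
    def run(self):
        # Code placed here runs in the new thread once start() is called
        # ...
        pass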

asyncio (coroutine based)

The asyncio library follows a different paradigm, based on coroutines and an event loop. It also relies on the syntax keywords async def and await. Typical use cases are lightweight IO tasks, like handling HTTP requests in a web server.

flowchart LR
    classDef PythonInterpreter stroke:green,stroke-width:2px,stroke-dasharray: 5 5
    subgraph Int1[Python Interpreter]
        direction BT
        subgraph T[Thread 1]
            direction LR
            EL("🔁<br/>Event Loop") -- get new task --> TQ("🗄️ <br/>Task Queue")
            EL -- run task asynchronously --> EL
        end
        T --> GIL("🔒GIL")
    end
     class Int1 PythonInterpreter

When you use asyncio, you have to consider the two following "contexts":

  1. The normal context: This context is what you're used to while programming Python. You can call normal functions declared with def, where the function call blocks until the function returns. However, you can't call async def functions directly.
  2. The async context: This is your context inside of an async def function. You can call normal def functions as usual, but now you can call other async def functions as well. These async def functions are then executed asynchronously, and when you need their result, you can wait for them with await.

When you declare a function with async def, it is considered a native coroutine function. When you call this function, it returns a coroutine object. However, the coroutine won't run automatically. There are three ways to run a coroutine object:

  1. Call asyncio.run(coroutine_object) from a normal context
  2. Use await awaitable_object from an async context (an awaitable object can be a coroutine object or a task)
  3. Create a task with asyncio.create_task(coroutine_object) from an async context
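The point that a coroutine does not run automatically can be checked with a short experiment. In this sketch, `greet` and the `calls` list are placeholder names; the list simply records whether the function body has executed.

```python
import asyncio
import inspect

calls = []

async def greet():
    calls.append("ran")
    return "hello"

coroutine_object = greet()  # only creates the coroutine object
print(inspect.iscoroutine(coroutine_object))  # True
print(calls)  # []: the body has not run yet

result = asyncio.run(coroutine_object)  # run it from the normal context
print(result, calls)  # hello ['ran']
```

If the coroutine object were never awaited or run, Python would emit a "coroutine ... was never awaited" warning when the object is discarded.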

The following example shows how you can run coroutine objects by creating a task. By creating a task, you can guarantee that your coroutine will run sometime during the lifetime of your program. However, you have to store a reference to your task somewhere to prevent the garbage collector from freeing the task before it can be executed (see also The Heisenbug lurking in your async code).

import asyncio


async def calc_coro():
    await asyncio.sleep(2) # some asynchronous operation
    print("calc done")
    return "foo"

async def main():
    coroutine_object = calc_coro() # 2. call coroutine function to get coroutine object
    task = asyncio.create_task(coroutine_object) # 3. run obtained coroutine object with task
    print("do other stuff")

    await task # can be skipped if completion / result of task is not important
    # Calling task.result() before the task is done raises InvalidStateError (result is not set)
    calcresult = task.result()
    print(f"Result of calculation: {calcresult}")

if __name__ == "__main__": # 1. run coroutine from normal context
    asyncio.run(main())
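The advice about keeping a reference to each task could look roughly like the following sketch. The names `background_tasks`, `work`, and `finished` are chosen for this example, and the final `asyncio.sleep(0.1)` merely stands in for the rest of a longer-running program.

```python
import asyncio

background_tasks = set()  # strong references keep pending tasks alive
finished = []

async def work(n: int):
    await asyncio.sleep(0.01)
    finished.append(n)

async def main():
    for n in range(3):
        task = asyncio.create_task(work(n))
        background_tasks.add(task)
        # Drop the reference once the task is done, so the set does not grow
        task.add_done_callback(background_tasks.discard)
    await asyncio.sleep(0.1)  # stand-in for the rest of the program

asyncio.run(main())
print(sorted(finished))       # [0, 1, 2]
print(len(background_tasks))  # 0
```

The done callback removes each task from the set after it completes, so references are only held for as long as they are needed.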

A more modern and simpler alternative to storing Task references in a global list is using a TaskGroup. On leaving the async with block, the group waits until all tasks created in it have finished.

import asyncio

async def mylog(line: str):
    await asyncio.sleep(1)
    print("Output: " + line)

async def main():
    # New in Python 3.11
    async with asyncio.TaskGroup() as tg:
        task1 = tg.create_task(mylog("coro1"))
        task2 = tg.create_task(mylog("coro2"))
    print("all tasks completed")

if __name__ == "__main__":
    asyncio.run(main())

Multiprocessing (process based)

The previous two libraries aren't suited for CPU-intensive tasks, as they are limited by the GIL. So what options do you have if you simply need more performance for your Python program? For this, you can use the multiprocessing library. Instead of creating new threads, this library creates new processes, each running its own Python interpreter with its own GIL, so the processes don't block each other. This means that your code can actually run in parallel, instead of only concurrently.
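As a rough sketch of how a CPU-intensive job might be spread over several processes, the example below uses `multiprocessing.Pool`, a convenience class from the same library. The workload `count_primes`, the input sizes, and the choice of four worker processes are all assumptions made for this illustration.

```python
from multiprocessing import Pool

def count_primes(limit: int) -> int:
    """CPU-bound toy workload: count the primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    limits = [20_000, 30_000, 40_000, 50_000]
    # Each input is handled by one of the worker processes
    with Pool(processes=4) as pool:
        results = pool.map(count_primes, limits)
    print(dict(zip(limits, results)))
```

Whether this is actually faster than a plain loop depends on the number of available CPU cores and on the cost of starting the processes, so it is worth measuring for your own workload.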

flowchart LR
    classDef PythonInterpreter stroke:green,stroke-width:2px,stroke-dasharray: 5 5
    subgraph MP[Multiprocessing]
        subgraph Int1[Python Interpreter]
            direction BT
            T1(Thread 1) --> GIL1("🔒GIL")
        end
        subgraph Int2[Python Interpreter]
            direction BT
            T2(Thread 2) --> GIL2("🔒GIL")
        end
        class Int1,Int2 PythonInterpreter
    end

Similar to the threading library, you can pass a target function that the new process should run:

from multiprocessing import Process

def my_process():
    print("this is a second python interpreter")

if __name__ == "__main__":
    p = Process(target=my_process)
    p.start()  # launch the child process
    print("this is the first python interpreter")
    p.join()   # wait for the child process to exit

However, as they are now two separate processes, you can't access the same memory (i.e. variables) anymore. If the processes have to communicate with each other, you can use a Queue:

from multiprocessing import Process, Queue

def my_process(q):
    # sends data through the queue
    q.put(["python", "is", "cool"])

if __name__ == "__main__":
    q = Queue()
    # create a new process -> separate Python interpreter
    p = Process(target=my_process, args=(q,))
    p.start()
    # receive the data in the parent process (blocks until an item is available)
    print(q.get())
    p.join()

Alternatively, you can use a Pipe. The main differences between a Queue and a Pipe are:

  • A Pipe can only have two endpoints (and thus has better performance).
  • A Queue can have multiple producers and consumers.
from multiprocessing import Process, Pipe

def my_process(pipe):
    pipe.send(["python", "is", "cool"])

if __name__ == "__main__":
    parent_pipe, child_pipe = Pipe()
    p = Process(target=my_process, args=(child_pipe,))