Python 3.12 introduced a new API for “sub interpreters”, a different parallel execution model for Python that offers a nice compromise: the true parallelism of multiprocessing, but with a much faster startup time. In this post, I'll explain what a sub interpreter is, why it's important for parallel code execution in Python and how it compares with other approaches.
Python’s system architecture is roughly made up of three parts:
To learn more about this, you should read the “Parallelism and Concurrency” chapter of my book CPython Internals.
Since Python 1.5, there has been a C-API to have multiple interpreters, but this functionality was severely limited by the GIL and didn't really enable true parallelism. As a consequence, the most commonly used technique for running code in parallel (without third party libraries) is to use the multiprocessing module.
In 2017, CPython core developers proposed to change the structure of interpreters so that they were better isolated from the owning Python process and could operate in parallel. The actual work to achieve this is pretty huge (it isn't finished 6 years later) and is split into two PEPs: PEP684, which makes the GIL per-interpreter, and PEP554, which provides an API to create interpreters and share data between them.
The GIL is the “Global Interpreter Lock”, a lock in a Python process which means that only 1 instruction can execute at any time, even if the process has multiple threads. This effectively means that even if you start 4 Python threads and run them concurrently on your nice 4-core CPU, only 1 thread will be running at any one time.
You can see a simple test by creating a numpy array of integers and crudely calculating the distance of each value from 50:
import numpy

# Create a random array of 100,000 integers between 0 and 100
a = numpy.random.randint(0, 100, 100_000)

for x in a:
    abs(x - 50)
In theory, you would expect (at least with a language like C) that splitting the work into chunks and distributing those chunks to threads would make execution faster:
import numpy, threading

def simple_abs_range(vals):
    # The same crude workload, applied to one chunk of the array
    for x in vals:
        abs(x - 50)

a = numpy.random.randint(0, 100, 100_000)
threads = []

# Split the array into 100 chunks and start a thread for each
for ar in numpy.split(a, 100):
    t = threading.Thread(target=simple_abs_range, args=(ar,))
    t.start()
    threads.append(t)

for t in threads:
    t.join()
In practice, the second example is twice as slow. That's because all those threads are bound to the same GIL and only 1 of them will execute at any time. The extra time is all the song-and-dance of spawning threads to accomplish very little.
Despite its name, the GIL was never really a lock in the interpreter state. PEP684 changed that by stopping the sharing of the GIL between interpreters, so that each interpreter in a Python process has its own GIL and can therefore run in parallel. A major reason the per-interpreter GIL work took many years is that CPython has internally relied upon the GIL as a source of thread-safety. This took the form of many C extensions having globally shared state. If you were to introduce parallelism within the same Python process and two interpreters tried to write to the same memory space, bad things would happen.
In Python 3.12, and continuing in Python 3.13, the Python standard library extensions written in C are being tested and any global shared state is being moved to a new API that puts that state within either the module state or the interpreter state. Even when this work is completed, third-party C extensions will likely have to be tested in sub interpreters (I maintain 1 library written in C++ and it required changes).
The very recently approved proposal PEP703, to make the GIL optional in CPython, works collaboratively with the per-interpreter GIL. Both proposals share the same prerequisites:
Another important point is that whilst PEP703 has been accepted, it was accepted with the caveat that the change sits behind an optional compile-time flag, and if it proves too problematic or complex then it will be reverted in future. The per-interpreter GIL, on the other hand, is largely complete and doesn't require an alternative compile-time flag in CPython.
I will write a lot more about PEP703 in the future. Later in this post I’ll share some code where I’m going to combine the two approaches together.
The Python standard library has a few options for concurrent programming, depending on some factors:
Here are the models:
I gave a talk on this at PyCon APAC 2023, so check that out for a detailed verbal explanation.
Or, in a table:
| Model | Execution | Start-up time | Data Exchange |
|---|---|---|---|
| threads | Parallel * | small | Any |
| coroutines | Concurrent | smallest | Any |
| Async functions | Concurrent | smallest | Any |
| Greenlets | Concurrent | smallest | Any |
| multiprocessing | Parallel | large | Serialization |
| Sub Interpreters | Parallel | medium | Serialization or Shared Memory |
* As we explored, threads are only parallel for IO-bound tasks
In a simple benchmark, I measured the time to create:
Here are the results:
This benchmark showed that threading is about 100 times faster to start than sub interpreters, which are in turn about 10 times faster to start than multiprocessing. What I'm hoping isn't lost in this chart is how much closer sub interpreters are to threads than to multiprocessing.
This benchmark also doesn't measure the performance of data-sharing (which is much faster in sub interpreters than multiprocessing) or the memory overhead (which again is significantly lower).
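If you want a rough feel for these start-up costs yourself, here is a minimal sketch of how such a measurement can be done with timeit. This isn't the exact harness used for the chart above, just an illustration of the idea:

import timeit

# Time starting (and joining) a thread that does nothing
thread_time = timeit.timeit(
    "t = Thread(target=lambda: None); t.start(); t.join()",
    setup="from threading import Thread",
    number=100,
) / 100

# Time creating and destroying a sub interpreter (experimental API, Python 3.12+)
interp_time = timeit.timeit(
    "interpreters.destroy(interpreters.create())",
    setup="import _xxsubinterpreters as interpreters",
    number=100,
) / 100

# Timing multiprocessing.Process the same way also works, but it needs an
# importable target function and an `if __name__ == "__main__":` guard.
print(f"thread: {thread_time:.6f}s, sub interpreter: {interp_time:.6f}s")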
Running nothing in parallel isn’t a very useful benchmark for purposes other than measuring the minimum time to start. Next, I benchmarked a CPU-intensive workload to calculate Pi to 2000 decimal places.
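The exact workload isn't important, as long as it is pure-Python and CPU-bound. As a stand-in (not necessarily the code used in my benchmark), the pi() recipe from the decimal module documentation computes Pi to whatever precision the decimal context is set to:

from decimal import Decimal, getcontext

def pi() -> Decimal:
    # Adapted from the recipe in the decimal module documentation
    getcontext().prec += 2  # extra digits for intermediate steps
    three = Decimal(3)
    lasts, t, s, n, na, d, da = 0, three, 3, 1, 0, 0, 24
    while s != lasts:
        lasts = s
        n, na = n + na, na + 8
        d, da = d + da, da + 32
        t = (t * n) / d
        s += t
    getcontext().prec -= 2
    return +s  # unary plus rounds back to the original precision

getcontext().prec = 2000  # 2,000 decimal places
pi()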
So is the parallel version always faster? Alas, no. Going back to the first benchmark, sub interpreters are still 100 times slower to start than threads. So if the task is really small, for example calculating Pi to 200 digits, the benefits of parallelism don't outweigh the startup overhead and threading is still faster:
To visually explain the idea that there is a “cut-off” point beyond which parallelism is faster, this graph shows the workload size against the rate of growth of the execution time. The cut-off point isn't a fixed value because it depends on the CPU, the background tasks and lots of other variables.
Another important point is that multiprocessing is often used in a model where the processes are long-running and handed lots of tasks instead of being spawned and destroyed for a single workload. One great example is Gunicorn, the popular Python web server. Gunicorn will spawn “workers” using multiprocessing and those workers will live for the lifetime of the main process. The time to start a process or a sub interpreter then becomes irrelevant (at 89 ms or 1 second) when the web worker can be running for weeks, months or years.
The ideal way to use these parallel workers for small tasks (like handle a single web request) is to keep them running and use a main process to coordinate and distribute the workload:
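As a minimal sketch of that pattern (using multiprocessing's built-in Pool rather than anything sub interpreter specific, and a made-up handle() task for illustration), the workers are started once and then fed many small tasks:

from multiprocessing import Pool

def handle(task: int) -> int:
    # Stand-in for a small unit of work, e.g. handling one web request
    return task * 2

if __name__ == "__main__":
    # The pool of 4 worker processes is started once and reused for every task
    with Pool(processes=4) as pool:
        results = pool.map(handle, range(100))
        print(sum(results))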
Both multiprocessing processes and interpreters have their own import state. This is drastically different to threads and coroutines. When you await an async function, you don't need to worry about whether that coroutine has imported the required modules. The same applies for threads. For example, you can import something in your module and reference it from inside the thread function:
import threading
from super.duper.module import cool_function

def worker(info):
    # cool_function is already imported in this interpreter's state
    cool_function()

info = {'a': 1}
thread = threading.Thread(target=worker, args=(info,))
thread.start()
Half of the time taken to start an interpreter is taken up running “site import”. This is a special module called site.py that lives within the Python installation. Interpreters have their own caches, their own builtins; they are effectively mini Python processes. Starting a thread or a coroutine is so fast because it doesn't have to do any of that work (it shares that state with the owning interpreter), but it's bound by the lock and isn't parallel.
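To make that isolation concrete (jumping ahead slightly to the experimental API covered later in this post), a module imported in the main interpreter is not visible inside a sub interpreter; each interpreter does its own imports:

import json  # imported in the main interpreter only
import _xxsubinterpreters as interpreters

interp_id = interpreters.create()
# The json import above is not visible here; the sub interpreter imports it itself
interpreters.run_string(interp_id, "import json; print(json.dumps({'a': 1}))")
interpreters.destroy(interp_id)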
The next consideration when using a parallel execution model like multiprocessing or sub interpreters is how you share data. Once you get over the hurdle of starting one, this quickly becomes the most important point. You have two questions to answer:
Let’s address those individually.
Whether you are using sub interpreters or multiprocessing, you cannot simply send existing Python objects to worker processes.
Multiprocessing uses pickle by default. When you start a process or use a process pool, you can use pipes, queues and shared memory as mechanisms for sending data to and from the workers and the main process. These mechanisms revolve around pickling. Pickle is the built-in serialization library for Python that can convert most Python objects into a byte string and back into a Python object.
Pickle is very flexible. You can serialize a lot of different types of Python objects (but not all) and Python objects can even define a method for how they should be serialized. It also handles nested objects and properties. However, with that flexibility comes a performance hit. Pickle is slow. So if you have a worker model that relies upon continuous inter-worker communication of complex pickled data, you'll likely see a bottleneck.
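For example, here is a quick sketch of a pickle round trip, including a class that customises how it is serialized (the Point class is purely illustrative):

import pickle

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    # Objects can control how they are pickled
    def __getstate__(self):
        return {"x": self.x, "y": self.y}

    def __setstate__(self, state):
        self.x, self.y = state["x"], state["y"]

data = {"points": [Point(1, 2), Point(3, 4)], "label": "example"}
payload = pickle.dumps(data)      # bytes that can cross a process boundary
restored = pickle.loads(payload)  # back to equivalent Python objects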
Sub interpreters can accept pickled data. They also have a second mechanism called shared data. Shared data is a high-speed shared memory space that interpreters can write to, to share data with other interpreters. It supports only immutable types: int, float, bool, bytes, str, None and tuples of those types.
I implemented the tuple sharing mechanism last week so that there was at least one sequence type available.
To share data with an interpreter, you can either set it as initialization data or you can send it through a channel.
Error handling is still a work in progress for sub interpreters. If a sub interpreter crashes, it won't kill the main interpreter. Exceptions can be raised up to the main interpreter and handled gracefully. The details of this are still being worked out.
In Python 3.12, the sub interpreter API is experimental, and some of the things I'm going to mention were only implemented a week ago, so they haven't been released yet; you'll see them in future releases of 3.13. If you want to follow along at home, you can compile the main branch of CPython.
The interpreters module proposed in PEP554 isn't finished. A version of it is a secret, hidden module called _xxsubinterpreters. In all my code I rename the import to interpreters, because that's what it'll be called in future.
You can create, run and stop a sub interpreter with the run() function, which takes a string or a simple function:
import _xxsubinterpreters as interpreters
interpreters.run('''
print("Hello World")
''')
Starting a sub interpreter is a blocking operation, so most of the time you want to start one inside a thread.
from threading import Thread
import _xxsubinterpreters as interpreters
t = Thread(target=interpreters.run, args=("print('hello world')",))
t.start()
To start an interpreter that sticks around, you can use interpreters.create(), which returns the interpreter ID. This ID can be used for subsequent run_string() calls:
import _xxsubinterpreters as interpreters

interp_id = interpreters.create()
interpreters.run_string(interp_id, "print('hello world')")
interpreters.run_string(interp_id, "print('hello universe')")
interpreters.destroy(interp_id)
To share data, you can use the shared argument and provide a dictionary with shareable (int, float, bool, bytes, str, None, tuple) values:
import _xxsubinterpreters as interpreters

interp_id = interpreters.create()

interpreters.run_string(
    interp_id,
    "print(message)",
    shared={
        "message": "hello world!"
    }
)

interpreters.run_string(
    interp_id,
    """
for message in messages:
    print(message)
""",
    shared={
        "messages": ("hello world!", "this", "is", "me")
    }
)

interpreters.destroy(interp_id)
Once an interpreter is running (remember, it's preferable to leave them running), you can share data using a channel. The channels module is also part of PEP554 and is available via another secret import:
import _xxsubinterpreters as interpreters
import _xxinterpchannels as channels

interp_id = interpreters.create()
channel_id = channels.create()

interpreters.run_string(
    interp_id,
    """
import _xxinterpchannels as channels
channels.send(channel_id, 'hello!')
""",
    shared={
        "channel_id": channel_id
    }
)

print(channels.recv(channel_id))
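Coming back to the error handling point from earlier: if the code you run inside an interpreter raises, the failure surfaces in the calling interpreter rather than crashing it. A minimal sketch, assuming the RunFailedError exception currently exposed by the experimental module:

import _xxsubinterpreters as interpreters

interp_id = interpreters.create()
try:
    interpreters.run_string(interp_id, "raise ValueError('boom')")
except interpreters.RunFailedError as exc:
    # The sub interpreter failed, but the main interpreter carries on
    print("Sub interpreter raised:", exc)
finally:
    interpreters.destroy(interp_id)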
Now let's apply the multiprocessing and threading models to web applications. The Python web servers that sit in front of your framework (like Django, Flask, FastAPI or Quart) use an interface called WSGI for traditional web frameworks and ASGI for async ones.
The web servers listen on an HTTP port for requests, then divvy up the requests to a pool of workers. If you only had 1 worker, then when a user made an HTTP request, everyone else would have to sit and wait until it finished responding to the first request (this is also why you should never ship python manage.py runserver as the web server: it only has 1 worker).
The recommended best practice for Gunicorn is to run multiple Python processes, coordinated by a main process, with each process having a pool of threads:
This design (sometimes called multi-worker-multi-thread) means that using multiprocessing you have 1 GIL for each CPU core, and each worker has a pool of threads to handle incoming requests concurrently. Uvicorn, the async implementation, builds on that by using coroutines to handle concurrency with frameworks that support async.
There are some downsides to this approach. As we explored earlier, threads are not parallel for CPU-bound work, so if 2 threads inside a single worker are both very busy, Python can't “move” or schedule one of them onto a different CPU core.
So, my goal is to replace multiprocessing as the mechanism for the workers with an interpreter. This would have the benefit of using the high-performance shared memory channels API for inter-worker communication, and the workers would be lighter weight, taking up less memory on the host (leaving more memory and resources to process requests).
The second, rather wild goal is to compile CPython 3.13 (main branch) with the PEP703 GIL-less thread implementation to see if we can run GIL-free threads inside this model. I also want to identify issues early and report them upstream (there were a few).
For this experiment, I tried to fork Gunicorn and replace multiprocessing with sub interpreters. What I found was that this would be a huge effort because Gunicorn does have a concept of “workers” and abstracts those in an interface called worker classes. However, it makes some assumptions about the worker class capabilities which sub interpreters don’t fulfill.
Someone on Mastodon suggested I check out Hypercorn, which turned out to be ideal for this test. Hypercorn has an async worker module with a callable that could be imported from inside the interpreter. All I needed to work out was:
So, I roughly followed this design: subclass the threading.Thread class and implement a custom .stop() method that sends the stop signal to the sub interpreter.

The worker class looks like this:
class SubinterpreterWorker(threading.Thread):

    def __init__(self, number: int, config: Config, sockets: Sockets):
        self.worker_number = number
        self.interp = interpreters.create()
        self.channel = channels.create()
        self.config = config  # TODO copy other parameters from config
        self.sockets = sockets
        super().__init__(target=self.run, daemon=True)

    def run(self):
        # Convert insecure sockets to a tuple of tuples because the Sockets type cannot be shared
        insecure_sockets = []
        for s in self.sockets.insecure_sockets:
            insecure_sockets.append((int(s.family), int(s.type), s.proto, s.fileno()))
        interpreters.run_string(
            self.interp,
            interpreter_worker,
            shared={
                'worker_number': self.worker_number,
                'insecure_sockets': tuple(insecure_sockets),
                'application_path': self.config.application_path,
                'workers': self.config.workers,
                'channel_id': self.channel,
            }
        )

    def stop(self):
        print("Sending stop signal to worker {}".format(self.worker_number))
        channels.send(self.channel, "stop")
The sub interpreter daemon code (interpreter_worker) is:
import sys
sys.path.append('experiments')

from hypercorn.asyncio.run import asyncio_worker
from hypercorn.config import Config, Sockets
import asyncio
import threading
import _xxinterpchannels as channels
from socket import socket
import time

shutdown_event = asyncio.Event()

def wait_for_signal():
    # Poll the channel for a "stop" message from the main interpreter
    while True:
        msg = channels.recv(channel_id, default=None)
        if msg == "stop":
            print("Received stop signal, shutting down {}".format(worker_number))
            shutdown_event.set()
            break
        else:
            time.sleep(1)

print("Starting hypercorn worker in subinterpreter {}".format(worker_number))

# Rehydrate the sockets list from the tuple of tuples
_insecure_sockets = []
for s in insecure_sockets:
    _insecure_sockets.append(socket(*s))

hypercorn_sockets = Sockets([], _insecure_sockets, [])

config = Config()
config.application_path = application_path
config.workers = workers

thread = threading.Thread(target=wait_for_signal)
thread.start()

asyncio_worker(config, hypercorn_sockets, shutdown_event=shutdown_event)
The complete code is available on GitHub.
My first cut of this approach still doesn’t have multiple threads inside the sub interpreter (other than the signal thread). I’m building and testing an unstable build of CPython in debug mode. It’s not ready for a comparison performance test - yet.
PEP703 isn't finished yet. The 100+ item todo list issue on GitHub is about 50% complete. Only at the end of that list can the GIL be disabled.
I also discovered a few issues. Firstly, Django won't run at all. Remember earlier in this post I mentioned that some Python C extensions use global shared state? Well, datetime is one of those modules. There is an issue to update it, but it hasn't been merged yet. The consequence is that if you import zoneinfo from a sub interpreter, it will fail. Django uses zoneinfo, so it doesn't even start.
I did have more luck with a very crude FastAPI application and a Flask application. I was able to launch 2, 4 and 10 worker setups with those applications. I also ran a few benchmarks against the FastAPI and Flask applications to check that they could handle 10,000 requests with a concurrency of 20. Both performed admirably, with all my CPU cores beavering away.
I was pleasantly surprised, since I wasn't expecting it to work at all because sub interpreters are so new and the Python ecosystem hasn't been testing them. The next step is to test some more complex web applications, continue to report crashes and issues, then get this web worker into a state where it's stable enough to benchmark.
Then I might just submit a PR to Hypercorn for Python 3.13’s release next year.