One of my biggest criticisms of Python is its slow cold-start time. I especially notice this when I use it as a scripting language for CLIs. The startup time of a simple .py script can easily be in the 100 to 300 ms range, whereas a C, Rust, or Go program with the same functionality can start in under 10 ms. This becomes even more frustrating when piping several scripts together, because the startup latencies add up quickly.
Yes, that is also my feeling. But comparing an interpreted language with a compiled one is not really fair.<p>Here is my quick benchmark. I refrain from using Python for most scripting/prototyping tasks but really like Janet [0]. Here is a comparison for printing the current time as a Unix epoch:<p><pre><code> $ hyperfine --shell=none --warmup 2 "python3 -c 'import time;print(time.time())'" "janet -e '(print (os/time))'"
Benchmark 1: python3 -c 'import time;print(time.time())'
Time (mean ± σ): 22.3 ms ± 0.9 ms [User: 12.1 ms, System: 4.2 ms]
Range (min … max): 20.8 ms … 25.6 ms 126 runs
Benchmark 2: janet -e '(print (os/time))'
Time (mean ± σ): 3.9 ms ± 0.2 ms [User: 1.2 ms, System: 0.5 ms]
Range (min … max): 3.6 ms … 5.1 ms 699 runs
Summary
'janet -e '(print (os/time))'' ran
5.75 ± 0.39 times faster than 'python3 -c 'import time;print(time.time())''
</code></pre>
[0]: <a href="https://janet-lang.org/" rel="nofollow">https://janet-lang.org/</a>
> The startup time of a simple .py script can easily be in the 100 to 300 ms range<p>I can't say I've ever experienced this. Are you sure it's not related to other things in the script?<p>I wrote a single-file Python script; it's a few thousand lines long. It can process a 10,000-line CSV file and do a lot of calculations, to the point where I wrote an entire CLI income / expense tracker with it[0].<p>The end-to-end time of the command is 100ms to process those 10k lines, measured with `time`. That's on hardware from 2014 using Python 3.13 too. It takes ~550ms to fully process 100k lines as well. I spent zero time optimizing the script but did try to avoid common pitfalls (deeply nested loops, etc.).<p>[0]: <a href="https://github.com/nickjj/plutus" rel="nofollow">https://github.com/nickjj/plutus</a>
> I can't say I've ever experienced this. Are you sure it's not related to other things in the script? I wrote a single file Python script, it's a few thousand lines long.<p>It's primarily because of module imports. It's worse with many small files than with a few large ones (Python 3 adds a little extra overhead because of the additional system calls and complexity in the import process needed to handle `__pycache__` folders). A great way to demonstrate it is to ask pip to do something trivial (like `pip --version`, or `pip install` with no packages specified), or to compare the performance of pip installed in a venv to pip used cross-environment (with `--python`). Pip imports literally hundreds of modules at startup, and hundreds more the first time it hits the network.
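If you want to see it for yourself, recent CPython versions have an `-X importtime` flag that writes one line per imported module to stderr; something like this gives a rough count of what pip drags in (exact numbers vary by pip and Python version):<p><pre><code> $ python3 -X importtime -m pip --version 2> pip-imports.txt
 $ wc -l pip-imports.txt
</code></pre>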
Makes sense; most of my scripts are standalone, zero-dependency scripts that import a few things from the standard library.<p>`time pip3 --version` takes 230ms on my machine.
And it's worse if your Python libraries are on network storage, like in a user's homedir in a shared compute environment.
Here is a benchmark <a href="https://github.com/bdrung/startup-time" rel="nofollow">https://github.com/bdrung/startup-time</a><p>This benchmark is a little bit outdated, but the problem remains the same.<p>Interpreter initialization: Python builds and initializes its entire virtual machine and built-in object structures at startup. Native programs already have their machine code ready and need very little runtime scaffolding.<p>Dynamic import system: Python’s module import machinery dynamically locates, loads, parses, compiles, and executes modules at runtime. A compiled binary has already linked its dependencies.<p>Heavy standard library usage: Many Python programs import large parts of the standard library or third-party packages at startup, each of which runs top-level initialization code.<p>This is especially noticeable if you do not run on an M1 Ultra but on some slower hardware. From the results on a Raspberry Pi 3:<p>C: 2.19 ms<p>Go: 4.10 ms<p>Python 3: 197.79 ms<p>That is about 200 ms of startup latency for a print("Hello World!") in Python 3.
Interesting. The tests use Python 3.6, which on my system replicates the huge difference shown in startup time using and not using `-S`. From 3.7 onwards, it makes a much smaller percentage change. There's also a noticeable difference the first time; I guess because of Linux caching various things. (That effect is much bigger with Rust executables, such as uv, in my testing.)<p>Anyway, your analysis of causes reads like something AI generated and pasted in. It's awkward in the context of the rest of your post, and 2 of the 3 points are clearly irrelevant to a "hello world" benchmark.
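For reference, the `-S` comparison I mean is just this (`-S` skips the automatic `import site` at startup):<p><pre><code> $ hyperfine --warmup 3 "python3 -c pass" "python3 -S -c pass"
</code></pre>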
A Python file with<p><pre><code> import requests
</code></pre>
Takes 250ms on my i9 on Python 3.13<p>A Go program with<p><pre><code> package main
import (
_ "net/http"
)
func main() {
}
</code></pre>
takes < 10ms.
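A rough way to separate the interpreter's own startup from the cost of the import itself is to time the import inside an already-running interpreter (this assumes `requests` is installed; numbers are obviously machine-dependent):<p><pre><code> import time

 t0 = time.perf_counter()
 import requests  # noqa: E402
 # Whatever is missing between this number and the script's total
 # wall-clock time is roughly interpreter/site startup.
 print(f"import requests alone: {(time.perf_counter() - t0) * 1000:.0f} ms")
</code></pre>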
Just a guess, but perhaps the startup time is incurred before `time` is even imported?
It might not be the fastest, but I suspect something weird is happening with how your Python interpreter is being resolved.<p>For instance, `uv run` has its own fair share of overhead.<p><pre><code> $ hyperfine --warmup 10 -L py "uv run python,~/.local/bin/python3.14,/usr/local/bin/python3.12,~/.local/share/uv/python/pypy-3.11.13-macos-aarch64-none/bin/pypy3.11" "{py} -c 'exit(0)'"
Benchmark 1: uv run python -c 'exit(0)'
Time (mean ± σ): 58.4 ms ± 19.3 ms [User: 26.4 ms, System: 21.7 ms]
Range (min … max): 48.2 ms … 138.0 ms 50 runs
Benchmark 2: ~/.local/bin/python3.14 -c 'exit(0)'
Time (mean ± σ): 13.3 ms ± 6.9 ms [User: 8.0 ms, System: 2.5 ms]
Range (min … max): 9.9 ms … 53.7 ms 174 runs
Benchmark 3: /usr/local/bin/python3.12 -c 'exit(0)'
Time (mean ± σ): 16.4 ms ± 7.6 ms [User: 8.9 ms, System: 3.7 ms]
Range (min … max): 12.2 ms … 65.2 ms 152 runs
Benchmark 4: ~/.local/share/uv/python/pypy-3.11.13-macos-aarch64-none/bin/pypy3.11 -c 'exit(0)'
Time (mean ± σ): 18.6 ms ± 7.4 ms [User: 10.0 ms, System: 5.0 ms]
Range (min … max): 14.4 ms … 63.5 ms 138 runs
Summary
~/.local/bin/python3.14 -c 'exit(0)' ran
1.23 ± 0.86 times faster than /usr/local/bin/python3.12 -c 'exit(0)'
1.40 ± 0.92 times faster than ~/.local/share/uv/python/pypy-3.11.13-macos-aarch64-none/bin/pypy3.11 -c 'exit(0)'
4.40 ± 2.72 times faster than uv run python -c 'exit(0)'</code></pre>
Run strace on Python starting up and you will see it statting hundreds, if not thousands, of files. That gets much worse the slower your filesystem is.<p>On my Linux system, where all the file attributes are cached, it takes about 12ms to completely start, run a pass statement, and exit.
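For example, with a reasonably recent strace (one that understands the `%file` syscall class), something like this shows the churn; most of it is stat/openat calls made while locating modules and shared libraries:<p><pre><code> $ strace -e trace=%file python3 -c 'pass' 2>&1 | wc -l
</code></pre>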
I don't know why people care so much about a few hundred <i>milliseconds</i> for Python scripts versus compiled languages that take just a tenth of that.<p>Real question: what would you do with the time saved? Are you really in that much of a hurry?
Completely agree on this.<p>Regarding cold starts, I strongly believe V8 snapshots are perhaps not the best way to achieve fast cold starts with Python (they may be if you are tied to using V8, though!), and they will have wide side effects if you go outside the standard packages included in the Pyodide bundle.<p>To put this in perspective: V8 snapshots store the whole state of an application (including its compiled modules). This means that for a Python application using Python (one wasm module) + Pydantic-core (one wasm module) + FastAPI... all of those will be included in one snapshot (as well as the application state). This makes sense for browsers, where you want to be able to inspect/recover everything at once.<p>The issue with this design is that the compiled artifacts and the application state are bundled into a single artifact (this is not great for AOT-designed runtimes, but might be the optimal design for JITs).<p>Ideally, you would separate each of the compiled modules from the state of the application. When you do this, you have some advantages: you can deserialize the compiled modules in parallel, and untie the "deserialization" from recovering the state of the application. This design doesn't adapt that well to the V8 architecture (and how it compiles stuff) when JavaScript is the main driver of the execution; however, it's ideal when you just use WebAssembly.<p>This is what we have done at Wasmer, which allows for cold starts much faster than 1 second. Because we cache each of the compiled modules separately, and recover the state of the application later, we can achieve cold starts that are an order of magnitude faster than Cloudflare's state of the art (when using pydantic, fastapi and httpx).<p>If anyone is curious, here is a blog post where we presented fast cold starts for the application state (note that the deserialization technique for Wasm modules is applied automatically in Wasmer, and we don't showcase it in the blog post): <a href="https://wasmer.io/posts/announcing-instaboot-instant-cold-starts-for-serverless-apps" rel="nofollow">https://wasmer.io/posts/announcing-instaboot-instant-cold-st...</a><p>Note aside: congrats to the Cloudflare team on their work on Python on Workers, it's inspiring to all providers in the space... keep it up and let's keep challenging the status quo!
Big packages shouldn't be imported until the CLI arguments have been parsed and control is handed off to main. There's been work to do this automatically, but it's good hygiene to defer the imports anyway.<p>A modern machine shouldn't take this long, so likely something big is being imported unnecessarily at startup. If the big package itself is the issue, file it on their tracker.
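A minimal sketch of the pattern (the package and command names here are just placeholders): parse the CLI first, and only import the heavy package on the code path that needs it.<p><pre><code> import argparse

 def report(path):
     # Deferred import: you only pay for the big package when this
     # subcommand actually runs.
     import pandas as pd
     print(pd.read_csv(path).describe())

 def main():
     parser = argparse.ArgumentParser()
     sub = parser.add_subparsers(dest="command", required=True)
     sub.add_parser("version")
     report_cmd = sub.add_parser("report")
     report_cmd.add_argument("path")
     args = parser.parse_args()
     if args.command == "report":
         report(args.path)
     else:
         print("1.0.0")

 if __name__ == "__main__":
     main()
</code></pre>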
Are you comparing the startup time of an interpreted language with the startup time of a compiled language? Or do you mean that `time python hello.py` > `( time gcc -O2 -o hello hello.c ) && ( time ./hello )` ?
I'm referring to the startup time as benchmarked in the following manner: <a href="https://github.com/bdrung/startup-time" rel="nofollow">https://github.com/bdrung/startup-time</a>
Here's the thing: I don't really care whether it's because the interpreter has to start up, there's a remote HTTP call, or we scan the disks for integrity; the end-user experience on every run is slower.
You can run .pyc stuff “directly” with some creativity, and there are some tools to pack “executables” that are just chunked blobs of bytecode.
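With stock CPython that can look something like this (file names are made up; note that the stdlib `zipapp` tool bundles source rather than raw bytecode, so it's only a rough stand-in for the packers meant here):<p><pre><code> $ python3 -m compileall -b hello.py   # writes hello.pyc next to hello.py
 $ python3 hello.pyc                   # CPython runs the bytecode directly
 $ python3 -m zipapp myapp/ -m "cli:main" -o myapp.pyz   # single-file "executable"
</code></pre>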
It depends somewhat on what you import, too. Some people would sell their grandmothers to get below 1 s when you start importing NumPy and scikit-learn.
Reminds me of the Mercurial VCS!!