Whisper large-v3-turbo ONNX 4-graph

ONNX export of openai/whisper-large-v3-turbo as 4 split graphs for use with asrjs/speech-recognition.

Graph structure

Graph	Input	Output	Purpose
`encoder`	mel spectrogram	encoder hidden states	Audio encoding
`decoder_init`	start token + encoder states	logits + KV cache	First decode step
`decoder_step`	prev token + encoder states + KV cache	logits + KV cache	Recurrent decode
`decoder_align`	encoder states + tokens	cross-attention weights	Word timestamps (DTW)
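
The table above implies a simple control flow: run `encoder` once, seed the decoder with `decoder_init`, then loop on `decoder_step`. The following is a minimal sketch of that flow with greedy token selection. The `graphs` object, its method names, and the `startToken` / `endToken` parameters are hypothetical stand-ins, not the asrjs/speech-recognition API; in a real setup each function would wrap an ONNX Runtime session.

```javascript
// Index of the largest logit (greedy token choice).
function argmax(values) {
  let best = 0;
  for (let i = 1; i < values.length; i++) {
    if (values[i] > values[best]) best = i;
  }
  return best;
}

// `graphs` is a hypothetical wrapper: one async function per ONNX graph.
// maxTokens defaults to 448, Whisper's text context length, as an upper bound.
async function greedyDecode(graphs, mel, { startToken, endToken, maxTokens = 448 }) {
  // encoder: mel spectrogram -> encoder hidden states (once per audio window)
  const encoderStates = await graphs.encoder(mel);

  // decoder_init: start token + encoder states -> logits + KV cache
  let { logits, kvCache } = await graphs.decoderInit(startToken, encoderStates);

  const tokens = [];
  while (tokens.length < maxTokens) {
    const next = argmax(logits);
    if (next === endToken) break;
    tokens.push(next);
    // decoder_step: prev token + encoder states + KV cache -> logits + KV cache
    ({ logits, kvCache } = await graphs.decoderStep(next, encoderStates, kvCache));
  }

  // decoder_align would run afterwards on (encoderStates, tokens)
  // when word timestamps are requested.
  return { tokens, encoderStates };
}

// Stub graphs that replay a fixed token script, so the loop runs without a model.
const script = [3, 1, 2]; // 2 plays the role of the end-of-text token
const oneHot = (id) => {
  const logits = new Float32Array(4);
  logits[id] = 1;
  return logits;
};
const stubGraphs = {
  encoder: async (mel) => ({ frames: mel.length }),
  decoderInit: async () => ({ logits: oneHot(script[0]), kvCache: { step: 1 } }),
  decoderStep: async (_token, _enc, kv) => ({
    logits: oneHot(script[kv.step]),
    kvCache: { step: kv.step + 1 },
  }),
};
```

Calling `greedyDecode(stubGraphs, new Float32Array(8), { startToken: 0, endToken: 2 })` exercises the loop end to end. Splitting the first step into its own graph means the recurrent graph can always assume a KV cache input exists.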

fp16 fix

The original fp16 encoder stored its 1.2 GB of weights inline in the .onnx file, which caused two problems:

It failed in ORT Web/WASM with std::bad_alloc (exceeds the ~1.5 GB WASM heap).
It blocked the persistent multi-session lifecycle needed for streaming.

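All fp16 graphs were converted to the ONNX external data format, which keeps the graph definition in a small .onnx file and stores the weights in a separate data file: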

Graph	ONNX size	External data
encoder	0.4 MB	1.2 GB
decoder_init	0.4 MB	254 MB
decoder_step	0.4 MB	127 MB
decoder_align	0.4 MB	127 MB
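
As a rough check on the numbers above, the weight sizes can be compared against the ~1.5 GB WASM heap. This is a back-of-the-envelope sketch only: the function and constant names are made up for illustration, 1.2 GB is rounded to 1200 MB, and file size is a lower bound on what a session actually needs at runtime.

```javascript
// External data sizes for the fp16 graphs, in MB (1.2 GB taken as roughly 1200 MB).
const FP16_WEIGHTS_MB = {
  encoder: 1200,
  decoder_init: 254,
  decoder_step: 127,
  decoder_align: 127,
};

// Approximate WASM heap ceiling quoted in this README.
const WASM_HEAP_MB = 1500;

// Compare weight sizes against a heap budget. A "fits" result is optimistic,
// because runtime buffers and activations are not counted.
function heapPlan(weightsMb, heapMb) {
  const sizes = Object.values(weightsMb);
  const totalMb = sizes.reduce((sum, mb) => sum + mb, 0);
  const largestMb = Math.max(...sizes);
  return {
    totalMb,
    largestMb,
    allGraphsAtOnce: totalMb <= heapMb,
    oneGraphAtATime: largestMb <= heapMb,
  };
}

const plan = heapPlan(FP16_WEIGHTS_MB, WASM_HEAP_MB);
console.log(plan);
```

Under these assumptions the four graphs total about 1.7 GB, so they cannot all be resident in the WASM heap together, while each graph on its own stays under the limit.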

The fix was published as ysdede/whisper-large-v3-turbo-onnx-4graph-v2 (now private) and merged back here.

Inference features

Feature	Description
Greedy decoding	Argmax token selection, single pass
Beam search	Configurable beam size, best-of N, patience, length penalty
Temperature fallback	Temperature escalation 0.0 → 0.2 → 0.4 → 0.6 → 0.8 → 1.0
Word timestamps	DTW alignment via decoder_align cross-attention weights
Language detection	Decoder probability over language tokens (first 30 s)
Token suppression	No-timestamps, silence tokens, custom suppress tokens
Context conditioning	Condition on previous text / prompt
Mixed precision	q8 encoder + fp32 decoder (encoder ~25% faster, no decoder overhead)
3 backends	Native ORT, WebGPU (browser), WASM (fallback)
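
To illustrate the word-timestamp feature, here is a minimal dynamic time warping (DTW) sketch over a token-by-frame attention matrix. It is not the library's implementation: the cost `1 - weight` is a simplification chosen for this example (the reference openai/whisper code normalizes and negates the attention matrix instead), the function names are hypothetical, and the 0.02 s per frame figure assumes Whisper's usual 1500 encoder frames per 30 s window.

```javascript
// weights[t][f]: cross-attention weight of text token t on audio frame f, in [0, 1].
// Returns the minimum-cost monotonic path as [tokenIndex, frameIndex] pairs.
function dtwPath(weights) {
  const T = weights.length;
  const F = weights[0].length;
  const cost = Array.from({ length: T + 1 }, () => new Float64Array(F + 1).fill(Infinity));
  const move = Array.from({ length: T + 1 }, () => new Uint8Array(F + 1));
  cost[0][0] = 0;

  for (let t = 1; t <= T; t++) {
    for (let f = 1; f <= F; f++) {
      // 0 = advance token and frame, 1 = advance token only, 2 = advance frame only
      let best = cost[t - 1][f - 1];
      let step = 0;
      if (cost[t - 1][f] < best) { best = cost[t - 1][f]; step = 1; }
      if (cost[t][f - 1] < best) { best = cost[t][f - 1]; step = 2; }
      cost[t][f] = (1 - weights[t - 1][f - 1]) + best; // assumed cost: 1 - weight
      move[t][f] = step;
    }
  }

  // Walk back from the last cell to the first, then reverse.
  const path = [];
  let t = T;
  let f = F;
  while (t > 0 && f > 0) {
    path.push([t - 1, f - 1]);
    const step = move[t][f];
    if (step === 0) { t--; f--; } else if (step === 1) { t--; } else { f--; }
  }
  return path.reverse();
}

// First frame on the path for each token, converted to seconds.
function tokenStartTimes(weights, secondsPerFrame = 0.02) {
  const starts = new Array(weights.length).fill(null);
  for (const [token, frame] of dtwPath(weights)) {
    if (starts[token] === null) starts[token] = frame * secondsPerFrame;
  }
  return starts;
}

// Toy matrix: three tokens, six frames, each token attending to two frames.
const toyAttention = [
  [0.9, 0.9, 0.1, 0.1, 0.1, 0.1],
  [0.1, 0.1, 0.9, 0.9, 0.1, 0.1],
  [0.1, 0.1, 0.1, 0.1, 0.9, 0.9],
];
console.log(tokenStartTimes(toyAttention));
```

For the toy matrix the path follows the high-attention cells, so the three tokens start at frames 0, 2 and 4. Grouping token start times into words would be a separate step on top of this.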

Precision variants

Variant	Size	Input dtype	Speed (12 s audio, native ORT)
fp32	4.5 GB	float32	1.0× (baseline)
fp16	2.3 GB	float16	Pending benchmark
q8	1.4 GB	float32	~1.25× vs fp32 (encoder)
Mixed (q8 enc + fp32 dec)	—	—	~1.46× vs fp32
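
The fp16 variant takes float16 input, so features computed in float32 (such as the mel spectrogram) need converting first. The sketch below converts float32 values to raw IEEE 754 half-precision bits with round-to-nearest-even. The helper names are hypothetical, and it assumes the runtime accepts float16 tensor data as a `Uint16Array` of raw bits, which is how ONNX Runtime's JavaScript bindings have represented it; check the version in use.

```javascript
const f32Scratch = new Float32Array(1);
const u32Scratch = new Uint32Array(f32Scratch.buffer);

// Convert one number to IEEE 754 half-precision bits (round to nearest even).
function toHalfBits(value) {
  f32Scratch[0] = value;
  const bits = u32Scratch[0];
  const sign = (bits >>> 16) & 0x8000;
  const exp = (bits >>> 23) & 0xff;
  const mant = bits & 0x7fffff;

  if (exp === 0xff) return sign | 0x7c00 | (mant ? 0x200 : 0); // Inf or NaN
  const e = exp - 127 + 15; // re-bias exponent from float32 to float16
  if (e >= 0x1f) return sign | 0x7c00; // too large: overflow to Inf
  if (e <= 0) {
    if (e < -10) return sign; // too small: signed zero
    // Subnormal half: restore the implicit leading 1, then shift right.
    const m = mant | 0x800000;
    const shift = 14 - e;
    let half = m >>> shift;
    const rem = m & ((1 << shift) - 1);
    const halfway = 1 << (shift - 1);
    if (rem > halfway || (rem === halfway && (half & 1))) half++;
    return sign | half;
  }
  let half = (e << 10) | (mant >>> 13);
  const rem = mant & 0x1fff;
  // A carry out of the mantissa correctly bumps the exponent.
  if (rem > 0x1000 || (rem === 0x1000 && (half & 1))) half++;
  return sign | half;
}

// Convert a whole Float32Array, e.g. a mel spectrogram, to half-precision bits.
function toFloat16Bits(values) {
  const out = new Uint16Array(values.length);
  for (let i = 0; i < values.length; i++) out[i] = toHalfBits(values[i]);
  return out;
}
```

The fp32 and q8 variants both take float32 input, so this step only applies when the fp16 graphs are loaded.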

Backend support

Backend	Status	Notes
onnxruntime-node (native)	✅ Primary	Persistent sessions, streaming lifecycle
WebGPU	✅ Validated	fp16, browser, ~24.5 s for 12 s audio
ORT Web/WASM	⚠️ Limited	~1.5 GB heap — sequential only, single session
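
One way an application might choose between these backends at startup is sketched below. This is an illustration, not library behavior: `pickBackend` is a hypothetical helper, and it relies only on the standard `process.versions.node` and `navigator.gpu` feature checks, following the priority order in the table.

```javascript
// Pick a backend label from the runtime environment.
// `env` defaults to globalThis; it is a parameter so the logic can be tested.
function pickBackend(env = globalThis) {
  // Node.js: use native ONNX Runtime (the primary backend above).
  const isNode = typeof env.process?.versions?.node === 'string';
  if (isNode) return 'native';

  // Browser with WebGPU exposed.
  const hasWebGpu = env.navigator?.gpu !== undefined;
  if (hasWebGpu) return 'webgpu';

  // Everything else falls back to WASM, which is limited to
  // sequential, single-session use by the ~1.5 GB heap.
  return 'wasm';
}

console.log(pickBackend());
```

Note that `navigator.gpu` being present does not guarantee a usable adapter, so a real implementation would likely also attempt to request one and fall back to WASM on failure.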

Smoke test

```bash
# Native ORT (fastest, persistent)
node tests/smoke/whisper-large-v3-turbo-native.mjs --mixed

# WASM (sequential, fallback)
node tests/smoke/whisper-large-v3-turbo-wasm.mjs

# Custom model directory
WHISPER_LARGE_DIR=/path/to/fp16 node tests/smoke/whisper-large-v3-turbo-native.mjs
```
