Whisper large-v3-turbo ONNX 4-graph
ONNX export of openai/whisper-large-v3-turbo as 4 split graphs for use with asrjs/speech-recognition.
Graph structure
| Graph | Input | Output | Purpose |
|---|---|---|---|
| `encoder` | mel spectrogram | encoder hidden states | Audio encoding |
| `decoder_init` | start token + encoder states | logits + KV cache | First decode step |
| `decoder_step` | prev token + encoder states + KV cache | logits + KV cache | Recurrent decode |
| `decoder_align` | encoder states + tokens | cross-attention weights | Word timestamps (DTW) |
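The split lets a caller run the encoder once per audio window and then drive the decoder incrementally. Below is a minimal greedy-decoding sketch of that control flow. It assumes hypothetical `WhisperGraphs` wrappers (`encoder`, `decoderInit`, `decoderStep`) that hide the real ONNX tensor names and shapes. It is an illustration, not the asrjs API.

```typescript
// Hypothetical wrappers around the exported graphs; real sessions exchange ONNX
// tensors whose names and shapes are not documented in this card.
type KvCache = Float32Array[]; // assumed: one buffer per cached attention tensor

interface DecoderOut {
  logits: Float32Array; // scores over the vocabulary for the next token
  cache: KvCache;       // updated KV cache to feed into the next step
}

interface WhisperGraphs {
  encoder(mel: Float32Array): Float32Array;
  decoderInit(startTokens: number[], encoderStates: Float32Array): DecoderOut;
  decoderStep(prevToken: number, encoderStates: Float32Array, cache: KvCache): DecoderOut;
}

// Index of the largest score (greedy token choice).
function argmax(xs: Float32Array): number {
  let best = 0;
  for (let i = 1; i < xs.length; i++) {
    if (xs[i] > xs[best]) best = i;
  }
  return best;
}

// Encoder once, decoder_init for the start tokens, then decoder_step per new token.
function greedyDecode(
  graphs: WhisperGraphs,
  mel: Float32Array,
  startTokens: number[], // e.g. start-of-transcript, language and task tokens
  eotToken: number,
  maxNewTokens = 224, // half of Whisper's 448-token text context
): number[] {
  const encoderStates = graphs.encoder(mel);
  let { logits, cache } = graphs.decoderInit(startTokens, encoderStates);
  const tokens: number[] = [];
  for (let i = 0; i < maxNewTokens; i++) {
    const next = argmax(logits);
    if (next === eotToken) break; // stop at end-of-text
    tokens.push(next);
    ({ logits, cache } = graphs.decoderStep(next, encoderStates, cache));
  }
  return tokens;
}
```

In this sketch `decoder_align` is never called: per the table it only serves word timestamps, so plain transcription needs just the other three graphs.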
fp16 fix
The original fp16 encoder stored its 1.2 GB of weights inline in the `.onnx` file, which:
- Failed in ORT Web/WASM with `std::bad_alloc` (the load exceeds the ~1.5 GB WASM heap)
- Blocked the persistent multi-session lifecycle needed for streaming
All fp16 graphs were converted to the ONNX external data format, which keeps the weights in a separate file next to each small `.onnx` graph:
| Graph | ONNX size | External data |
|---|---|---|
| encoder | 0.4 MB | 1.2 GB |
| decoder_init | 0.4 MB | 254 MB |
| decoder_step | 0.4 MB | 127 MB |
| decoder_align | 0.4 MB | 127 MB |
The fix was published as ysdede/whisper-large-v3-turbo-onnx-4graph-v2 (now private) and merged back here.
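As a rough illustration of the WASM constraint, the external-data sizes above can be summed against the ~1.5 GB heap. This sketch counts weights only, in decimal megabytes, and ignores activations, the KV cache and any copies ORT makes while loading, so fitting the budget is necessary but not sufficient. The helper names are hypothetical.

```typescript
// fp16 external-data sizes from the table above, in decimal MB (1.2 GB = 1200 MB).
const FP16_EXTERNAL_MB = {
  encoder: 1200,
  decoder_init: 254,
  decoder_step: 127,
  decoder_align: 127,
} as const;

type GraphName = keyof typeof FP16_EXTERNAL_MB;

// Approximate WASM heap ceiling quoted in this card.
const WASM_HEAP_MB = 1500;

// Weight megabytes that would be resident if these graphs were loaded at once.
function totalWeightsMb(graphs: GraphName[]): number {
  return graphs.reduce((sum, g) => sum + FP16_EXTERNAL_MB[g], 0);
}

// Weights-only check: ignores activations, KV cache and load-time copies.
function fitsTogether(graphs: GraphName[], budgetMb: number = WASM_HEAP_MB): boolean {
  return totalWeightsMb(graphs) <= budgetMb;
}

const allGraphs = Object.keys(FP16_EXTERNAL_MB) as GraphName[];
console.log(totalWeightsMb(allGraphs), fitsTogether(allGraphs)); // 1708 false
console.log(fitsTogether(["encoder"]));                          // true
```

Under these assumptions all four graphs together need 1708 MB, above the budget, while each graph alone fits. That is consistent with the WASM backend being limited to sequential, single-session use, although real peak memory is higher than the weights alone.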
Inference features
| Feature | Description |
|---|---|
| Greedy decoding | Argmax token selection, single pass |
| Beam search | Configurable beam size, best-of N, patience, length penalty |
| Temperature fallback | Temperature escalation 0.0 → 0.2 → 0.4 → 0.6 → 0.8 → 1.0 |
| Word timestamps | DTW alignment via decoder_align cross-attention weights |
| Language detection | Decoder probability over language tokens (first 30 s) |
| Token suppression | No-timestamps, silence tokens, custom suppress tokens |
| Context conditioning | Condition on previous text / prompt |
| Mixed precision | q8 encoder + fp32 decoder (encoder ~25% faster, no decoder overhead) |
| 3 backends | Native ORT, WebGPU (browser), WASM (fallback) |
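Word timestamps rely on dynamic time warping (DTW) over the cross-attention weights returned by `decoder_align`. The simplified sketch below assumes `attn[t][f]` is already a single token-by-frame weight matrix, for example averaged over alignment heads. The exact head selection and filtering used by asrjs are not described here. It also assumes each encoder frame covers 20 ms, as in Whisper's 1500 frames per 30 s window. `dtwPath` and `tokenStartTimes` are illustrative names.

```typescript
// Assumed frame duration: Whisper's encoder emits 1500 frames per 30 s window.
const SECONDS_PER_FRAME = 0.02;

// Classic DTW on a cost matrix: returns the monotonic [token, frame] path
// from (0, 0) to (n - 1, m - 1) with minimal cumulative cost.
function dtwPath(cost: number[][]): Array<[number, number]> {
  const n = cost.length;
  if (n === 0) return [];
  const m = cost[0].length;
  const acc = Array.from({ length: n + 1 }, () => new Array<number>(m + 1).fill(Infinity));
  // trace: 0 = diagonal, 1 = from previous token (i - 1), 2 = from previous frame (j - 1)
  const trace = Array.from({ length: n + 1 }, () => new Array<number>(m + 1).fill(-1));
  acc[0][0] = 0;
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      let best = acc[i - 1][j - 1];
      let dir = 0;
      if (acc[i - 1][j] < best) { best = acc[i - 1][j]; dir = 1; }
      if (acc[i][j - 1] < best) { best = acc[i][j - 1]; dir = 2; }
      acc[i][j] = cost[i - 1][j - 1] + best;
      trace[i][j] = dir;
    }
  }
  // Backtrack from the bottom-right corner to the origin.
  const path: Array<[number, number]> = [];
  let i = n;
  let j = m;
  while (i > 0 && j > 0) {
    path.push([i - 1, j - 1]);
    const dir = trace[i][j];
    if (dir === 0) { i--; j--; } else if (dir === 1) { i--; } else { j--; }
  }
  return path.reverse();
}

// Start time (seconds) of each token: the first frame DTW assigns to it.
function tokenStartTimes(attn: number[][]): number[] {
  const cost = attn.map((row) => row.map((w) => -w)); // high attention = low cost
  const starts = new Array<number>(attn.length).fill(NaN);
  for (const [t, f] of dtwPath(cost)) {
    if (Number.isNaN(starts[t])) starts[t] = f * SECONDS_PER_FRAME;
  }
  return starts;
}
```

A word's start time is then the start of its first token. The reference openai/whisper implementation additionally normalizes and median-filters the attention weights before running DTW on their negation.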
Precision variants
| Variant | Size | Input dtype | Speed vs fp32 (12 s audio, native ORT) |
|---|---|---|---|
| fp32 | 4.5 GB | float32 | 1.0× (baseline) |
| fp16 | 2.3 GB | float16 | Pending benchmark |
| q8 | 1.4 GB | float32 | ~1.25× (encoder) |
| Mixed (q8 enc + fp32 dec) | - | - | ~1.46× |
Backend support
| Backend | Status | Notes |
|---|---|---|
| onnxruntime-node (native) | ✅ Primary | Persistent sessions, streaming lifecycle |
| WebGPU | ✅ Validated | fp16, browser, ~24.5 s for 12 s audio |
| ORT Web/WASM | ⚠️ Limited | ~1.5 GB heap: sequential only, single session |
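The backend table implies a simple fallback order: native ORT under Node.js, WebGPU in browsers that expose it, and WASM otherwise. A sketch of that selection follows. `detectRuntime` and `pickBackend` are hypothetical helpers, not part of asrjs, and the `sequentialOnly` flag only encodes the WASM limitation stated above.

```typescript
type Backend = "native" | "webgpu" | "wasm";

interface RuntimeInfo {
  isNode: boolean;    // running under Node.js, where onnxruntime-node can be used
  hasWebGPU: boolean; // browser exposes navigator.gpu
}

// Feature detection without relying on DOM or Node type definitions.
function detectRuntime(): RuntimeInfo {
  const g = globalThis as any;
  return {
    isNode: typeof g.process?.versions?.node === "string",
    hasWebGPU: g.navigator?.gpu !== undefined,
  };
}

// Fallback order from the backend table: native > WebGPU > WASM.
function pickBackend(rt: RuntimeInfo): { backend: Backend; sequentialOnly: boolean } {
  if (rt.isNode) return { backend: "native", sequentialOnly: false };
  // The table lists no session limit for WebGPU; assumed false here.
  if (rt.hasWebGPU) return { backend: "webgpu", sequentialOnly: false };
  // WASM: ~1.5 GB heap, so load and run one session at a time.
  return { backend: "wasm", sequentialOnly: true };
}

console.log(pickBackend(detectRuntime()));
```

Passing `RuntimeInfo` in explicitly keeps the policy testable independently of where the code actually runs.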
Smoke test
```bash
# Native ORT (fastest, persistent)
node tests/smoke/whisper-large-v3-turbo-native.mjs --mixed
# WASM (sequential, fallback)
node tests/smoke/whisper-large-v3-turbo-wasm.mjs
# Custom model directory
WHISPER_LARGE_DIR=/path/to/fp16 node tests/smoke/whisper-large-v3-turbo-native.mjs
```
Links
- asrjs/speech-recognition: TypeScript speech recognition library