Whisper large-v3-turbo ONNX 4-graph

ONNX export of openai/whisper-large-v3-turbo as 4 split graphs for use with asrjs/speech-recognition.

Graph structure

| Graph | Input | Output | Purpose |
|---|---|---|---|
| encoder | mel spectrogram | encoder hidden states | Audio encoding |
| decoder_init | start token + encoder states | logits + KV cache | First decode step |
| decoder_step | prev token + encoder states + KV cache | logits + KV cache | Recurrent decode |
| decoder_align | encoder states + tokens | cross-attention weights | Word timestamps (DTW) |
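
To make the data flow between the graphs concrete, here is a minimal sketch of a greedy decode loop that chains them. The `graphs` object, its method names (`encoder`, `decoderInit`, `decoderStep`) and their return shapes are hypothetical stand-ins for ONNX session calls, not the asrjs/speech-recognition API. The `maxTokens` default of 448 is Whisper's decoder context length and is used here only as an assumed cap.

```javascript
// Sketch only: `graphs` holds hypothetical async functions, one per ONNX graph.
// Names and return shapes are assumptions for illustration, not the asrjs API.

// Greedy token selection with token suppression: banned ids are skipped.
function argmax(logits, suppress = []) {
  const blocked = new Set(suppress);
  let best = -1;
  let bestScore = -Infinity;
  for (let i = 0; i < logits.length; i++) {
    if (blocked.has(i)) continue;
    if (logits[i] > bestScore) {
      bestScore = logits[i];
      best = i;
    }
  }
  return best;
}

async function greedyDecode(graphs, mel, options) {
  const { startToken, eotToken, maxTokens = 448, suppress = [] } = options;

  // 1. encoder: mel spectrogram -> encoder hidden states (once per audio window)
  const encoderStates = await graphs.encoder(mel);

  // 2. decoder_init: start token + encoder states -> logits + fresh KV cache
  let { logits, kvCache } = await graphs.decoderInit(startToken, encoderStates);

  const tokens = [];
  while (tokens.length < maxTokens) {
    const next = argmax(logits, suppress);
    if (next === eotToken) break;
    tokens.push(next);
    // 3. decoder_step: prev token + encoder states + KV cache -> logits + updated KV cache
    ({ logits, kvCache } = await graphs.decoderStep(next, encoderStates, kvCache));
  }

  // tokens and encoderStates are what decoder_align would consume for word timestamps
  return { tokens, encoderStates };
}
```

In this sketch, decoder_align is not part of the loop: it would run once afterwards on the returned `encoderStates` and `tokens` to produce the cross-attention weights used for DTW.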

fp16 fix

The original fp16 encoder had 1.2 GB inline weights, which:

  • Failed in ORT Web/WASM (std::bad_alloc: exceeds the ~1.5 GB WASM heap)
  • Blocked persistent multi-session lifecycle (needed for streaming)

All fp16 graphs were converted to external data format:

| Graph | ONNX size | External data |
|---|---|---|
| encoder | 0.4 MB | 1.2 GB |
| decoder_init | 0.4 MB | 254 MB |
| decoder_step | 0.4 MB | 127 MB |
| decoder_align | 0.4 MB | 127 MB |
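
As a rough sanity check on these numbers, the sketch below sums the external-data sizes of whichever graphs would be resident at the same time and compares the total with the ~1.5 GB WASM heap figure. It counts weights only and ignores activations and runtime overhead, so a passing check is necessary but not sufficient. The function names and the rounding of 1.2 GB to 1200 MB are assumptions made for illustration.

```javascript
// External-data sizes in MB, copied from the table above (1.2 GB taken as ~1200 MB).
const FP16_EXTERNAL_MB = {
  encoder: 1200,
  decoder_init: 254,
  decoder_step: 127,
  decoder_align: 127,
};

// Approximate WASM heap ceiling cited in this README (~1.5 GB), in MB.
const WASM_HEAP_MB = 1500;

// Total weight size of the graphs that would be loaded together.
function residentMB(graphNames, sizes = FP16_EXTERNAL_MB) {
  return graphNames.reduce((sum, name) => sum + sizes[name], 0);
}

// Weights-only check: true if the listed graphs fit under the heap ceiling.
function fitsInHeap(graphNames, heapMB = WASM_HEAP_MB) {
  return residentMB(graphNames) <= heapMB;
}

console.log(fitsInHeap(["encoder"]));                      // true
console.log(fitsInHeap(["decoder_init", "decoder_step"])); // true
console.log(fitsInHeap(Object.keys(FP16_EXTERNAL_MB)));    // false (1708 MB)
```

Under these assumptions, all four fp16 graphs together exceed the heap, which is consistent with running them one after another in the WASM backend.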

The fix was published as ysdede/whisper-large-v3-turbo-onnx-4graph-v2 (now private) and merged back here.

Inference features

| Feature | Description |
|---|---|
| Greedy decoding | Argmax token selection, single pass |
| Beam search | Configurable beam size, best-of N, patience, length penalty |
| Temperature fallback | Temperature escalation 0.0 → 0.2 → 0.4 → 0.6 → 0.8 → 1.0 |
| Word timestamps | DTW alignment via decoder_align cross-attention weights |
| Language detection | Decoder probability over language tokens (first 30 s) |
| Token suppression | No-timestamps, silence tokens, custom suppress tokens |
| Context conditioning | Condition on previous text / prompt |
| Mixed precision | q8 encoder + fp32 decoder (encoder ~25% faster, no decoder overhead) |
| 3 backends | Native ORT, WebGPU (browser), WASM (fallback) |
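
The temperature fallback row can be read as a retry loop: decode at the lowest temperature and escalate only when the result looks degenerate. The sketch below uses the schedule from the table. The acceptance thresholds (compression ratio at most 2.4, average log-probability at least -1.0) are the defaults of OpenAI's reference Whisper implementation and are an assumption here, since this README does not say which checks asrjs applies. `decodeAt` is a hypothetical callback that runs one full decode pass.

```javascript
// Temperature schedule from the feature table above.
const TEMPERATURES = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0];

// Assumed quality checks, borrowed from OpenAI's reference Whisper defaults.
// Not confirmed for asrjs; adjust to whatever the runtime actually measures.
function isAcceptable(result, limits = {}) {
  const { maxCompressionRatio = 2.4, minAvgLogProb = -1.0 } = limits;
  return (
    result.compressionRatio <= maxCompressionRatio &&
    result.avgLogProb >= minAvgLogProb
  );
}

// `decodeAt(temperature)` is a hypothetical async callback that returns
// { text, compressionRatio, avgLogProb } for one decode pass.
async function decodeWithFallback(decodeAt, temperatures = TEMPERATURES) {
  let last = null;
  for (const temperature of temperatures) {
    last = { ...(await decodeAt(temperature)), temperature };
    if (isAcceptable(last)) return last; // stop escalating at the first good pass
  }
  return last; // every pass failed the checks; keep the final attempt
}
```

Temperature 0.0 corresponds to plain greedy (argmax) decoding, so the common case costs a single pass and the higher temperatures are only paid for on difficult segments.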

Precision variants

| Variant | Size | Input dtype | Speed (12 s audio, native ORT) |
|---|---|---|---|
| fp32 | 4.5 GB | float32 | 1.0× (baseline) |
| fp16 | 2.3 GB | float16 | Pending benchmark |
| q8 | 1.4 GB | float32 | ~1.25× vs fp32 (encoder) |
| Mixed (q8 enc + fp32 dec) | - | - | ~1.46× vs fp32 |
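
Because the fp16 variant expects float16 input while the mel spectrogram is usually computed as float32, the features have to be converted before they are fed to the encoder. The sketch below is a plain-JavaScript float32 to float16 conversion with round-to-nearest-even. It assumes the runtime accepts float16 tensor data as raw 16-bit patterns in a `Uint16Array`; the helper names are hypothetical and not part of asrjs.

```javascript
// Scratch buffers used to read the raw bits of a float32 value.
const f32 = new Float32Array(1);
const u32 = new Uint32Array(f32.buffer);

// Convert one number to its IEEE 754 half-precision bit pattern.
function toHalfBits(value) {
  f32[0] = value;
  const x = u32[0];
  const sign = (x >>> 16) & 0x8000;
  const exp = (x >>> 23) & 0xff;
  let mant = x & 0x7fffff;

  if (exp === 0xff) return sign | 0x7c00 | (mant ? 0x200 : 0); // Inf or NaN
  const e = exp - 127 + 15; // re-bias the exponent for float16
  if (e >= 0x1f) return sign | 0x7c00; // too large: overflow to Inf
  if (e <= 0) {
    if (e < -10) return sign; // too small: flush to signed zero
    mant |= 0x800000; // restore the implicit leading 1
    const shift = 14 - e;
    let sub = mant >> shift; // subnormal half mantissa
    const rem = mant & ((1 << shift) - 1);
    const halfway = 1 << (shift - 1);
    if (rem > halfway || (rem === halfway && (sub & 1))) sub++; // nearest even
    return sign | sub;
  }
  let half = (e << 10) | (mant >> 13);
  const rem = mant & 0x1fff;
  // Round to nearest even; a carry into the exponent field is the correct result.
  if (rem > 0x1000 || (rem === 0x1000 && (half & 1))) half++;
  return sign | half;
}

// Convert a whole float32 feature buffer (for example a mel spectrogram).
function toFloat16Bits(float32Array) {
  const out = new Uint16Array(float32Array.length);
  for (let i = 0; i < float32Array.length; i++) out[i] = toHalfBits(float32Array[i]);
  return out;
}
```

The q8 variant keeps float32 input according to the table, so this conversion would only be needed on the fp16 path.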

Backend support

| Backend | Status | Notes |
|---|---|---|
| onnxruntime-node (native) | ✅ Primary | Persistent sessions, streaming lifecycle |
| WebGPU | ✅ Validated | fp16, browser, ~24.5 s for 12 s audio |
| ORT Web/WASM | ⚠️ Limited | ~1.5 GB heap; sequential only, single session |
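
One way to act on this table is a small selection function that prefers native ORT, then WebGPU, then WASM. The sketch below is an assumption about how a caller might encode that preference order; the `env` object and the returned fields are hypothetical and not part of asrjs.

```javascript
// `env` is a hypothetical capability description supplied by the caller:
//   isNode    - running under Node.js with onnxruntime-node available
//   hasWebGPU - the browser exposes WebGPU (navigator.gpu)
function chooseBackend(env) {
  if (env.isNode) {
    // Primary backend: persistent sessions, streaming lifecycle.
    return { backend: "native", sequential: false };
  }
  if (env.hasWebGPU) {
    // Validated in the browser with the fp16 variant.
    return { backend: "webgpu", precision: "fp16", sequential: false };
  }
  // Fallback: limited heap, so load and run one graph at a time.
  return { backend: "wasm", sequential: true };
}

// Example capability probe; both checks use standard globals.
const detected = {
  isNode: typeof process !== "undefined" && !!(process.versions && process.versions.node),
  hasWebGPU: typeof navigator !== "undefined" && "gpu" in navigator,
};
console.log(chooseBackend(detected).backend);
```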

Smoke test

# Native ORT (fastest, persistent)
node tests/smoke/whisper-large-v3-turbo-native.mjs --mixed

# WASM (sequential, fallback)
node tests/smoke/whisper-large-v3-turbo-wasm.mjs

# Custom model directory
WHISPER_LARGE_DIR=/path/to/fp16 node tests/smoke/whisper-large-v3-turbo-native.mjs

Links
