DeepSeek V4 Flash MLX Q3 Mixed

This is an MLX conversion of deepseek-ai/DeepSeek-V4-Flash.

Source

Base model: deepseek-ai/DeepSeek-V4-Flash
Source revision: 6e763230a9d263eca2023f1d4a5ce1bfe126cf48
Architecture: DeepseekV4ForCausalLM
Model type: deepseek_v4

Conversion Recipe

Tooling: Thump604/mlx-lm, branch deepseek-v4-support-fixes
Minimum tooling commit for generation: 9c990f4
Output path during conversion: /Volumes/Lexar/mlx_models/DeepSeek-V4-Flash-MLX-Q3-mixed-gs128-affine
Quantization recipe: mixed_3_6
Quantization mode: affine
Group size: 128
Effective bits per weight reported by MLX: 3.808
Shards: 28
Indexed MLX tensor size: 135,346,422,876 bytes
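As a rough cross-check of the figures above, the reported effective bits per weight and indexed tensor size can be combined to estimate the logical weight count. This is a sketch that assumes "effective bits per weight" means total stored bits divided by the number of logical weights; the variable names are illustrative only.

```python
# Figures copied from this card.
indexed_bytes = 135_346_422_876
effective_bpw = 3.808
shards = 28

# Assumption: effective bits per weight = total stored bits / logical weights.
total_bits = indexed_bytes * 8
implied_weights = total_bits / effective_bpw

# Average shard size in decimal gigabytes.
avg_shard_gb = indexed_bytes / shards / 1e9

print(f"implied logical weights: {implied_weights / 1e9:.1f}B")
print(f"average shard size:      {avg_shard_gb:.2f} GB")
```

Under that assumption the implied weight count lands in the 284B range.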

The mixed recipe applies 3-bit affine quantization to the lower-risk routed expert paths and 6-bit affine quantization to the sensitive paths: embeddings, LM head, attention projections, compressed-attention/indexer components, shared experts, and selected down projections.
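To illustrate why the sensitive paths get more bits, here is a minimal pure-Python sketch of group-wise affine quantization, where each group stores integer codes plus one scale and one bias. It is not the MLX kernel or the conversion code used for this artifact. The helper names are hypothetical, and the 16-bit scale and bias sizes in `stored_bits_per_weight` are an assumption.

```python
import math

def quantize_group(values, bits):
    """Affine-quantize one group so that value ~= scale * q + bias."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant group
    q = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return q, scale, lo

def dequantize_group(q, scale, bias):
    """Reconstruct approximate values from integer codes."""
    return [scale * x + bias for x in q]

def max_error(values, bits):
    """Largest absolute reconstruction error for one group."""
    q, scale, bias = quantize_group(values, bits)
    restored = dequantize_group(q, scale, bias)
    return max(abs(a - b) for a, b in zip(values, restored))

def stored_bits_per_weight(bits, group_size, meta_bits=16):
    """Code bits plus per-group scale and bias (assumed meta_bits each)."""
    return bits + 2 * meta_bits / group_size

# One synthetic group of 128 values, matching the group size in this recipe.
group = [math.sin(0.37 * i) for i in range(128)]
err3 = max_error(group, 3)
err6 = max_error(group, 6)
print(f"3-bit max error: {err3:.4f}, 6-bit max error: {err6:.4f}")
print(f"3-bit storage: {stored_bits_per_weight(3, 128)} bits/weight")
print(f"6-bit storage: {stored_bits_per_weight(6, 128)} bits/weight")
```

With group size 128 and the assumed 16-bit scale and bias, the per-group metadata adds 0.25 bits per weight, so 3-bit and 6-bit tensors would cost about 3.25 and 6.25 bits per weight respectively. The reported 3.808 average falls between those two values.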

Validation

Conversion completed successfully.
Lazy MLX load completed successfully on a 128GB Mac Studio.
A one-token generation smoke test was attempted and stopped after memory pressure and swap activity exceeded the local safety boundary. Treat this artifact as converted and load-validated, not generation-qualified, on 128GB Apple Silicon.
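The memory pressure is consistent with simple arithmetic on the figures in this card. The sketch below assumes "128GB" means 128 GiB of unified memory and ignores everything other than the indexed weights (KV cache, activations, the OS, and any GPU wired-memory limit), so real headroom would be smaller.

```python
# Indexed MLX tensor size reported in this card.
indexed_bytes = 135_346_422_876

# Assumption: "128GB" of unified memory means 128 GiB.
ram_bytes = 128 * 2**30

headroom_bytes = ram_bytes - indexed_bytes
fraction = indexed_bytes / ram_bytes

print(f"weights:  {indexed_bytes / 2**30:.2f} GiB ({fraction:.1%} of RAM)")
print(f"headroom: {headroom_bytes / 2**30:.2f} GiB before any runtime overhead")
```

A lazy load does not need to materialize every tensor at once, which may explain why loading succeeded while generation did not.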

Notes

DeepSeek V4 support in MLX is still under active development. This artifact was produced with local DeepSeek V4 support fixes, including FP4/FP8 checkpoint handling, F8_E8M0 scale metadata reinterpretation as raw uint8 exponent bytes before sanitizer decode, attention sink dtype handling, and quantized grouped output projection support.
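For readers unfamiliar with the F8_E8M0 point above: E8M0 is an exponent-only 8-bit format, so each scale byte is a biased power-of-two exponent rather than a float with mantissa bits. The sketch below decodes raw uint8 bytes under the OCP Microscaling convention (bias 127, 0xFF reserved for NaN). It is an illustration, not the code from the tooling branch, and `decode_e8m0` is a hypothetical helper name.

```python
import math

def decode_e8m0(byte):
    """Decode one E8M0 scale byte as 2 ** (byte - 127); 0xFF encodes NaN."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("E8M0 values are single unsigned bytes")
    if byte == 0xFF:
        return math.nan
    return math.ldexp(1.0, byte - 127)

# Treat the stored scales as raw uint8 exponent bytes.
raw = bytes([126, 127, 128, 130])
scales = [decode_e8m0(b) for b in raw]
print(scales)  # [0.5, 1.0, 2.0, 8.0]
```

Reading the same bytes as any float8 format with mantissa bits would produce different values, which is why the raw-byte interpretation matters before decoding.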

Safetensors

Model size: 284B params
Tensor types: BF16, U32, F32, I64