GPT-OSS 20B β€” NVFP4 (Expert-Selective, Thinking Opt-In)

Repository: FreedomAISVR/gpt-oss-20B-NVFP4-GGUF
Source model: openai/gpt-oss-20b
Quantization: NVFP4 experts + Q8_0 non-experts (Blackwell-optimized)

Model Details

GPT-OSS is a 20B-parameter mixture-of-experts (MoE) language model developed by OpenAI, with 2.8B active parameters per token. It uses a 128-expert MoE layer (top-2 routing) with a 28-layer transformer architecture. This is an NVFP4 + Q8_0 hybrid β€” MoE expert weights are NVFP4, all other tensors are Q8_0.

Architecture

Parameter Value
Total parameters 20.2B
Active parameters 2.8B
Layers 28
Attention heads 36
KV heads 6 (Grouped-Query Attention)
Hidden dimension 2880
Intermediate dimension 7680
MoE experts 128 (top-2 routing)
Context length 131,072 tokens
Vocabulary size 200,064
Position encoding Rotary (RoPE, base=10,000)

Recommended Inference Configuration

{
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 32768
}

Quantization Details

This repository uses a hybrid quantization approach:

  • NVFP4 quantized: MoE expert weights (72 tensors: ffn_gate_exps, ffn_up_exps, ffn_down_exps β€” 3 per block Γ— 24 blocks, each 142 MiB)
  • Q8_0 quantized: All non-expert tensors β€” attention projections, router, embeddings, layer norms, LM head, biases (387 tensors)

The NVFP4 expert weights benefit from Blackwell GPU hardware acceleration for 4-bit matrix multiplication. Q8_0 for non-expert tensors provides a good balance between quality and size. OpenAI's GPT-OSS was post-trained with MXFP4 quantization baked into expert weights, so these are requantized from MXFP4 to NVFP4.

Tensor Group Tensor Count Source Type Quantized Type
MoE expert weights 72 MXFP4 NVFP4 (4-bit)
Attention projections 84 F16 Q8_0 (8-bit)
Router weights 28 F32 Q8_0 (8-bit)
Layer norms 57 F32 Q8_0 (8-bit)
Embeddings + LM head 2 F16 Q8_0 (8-bit)
Biases + output norm 161 F32 Q8_0 (8-bit)
Attention sinks 24 F32 Q8_0 (8-bit)

File Details

File Size BPW Description
gpt-oss-20b-nvfp4.gguf 11.83 GB 4.86 NVFP4 experts + Q8_0 non-experts hybrid

Performance

On NVIDIA Blackwell GPUs (RTX 5060 Ti and higher), the NVFP4 expert weights benefit from hardware-accelerated 4-bit matrix multiplication, while non-expert tensors run at Q8_0 throughput.

Chat Template

The chat template uses opt-in reasoning β€” reasoning_effort is only applied when explicitly set by the user. This matches the original OpenAI model behavior where no "Reasoning:" instruction is injected into the system prompt.

// No reasoning instruction by default
// Set for explicit control:
reasoning_effort: "low"    // minimal chain-of-thought
reasoning_effort: "medium" // balanced reasoning
reasoning_effort: "high"   // thorough reasoning

Compatibility

This GGUF file is compatible with:

  • llama.cpp (commit b93186b or later)
  • LM Studio (0.3.10 or later)
  • Ollama, text-generation-webui, and other GGUF-compatible inference engines

Blackwell GPU recommended for NVFP4 hardware acceleration.

License

Apache 2.0 (same as the original OpenAI GPT-OSS model)

Downloads last month
1,530
GGUF
Model size
21B params
Architecture
gpt-oss
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for FreedomAISVR/gpt-oss-20B-NVFP4-GGUF

Quantized
(203)
this model