gpt-oss-20B-MXFP4-MOE-GGUF

MXFP4_MOE GGUF quantization of openai/gpt-oss-20b.

About MXFP4_MOE

MXFP4_MOE is the MoE variant of MXFP4 (OCP Microscaling FP4 E2M1 format). It applies a mixed-precision quantization:

Tensor Group Quantization Bits/Param
Expert FFN weights MXFP4 (block FP4) ~4.25
Attention weights Q8_0 8
Router/norms/biases F32 32

Total: ~4.63 bits/param β€” the file is 11.28 GiB for 20.91B total parameters.

MXFP4 runs on any GPU or CPU (unlike NVFP4 which requires Blackwell).

Files

Filename Type Size Description
gpt-oss-20B-MXFP4_MOE.gguf MXFP4_MOE 11.28 GiB Main model weights

Usage

llama.cpp CLI

llama-cli -hf FreedomAISVR/gpt-oss-20B-MXFP4-MOE-GGUF -cnv -p "You are a helpful assistant"

llama-server

llama-server -hf FreedomAISVR/gpt-oss-20B-MXFP4-MOE-GGUF --ctx-size 0 --jinja

llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="FreedomAISVR/gpt-oss-20B-MXFP4-MOE-GGUF",
    filename="gpt-oss-20B-MXFP4_MOE.gguf",
)

Quantization Pipeline

# 1. Convert HF model to intermediate GGUF
python convert_hf_to_gguf.py ./models/gpt-oss-20b/ --outfile gpt-oss-20b-f16.gguf --outtype f16

# 2. Quantize to MXFP4_MOE (requires --allow-requantize due to converter auto-type assignment)
llama-quantize --allow-requantize gpt-oss-20b-f16.gguf gpt-oss-20b-mxfp4_moe.gguf MXFP4_MOE

Hardware

GPU VRAM Notes
NVIDIA RTX 5060 Ti 16 GB Quantization performed on this GPU

License

Apache-2.0 (same as openai/gpt-oss-20b)

Downloads last month
945
GGUF
Model size
21B params
Architecture
gpt-oss
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for FreedomAISVR/gpt-oss-20B-MXFP4-MOE-GGUF

Quantized
(203)
this model