How to use from
llama.cpp
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FreedomAISVR/gpt-oss-20B-MXFP4-MOE-GGUF:MXFP4_MOE
# Run inference directly in the terminal:
llama-cli -hf FreedomAISVR/gpt-oss-20B-MXFP4-MOE-GGUF:MXFP4_MOE
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FreedomAISVR/gpt-oss-20B-MXFP4-MOE-GGUF:MXFP4_MOE
# Run inference directly in the terminal:
llama-cli -hf FreedomAISVR/gpt-oss-20B-MXFP4-MOE-GGUF:MXFP4_MOE
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf FreedomAISVR/gpt-oss-20B-MXFP4-MOE-GGUF:MXFP4_MOE
# Run inference directly in the terminal:
./llama-cli -hf FreedomAISVR/gpt-oss-20B-MXFP4-MOE-GGUF:MXFP4_MOE
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf FreedomAISVR/gpt-oss-20B-MXFP4-MOE-GGUF:MXFP4_MOE
# Run inference directly in the terminal:
./build/bin/llama-cli -hf FreedomAISVR/gpt-oss-20B-MXFP4-MOE-GGUF:MXFP4_MOE
Use Docker
docker model run hf.co/FreedomAISVR/gpt-oss-20B-MXFP4-MOE-GGUF:MXFP4_MOE
Quick Links

gpt-oss-20B-MXFP4-MOE-GGUF

MXFP4_MOE GGUF quantization of openai/gpt-oss-20b.

About MXFP4_MOE

MXFP4_MOE is the MoE variant of MXFP4 (OCP Microscaling FP4 E2M1 format). It applies a mixed-precision quantization:

Tensor Group Quantization Bits/Param
Expert FFN weights MXFP4 (block FP4) ~4.25
Attention weights Q8_0 8
Router/norms/biases F32 32

Total: ~4.63 bits/param β€” the file is 11.28 GiB for 20.91B total parameters.

MXFP4 runs on any GPU or CPU (unlike NVFP4 which requires Blackwell).

Files

Filename Type Size Description
gpt-oss-20B-MXFP4_MOE.gguf MXFP4_MOE 11.28 GiB Main model weights

Usage

llama.cpp CLI

llama-cli -hf FreedomAISVR/gpt-oss-20B-MXFP4-MOE-GGUF -cnv -p "You are a helpful assistant"

llama-server

llama-server -hf FreedomAISVR/gpt-oss-20B-MXFP4-MOE-GGUF --ctx-size 0 --jinja

llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="FreedomAISVR/gpt-oss-20B-MXFP4-MOE-GGUF",
    filename="gpt-oss-20B-MXFP4_MOE.gguf",
)

Quantization Pipeline

# 1. Convert HF model to intermediate GGUF
python convert_hf_to_gguf.py ./models/gpt-oss-20b/ --outfile gpt-oss-20b-f16.gguf --outtype f16

# 2. Quantize to MXFP4_MOE (requires --allow-requantize due to converter auto-type assignment)
llama-quantize --allow-requantize gpt-oss-20b-f16.gguf gpt-oss-20b-mxfp4_moe.gguf MXFP4_MOE

Hardware

GPU VRAM Notes
NVIDIA RTX 5060 Ti 16 GB Quantization performed on this GPU

License

Apache-2.0 (same as openai/gpt-oss-20b)

Downloads last month
945
GGUF
Model size
21B params
Architecture
gpt-oss
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for FreedomAISVR/gpt-oss-20B-MXFP4-MOE-GGUF

Quantized
(203)
this model