Instructions to use FreedomAISVR/gpt-oss-20B-NVFP4-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use FreedomAISVR/gpt-oss-20B-NVFP4-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="FreedomAISVR/gpt-oss-20B-NVFP4-GGUF",
	filename="gpt-oss-20b-nvfp4.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use FreedomAISVR/gpt-oss-20B-NVFP4-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FreedomAISVR/gpt-oss-20B-NVFP4-GGUF:NVFP4
# Run inference directly in the terminal:
llama-cli -hf FreedomAISVR/gpt-oss-20B-NVFP4-GGUF:NVFP4

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FreedomAISVR/gpt-oss-20B-NVFP4-GGUF:NVFP4
# Run inference directly in the terminal:
llama-cli -hf FreedomAISVR/gpt-oss-20B-NVFP4-GGUF:NVFP4

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf FreedomAISVR/gpt-oss-20B-NVFP4-GGUF:NVFP4
# Run inference directly in the terminal:
./llama-cli -hf FreedomAISVR/gpt-oss-20B-NVFP4-GGUF:NVFP4

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf FreedomAISVR/gpt-oss-20B-NVFP4-GGUF:NVFP4
# Run inference directly in the terminal:
./build/bin/llama-cli -hf FreedomAISVR/gpt-oss-20B-NVFP4-GGUF:NVFP4

Use Docker

docker model run hf.co/FreedomAISVR/gpt-oss-20B-NVFP4-GGUF:NVFP4

LM Studio
Jan
Ollama
How to use FreedomAISVR/gpt-oss-20B-NVFP4-GGUF with Ollama:
```
ollama run hf.co/FreedomAISVR/gpt-oss-20B-NVFP4-GGUF:NVFP4
```

Unsloth Studio

How to use FreedomAISVR/gpt-oss-20B-NVFP4-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for FreedomAISVR/gpt-oss-20B-NVFP4-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for FreedomAISVR/gpt-oss-20B-NVFP4-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open /spaces/unsloth/studio in your browser
# Search for FreedomAISVR/gpt-oss-20B-NVFP4-GGUF to start chatting

How to use FreedomAISVR/gpt-oss-20B-NVFP4-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf FreedomAISVR/gpt-oss-20B-NVFP4-GGUF:NVFP4

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "FreedomAISVR/gpt-oss-20B-NVFP4-GGUF:NVFP4"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use FreedomAISVR/gpt-oss-20B-NVFP4-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf FreedomAISVR/gpt-oss-20B-NVFP4-GGUF:NVFP4

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default FreedomAISVR/gpt-oss-20B-NVFP4-GGUF:NVFP4

Run Hermes

hermes

Atomic Chat new
Docker Model Runner
How to use FreedomAISVR/gpt-oss-20B-NVFP4-GGUF with Docker Model Runner:
```
docker model run hf.co/FreedomAISVR/gpt-oss-20B-NVFP4-GGUF:NVFP4
```

Lemonade

How to use FreedomAISVR/gpt-oss-20B-NVFP4-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull FreedomAISVR/gpt-oss-20B-NVFP4-GGUF:NVFP4

Run and chat with the model

lemonade run user.gpt-oss-20B-NVFP4-GGUF-NVFP4

List all available models

lemonade list

GPT-OSS 20B — NVFP4 (Expert-Selective, Thinking Opt-In)

Repository: FreedomAISVR/gpt-oss-20B-NVFP4-GGUF
Source model: openai/gpt-oss-20b
Quantization: NVFP4 experts + Q8_0 non-experts (Blackwell-optimized)

Model Details

GPT-OSS is a 20B-parameter mixture-of-experts (MoE) language model developed by OpenAI, with 2.8B active parameters per token. It uses a 128-expert MoE layer (top-2 routing) with a 28-layer transformer architecture. This is an NVFP4 + Q8_0 hybrid — MoE expert weights are NVFP4, all other tensors are Q8_0.

Architecture

Parameter	Value
Total parameters	20.2B
Active parameters	2.8B
Layers	28
Attention heads	36
KV heads	6 (Grouped-Query Attention)
Hidden dimension	2880
Intermediate dimension	7680
MoE experts	128 (top-2 routing)
Context length	131,072 tokens
Vocabulary size	200,064
Position encoding	Rotary (RoPE, base=10,000)

Recommended Inference Configuration

{
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 32768
}

Quantization Details

This repository uses a hybrid quantization approach:

NVFP4 quantized: MoE expert weights (72 tensors: ffn_gate_exps, ffn_up_exps, ffn_down_exps — 3 per block × 24 blocks, each 142 MiB)
Q8_0 quantized: All non-expert tensors — attention projections, router, embeddings, layer norms, LM head, biases (387 tensors)

The NVFP4 expert weights benefit from Blackwell GPU hardware acceleration for 4-bit matrix multiplication. Q8_0 for non-expert tensors provides a good balance between quality and size. OpenAI's GPT-OSS was post-trained with MXFP4 quantization baked into expert weights, so these are requantized from MXFP4 to NVFP4.

Tensor Group	Tensor Count	Source Type	Quantized Type
MoE expert weights	72	MXFP4	NVFP4 (4-bit)
Attention projections	84	F16	Q8_0 (8-bit)
Router weights	28	F32	Q8_0 (8-bit)
Layer norms	57	F32	Q8_0 (8-bit)
Embeddings + LM head	2	F16	Q8_0 (8-bit)
Biases + output norm	161	F32	Q8_0 (8-bit)
Attention sinks	24	F32	Q8_0 (8-bit)

File Details

File	Size	BPW	Description
`gpt-oss-20b-nvfp4.gguf`	11.83 GB	4.86	NVFP4 experts + Q8_0 non-experts hybrid

Performance

On NVIDIA Blackwell GPUs (RTX 5060 Ti and higher), the NVFP4 expert weights benefit from hardware-accelerated 4-bit matrix multiplication, while non-expert tensors run at Q8_0 throughput.

Chat Template

The chat template uses opt-in reasoning — reasoning_effort is only applied when explicitly set by the user. This matches the original OpenAI model behavior where no "Reasoning:" instruction is injected into the system prompt.

// No reasoning instruction by default
// Set for explicit control:
reasoning_effort: "low"    // minimal chain-of-thought
reasoning_effort: "medium" // balanced reasoning
reasoning_effort: "high"   // thorough reasoning

Compatibility

This GGUF file is compatible with:

llama.cpp (commit b93186b or later)
LM Studio (0.3.10 or later)
Ollama, text-generation-webui, and other GGUF-compatible inference engines

Blackwell GPU recommended for NVFP4 hardware acceleration.

License

Apache 2.0 (same as the original OpenAI GPT-OSS model)

Downloads last month: 1,530

GGUF

Model size

21B params

Architecture

gpt-oss

Hardware compatibility

4-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for FreedomAISVR/gpt-oss-20B-NVFP4-GGUF

Base model

openai/gpt-oss-20b

Quantized

(203)

this model