Qwen3.6-27B-Omnimerge-v4: IQ4_XS (Mixed-Bit, 12.76 GiB)

Runtime note: This quant was built and tested with ikawrakow's ik_llama.cpp fork. It has not been tested on the mainline ggml/llama.cpp. For best results and full feature support (mixed-bit quant loading), use ik_llama.cpp.

Base model: OmniMerge-v4 on Qwen3.6-27B
Size: 12.76 GiB (13.7 GB, 13,054 MiB on disk) — 4.06 bpw
VRAM: Fits 16 GB GPUs comfortably with 32k+ context
License: Apache-2.0

A custom importance-matrix-guided mixed-bit quantization of the OmniMerge-v4 merge. It achieves lower perplexity than the official uniform IQ4_XS release (6.864 vs 6.919) while being 1.3 GiB smaller, a clear win for targeted bit allocation at this size tier.

Results

Perplexity (wiki.test.raw, 580 chunks, n_ctx=512)

Model	PPL	Size	Δ from official IQ4_XS
IQ4_XS (mixed-bit, ours)	6.864 ± 0.045	12.76 GiB	−0.055
IQ4_XS (uniform, official)	6.919 ± 0.045	14.05 GiB	—
F16 (estimated)	~6.70	50.11 GiB	—

Relative to the estimated F16 baseline (~6.70), the quants lose ~0.16 PPL (mixed-bit) and ~0.22 PPL (uniform).
The mixed-bit recipe recovers ~25% of the uniform IQ4_XS loss while using 9% less disk space.
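
The percentages above follow directly from the table. A minimal sketch of the arithmetic, using only the table values (the F16 perplexity is the card's estimate, not a measurement, so the derived gaps inherit that uncertainty):

```python
# Values copied from the perplexity table; the F16 figure is an estimate.
ppl_mixed = 6.864
ppl_uniform = 6.919
ppl_f16_est = 6.70
size_mixed_gib = 12.76
size_uniform_gib = 14.05

# Gap of each quant to the estimated F16 baseline.
loss_mixed = ppl_mixed - ppl_f16_est
loss_uniform = ppl_uniform - ppl_f16_est

# Share of the uniform quant's loss that the mixed-bit recipe wins back.
recovered = (ppl_uniform - ppl_mixed) / loss_uniform

# Relative file-size saving versus the uniform quant.
size_saving = 1 - size_mixed_gib / size_uniform_gib

print(f"PPL gap (mixed-bit): {loss_mixed:.3f}")
print(f"PPL gap (uniform):   {loss_uniform:.3f}")
print(f"Loss recovered:      {recovered:.1%}")
print(f"Size saving:         {size_saving:.1%}")
```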

KLD vs Upstream IQ4_XS

Metric	Value	Interpretation
Mean KLD	0.0371 ± 0.0015	Low — distributions are very similar
Same top-p	92.31% ± 0.17%	92% of tokens agree on the most likely next token
Correlation	99.25%	Confidence scores move almost perfectly in sync
Mean Δp	−0.026% ± 0.033%	No systematic bias (neither quant is consistently over/under-confident)
RMS Δp	5.23% ± 0.13%	Root-mean-square per-token probability difference is ~5%

Percentile breakdown of per-token KLD (over all tokens, not only the ~8% where the top token differs):

Percentile	KLD threshold	Meaning
50% (median)	0.013	Half of all tokens have negligible KLD
90%	0.063	90% of tokens are very close
95%	0.107	95% with minor divergence
99%	0.369	1% show noticeable divergence
99.9%	2.08	Rare outliers — 0.1% of tokens

The distribution is strongly right-skewed: the two quants behave nearly identically on the vast majority of tokens, with meaningful divergence only in a thin tail.
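
For readers unfamiliar with these metrics, here is a minimal sketch of how they can be computed from two sets of next-token distributions. It assumes the usual definitions (KLD of the reference against the test distribution, top-token agreement, and Δp as the change in probability assigned to the actual next token); the function name and input layout are hypothetical and this is not the code llama-perplexity runs.

```python
import math

def compare_token_distributions(ref_probs, test_probs, target_ids):
    """Compare two models position by position.

    ref_probs / test_probs: one probability list per position (same vocab order).
    target_ids: index of the token that actually came next at each position.
    Assumes every test probability is > 0, as softmax outputs are.
    """
    klds, deltas, same_top = [], [], 0
    for p, q, t in zip(ref_probs, test_probs, target_ids):
        # KL(p || q): how much the test distribution diverges from the reference.
        klds.append(sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0))
        # Do both models rank the same token first?
        top_p = max(range(len(p)), key=p.__getitem__)
        top_q = max(range(len(q)), key=q.__getitem__)
        same_top += int(top_p == top_q)
        # Change in probability given to the correct token.
        deltas.append(q[t] - p[t])
    n = len(klds)
    return {
        "mean_kld": sum(klds) / n,
        "same_top": same_top / n,
        "mean_dp": sum(deltas) / n,
        "rms_dp": math.sqrt(sum(d * d for d in deltas) / n),
    }

# Toy example with a 3-token vocabulary and two positions.
stats = compare_token_distributions(
    ref_probs=[[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],
    test_probs=[[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    target_ids=[0, 1],
)
print(stats)
```

A mean Δp near zero with a non-trivial RMS Δp, as in the table, means the per-token differences are real but cancel out on average.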

HumanEval — Corrected Score

Initial results were corrupted by a bug in lm_eval's model="gguf" type, which calls .strip() on completions and so removes the 4-space indentation from the first line of every Python function body. As a result, more than half of the samples failed ast.parse() (only 34/75 compiled) and nearly all failed their tests.

After correcting the indentation (fix_completion in rescore script) and re-running code_eval:

Metric	Value	Notes
Preliminary corrected pass@1	90.67% (68/75)	⚠️ NOT a standard score — 75/164 problems, non-standard eval pipeline
Bugged score (reference)	2.67% (2/75)	`.strip()` destroyed function structure
Compile OK before fix	34/75 (45.3%)	Logic was often correct, syntax broken
Compile OK after fix	75/75 (100%)	Indentation was the only issue
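
The failure mode and the repair can be reproduced in a few lines. This is a hypothetical sketch and not the actual rescore script: the function borrows the fix_completion name from above, and it assumes the only damage is the missing indentation on the first line.

```python
import ast

def fix_completion(completion: str, indent: str = "    ") -> str:
    """Restore the leading indentation that .strip() removed from the first line."""
    if not completion or completion[0] in " \t":
        return completion  # already indented (or empty): leave untouched
    return indent + completion

def compiles(prompt: str, completion: str) -> bool:
    """True if the prompt plus the completion parses as valid Python."""
    try:
        ast.parse(prompt + completion)
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False

# A HumanEval-style prompt ends right after the signature/docstring.
prompt = "def add(a, b):\n"
raw = "    total = a + b\n    return total\n"
stripped = raw.strip()  # what the buggy code path effectively does

print(compiles(prompt, stripped))                  # first body line lost its indent
print(compiles(prompt, fix_completion(stripped)))  # indent restored
```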

⛔ This is NOT a standard HumanEval pass@1 score.

Only 75 of 164 problems were scored (problem #76 hung during generation)
Uses create_test filter (test-generation evaluation pipeline), not standard pass@1 on code solutions
Not comparable to the official OmniMerge-v4 benchmark (84.76% at Q6_K, all 164 problems, with --reasoning-format deepseek --reasoning-budget 8192)
A full 164-problem run with standard pass@1 would give a definitive score

Position in Official Quant Ladder

Quant	Size	Tier
F16 (source)	50.11 GiB	Lossless
IQ4_XS (uniform, official)	14.05 GiB	Best uniform 4-bit
IQ4_XS (mixed-bit, ours)	12.76 GiB	Beats IQ4_XS in PPL at 9% smaller
Q3_K_L (official)	13.36 GiB	Standard 3-bit upper tier

My quant occupies a sweet spot: better perplexity than the larger uniform IQ4_XS, and a smaller file than even the 3-bit Q3_K_L.

Observations

Mixed-bit wins at this size tier. At ~12.8 GiB, the optimizer can shift bits from low-impact FFN gates (iq3) to attention and output layers (iq5), producing measurably better perplexity than the uniform IQ4_XS, which is about 10% larger.
High behavioral agreement with uniform IQ4_XS. The 92% same-top-p and 0.037 KLD mean the two quants behave nearly identically per-token. The difference is a subtle quality uplift, not a different model.
The HumanEval bug underscores a tooling gap. lm_eval's gguf model type strips whitespace, which is destructive for Python code. The local-completions model type doesn't have this issue. The preliminary 90.67% is indicative but not a final score.
KLD here compares two quants, not quant-vs-F16. The absolute loss from the original model is unknown without loading the 50 GiB F16 reference. PPL (lower is better) is the more informative absolute metric.

Quantization Design

Why Mixed-Bit

The upstream ManniX-ITA GGUF ladder provides uniform quants, where every tensor gets the same type. This quant instead uses about 50 per-tensor regex rules guided by an importance matrix, allocating higher precision where it matters and compressing where it doesn't.
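
To make the idea of per-tensor regex rules concrete, here is a minimal sketch of how a rule list can map tensor names to quant types. The patterns and the first-match-wins order are assumptions for illustration only; they are not the rules in this quant's recipe, and the tensor names simply follow the common GGUF naming style.

```python
import re

# Hypothetical rules for illustration; NOT the actual recipe for this quant.
RULES = [
    (r"^output\.weight$", "iq5_k"),                # protect the output head
    (r"^blk\.[0-3]\.attn_.*\.weight$", "iq5_k"),   # early attention layers
    (r"^blk\.\d+\.ffn_gate\.weight$", "iq3_k"),    # compress FFN gates harder
    (r"^blk\.\d+\..*\.weight$", "iq4_kt"),         # everything else in a block
]
COMPILED = [(re.compile(pattern), qtype) for pattern, qtype in RULES]

def assign_type(tensor_name: str, fallback: str = "q8_0") -> str:
    """Return the type of the first matching rule, else the fallback type."""
    for pattern, qtype in COMPILED:
        if pattern.search(tensor_name):
            return qtype
    return fallback

for name in ["output.weight", "blk.2.attn_k.weight", "blk.12.attn_k.weight",
             "blk.40.ffn_gate.weight", "token_embd.weight"]:
    print(f"{name:28s} -> {assign_type(name)}")
```

Rule order matters in a scheme like this: specific patterns go first so the broad catch-all does not shadow them.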

Key Decision: Output Weight "Safety Patch"

The GGUF-Tool-Suite optimizer's KLD-guided allocation chose iq4_k for output.weight. I overrode this to iq5_k, a deliberate +152 MB investment in the model's final linear layer, through which every token's logits pass. Under-quantizing the output head creates systematic logit errors that compound across generations. The override mimics other Qwen3.6 quants I have seen, which almost universally protect this tensor at q8_0 or higher.

Post-hoc note: Reference recipes at comparable BPW tiers suggest the optimizer would have preferred removing IQ3 from mid-layer FFN over upgrading output.weight. The override was a safety-first decision — the practical impact is small (PPL 6.86 speaks for itself), but in hindsight the optimizer's default allocation was likely just as good at this bit budget.

Mixed-Bit Strategy

Where	Precision	Why
Norms, biases, SSM params	f32 (353 tensors)	Numerical stability
SSM alpha/beta (early layers)	q8_0 (58 tensors)	State dynamics need fidelity
SSM alpha/beta (deep layers)	iq6_k / q6_K (38 tensors)	State tracking accumulates
Early attention layers (0–3)	iq5_k	Anchor layers, high impact
Full-attention blocks (every 4th)	iq5_k for K/V/output	Every 4th layer gets full attention
Bulk FFN + attention	iq4_kt (257 tensors, 71%)	Standard 4-bit, importance-guided
Deeper FFN gates	iq3_k / iq3_kt (18 tensors)	Where compression hurts least

Per-Tensor Type Distribution

QTYPE	Count	Use Case
`f32`	353	Norms, biases, SSM parameters
`iq4_kt`	257	Bulk FFN + attention weights
`iq4_k`	56	Selected FFN/attention
`q8_0`	58	SSM alpha/beta weights
`iq5_k`	38	Critical attention, output/embedding
`iq4_ks`	33	Selected smaller FFN/attention
`iq6_k`	18	Deep SSM weights
`q6_K`	20	Mid-layer SSM weights
`iq3_k`	7	Deeper FFN gates
`iq3_kt`	11	Embedding, early FFN

Hardware & Performance

VRAM & Context Limits

Tested on NVIDIA RTX 5070 Ti (16 GB, 16,302 MiB total) using q8_0 KV cache with flash attention.

Measured at 42k context (llama-sweep-bench, -c 42000):

Component	Size
Model tensors (CUDA0)	11,746 MiB
Compute buffer (CUDA0)	505 MiB
KV cache (q8_0, 42,240 tok)	1,552 MiB (self: 1,403 MiB)
Total GPU	~13,803 MiB
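
As a quick sanity check, the components in the table can be summed and compared against the card's reported total VRAM. This sketch uses only the measured figures above; the resulting headroom is a static number and, because the attention workspace is allocated at run time, should not be read as memory that is fully available for more context.

```python
# Figures from the 42k-context measurement on the RTX 5070 Ti (MiB).
total_vram_mib = 16302
components_mib = {
    "model tensors": 11746,
    "compute buffer": 505,
    "KV cache (q8_0, 42,240 tok)": 1552,
}

used_mib = sum(components_mib.values())
headroom_mib = total_vram_mib - used_mib

print(f"Used:     {used_mib} MiB")
print(f"Headroom: {headroom_mib} MiB (before dynamic workspace)")
```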

KV cache per token (q8_0): ~34.0 KiB self size (1,403 MiB / 42,240 tokens), ~37.6 KiB with allocation overhead (1,552 MiB / 42,240 tokens).

Practical max context: ~46k–48k — beyond this, the dynamic attention workspace (K×Q matrices, graph intermediates) competes with the remaining headroom and TG performance degrades sharply or OOM occurs. The naive linear extrapolation of KV cache alone suggests ~75k, but the real ceiling is set by the runtime compute workspace, which scales with context and is not pre-allocated.

Context	Approx. total GPU VRAM	Notes
8k	~12,300 MiB	Comfortable, plenty of headroom
16k	~12,600 MiB
32k	~13,200 MiB
42k	~13,800 MiB	Confirmed by sweep-bench
~48k	~14,300 MiB	Practical max — attention workspace eats headroom
64k+	OOM / spill	Dynamic workspace exceeds available VRAM

Text generation (llama-sweep-bench): ~40–47 t/s, varying with KV cache fill level.

Measured with flash attention, RTX 5070 Ti, ngl=99, n_ubatch=512. PP speed ~225 t/s is for 512-token prompt batches at the model's full precision; the earlier ~1,750 t/s figure was from PPL eval with larger batches (n_batch=512) over long sequences, measuring sustained throughput rather than step latency.

TG starts at ~47 t/s with empty KV cache and gradually declines to ~40 t/s as context fills, with occasional dips to ~36 t/s at intermediate cache sizes due to compute graph reshuffling.

Usage

llama-server

Recommended (safe, plenty of headroom):

llama-server \
  -m Qwen3.6-27B-Omnimerge-v4-IQ4_XS-12.6GiB.gguf \
  -c 32768 -ngl 99 -t 12 -ub 512 \
  --jinja --reasoning-budget 1024 \
  -ctk q8_0 -ctv q8_0 -fa

For maximum context (16 GB GPU, try 60k first):

llama-server \
  -m Qwen3.6-27B-Omnimerge-v4-IQ4_XS-12.6GiB.gguf \
  -c 60000 -ngl 99 -t 12 -ub 512 -amb 256 \
  --jinja --reasoning-budget 1024 \
  -ctk q8_0 -ctv q8_0 -fa

Tighter still (aggressive VRAM use):

llama-server \
  -m Qwen3.6-27B-Omnimerge-v4-IQ4_XS-12.6GiB.gguf \
  -c 60000 -ngl 99 -t 12 -ub 512 -amb 256 --fit --fit-margin 256 \
  --jinja --reasoning-budget 1024 \
  -ctk q8_0 -ctv q8_0 -fa

-ub 512 is essential: it keeps the physical batch size small, which reduces the dynamic attention workspace and allows a larger context before VRAM contention
-amb 256 caps the attention K×Q workspace at 256 MiB, preventing blowup at large context (if set too low, attention is computed in multiple slower passes)
--fit --fit-margin 256 lets the runtime use VRAM more aggressively (default safety margin is 1024 MiB); it won't offload to CPU unless absolutely necessary
32k context is the safe sweet spot: at ~13,200 MiB total it fits comfortably on 16 GB GPUs
60k is the suggested starting point for maximum context on 16 GB with -amb 256; this is above the ~46k–48k practical ceiling reported in the VRAM section, so watch for OOM or a sharp drop in TG speed and back off if needed
For 12 GB GPUs, start at -c 16384 and reduce if needed; the model tensors alone take 11,746 MiB, so expect part of the model to be offloaded to CPU
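
Once llama-server is running with one of the configurations above, it can be queried over its OpenAI-compatible HTTP API. The following is a minimal client sketch using only the Python standard library; it assumes the server listens on the default http://localhost:8080, and the helper names and the max_tokens value are illustrative choices.

```python
import json
import urllib.request

def build_chat_request(prompt: str, base_url: str = "http://localhost:8080"):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,  # illustrative cap on the reply length
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """Send the prompt to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage, once llama-server is up:
# print(ask("Write a Python function that reverses a string."))
```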

lm_eval (reproducible benchmarks)

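# Server-side flags (-c, -ngl, --reasoning-*) go on llama-server, not in lm_eval's model_args:
llama-server -m Qwen3.6-27B-Omnimerge-v4-IQ4_XS-12.6GiB.gguf \
  -c 8192 -ngl 99 --reasoning-format deepseek --reasoning-budget 8192
# The gguf model type strips completions (see the HumanEval section), so rescore code tasks afterwards:
lm_eval --model gguf \
  --model_args base_url=http://localhost:8080 \
  --tasks humaneval,mbpp,gpqa_diamond_generative_n_shot \
  --output_path results/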

Quantization Pipeline

Toolchain: ik_llama.cpp (official fork) via Thireus custom build (build 4758, GCC 11.4.0, Linux/WSL)
Importance matrix: imatrix-Qwen3.6-27B-BF16.dat — 497 entries, 829 chunks, ubergarm calibration corpus v02
Recipe: 48 per-tensor regex rules at recipes/Qwen3.6-27B-Omnimerge-v4-12.6GiB-IQ4_XS-13GiB.recipe.txt
Pipeline: Generate recipe (quant_assign.py) → Quantize (llama-quantize) → Evaluate (PPL, KLD)

Compression

Metric	Value
Source (F16)	50.11 GiB (~51,313 MiB)
Quant size	13,054 MiB
Compression ratio	3.93×
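
The ratio can be checked from the sizes quoted elsewhere in this card. The sketch below takes the F16 size as 50.11 GiB (from the quant ladder) and the quant as 13,054 MiB, and also back-computes the implied parameter count from the stated 4.06 bpw as a consistency check.

```python
MIB_PER_GIB = 1024
BYTES_PER_MIB = 1024 ** 2

source_gib = 50.11   # F16 source size, from the quant ladder
quant_mib = 13054    # quantized file size on disk
bpw = 4.06           # bits per weight stated for this quant

source_mib = source_gib * MIB_PER_GIB
ratio = source_mib / quant_mib

# Parameter count implied by file size and bits per weight.
implied_params = quant_mib * BYTES_PER_MIB * 8 / bpw

print(f"Source size:       {source_mib:,.0f} MiB")
print(f"Compression ratio: {ratio:.2f}x")
print(f"Implied params:    {implied_params / 1e9:.1f} B")
```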

Recipe Evolution

Optimizer-generated → output.weight=iq4_k, using Qwen3.6-27B KLD calibration
Manual override → output.weight=iq5_k (regretted "safety patch", +152 MB)

Reproducing

# Quantize
llama-quantize \
  --imatrix imatrix-Qwen3.6-27B-BF16.dat \
  --override-kv general.quantization_version=2 \
  Qwen3.6-27B-Omnimerge-v4-F16.gguf \
  Qwen3.6-27B-Omnimerge-v4-IQ4_XS-12.6GiB.gguf \
  Q8_0 \
  local_quant/recipes/Qwen3.6-27B-Omnimerge-v4-12.6GiB-IQ4_XS-13GiB.recipe.txt

# Evaluate PPL
llama-perplexity -m Qwen3.6-27B-Omnimerge-v4-IQ4_XS-12.6GiB.gguf \
  -f wiki.test.raw -ngl 99 -c 512 -b 512 -fa

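# KLD vs reference, in two passes (<reference>.gguf and ref-logits.kld are placeholder names)
llama-perplexity -m <reference>.gguf \
  -f wiki.test.raw --kl-divergence-base ref-logits.kld -ngl 99 -c 512
llama-perplexity -m Qwen3.6-27B-Omnimerge-v4-IQ4_XS-12.6GiB.gguf \
  --kl-divergence-base ref-logits.kld --kl-divergence -ngl 99 -c 512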

Note: Q8_0 is the fallback type; actual quantization is determined by the recipe rules.

License & Credits

License: Apache-2.0 (inherited from Qwen/Qwen3.6-27B and OmniMerge-v4)

Source model: ManniX-ITA/Qwen3.6-27B-Omnimerge-v4 — DARE-TIES merge of Qwen3.6-27B with 3 fine-tunes, MLP-passthrough surgery.

Acknowledgements:

ManniX-ITA — OmniMerge-v4 merge
ubergarm — imatrix calibration corpus v02
ggerganov/llama.cpp — upstream quantization framework
ikawrakow/ik_llama.cpp — official fork with MTP, FlashMLA, advanced speculative decoding
Thireus/ik_llama.cpp — custom build used for this quantization
Thireus/GGUF-Tool-Suite — recipe generation and quantization orchestration tools
