Qwen3.6-27B-Omnimerge-v4: IQ4_XS (Mixed-Bit, 12.76 GiB)
Runtime note: This quant was built and tested with ikawrakow's ik_llama.cpp fork. It has not been tested on mainline ggml/llama.cpp. For best results and full feature support (mixed-bit quant loading), use ik_llama.cpp.
Base model: OmniMerge-v4 on Qwen3.6-27B
Size: 12.76 GiB (13.7 GB, 13,054 MiB on disk) — 4.06 bpw
VRAM: Fits 16 GB GPUs comfortably with 32k+ context
License: Apache-2.0
A custom importance-matrix-guided mixed-bit quantization of the OmniMerge-v4 merge. Achieves lower perplexity than the official uniform IQ4_XS release while being 1.3 GiB smaller — a clear win for targeted bit allocation at this size tier.
Results
Perplexity (wiki.test.raw, 580 chunks, n_ctx=512)
| Model | PPL | Size | Δ from official IQ4_XS |
|---|---|---|---|
| IQ4_XS (mixed-bit, ours) | 6.864 ± 0.045 | 12.76 GiB | −0.055 |
| IQ4_XS (uniform, official) | 6.919 ± 0.045 | 14.05 GiB | — |
| F16 (estimated) | ~6.70 | 50.11 GiB | — |
- Both quants lose roughly 0.16–0.22 PPL relative to the estimated F16 baseline.
- The mixed-bit recipe recovers ~25% of the uniform IQ4_XS loss while using 9% less disk space.
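The two percentages above can be checked with a few lines of arithmetic. This is a minimal sketch using only the numbers from the perplexity table; the F16 value is the estimate quoted there, not a measurement, and the function names are illustrative.

```python
# Values copied from the perplexity table above.
PPL_MIXED = 6.864        # mixed-bit IQ4_XS (this quant)
PPL_UNIFORM = 6.919      # uniform IQ4_XS (official)
PPL_F16_EST = 6.70       # estimated F16 baseline, not measured
SIZE_MIXED_GIB = 12.76
SIZE_UNIFORM_GIB = 14.05


def recovered_fraction(ppl_mixed, ppl_uniform, ppl_ref):
    """Share of the uniform quant's PPL loss that the mixed quant wins back."""
    return (ppl_uniform - ppl_mixed) / (ppl_uniform - ppl_ref)


def size_saving(size_small, size_large):
    """Relative size reduction of the smaller file versus the larger one."""
    return 1.0 - size_small / size_large


recovered = recovered_fraction(PPL_MIXED, PPL_UNIFORM, PPL_F16_EST)
saving = size_saving(SIZE_MIXED_GIB, SIZE_UNIFORM_GIB)
print(f"recovered loss: {recovered:.1%}, size saving: {saving:.1%}")
```

Because the F16 baseline is only an estimate, the recovered fraction inherits that uncertainty; the size saving depends only on measured file sizes.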
KLD vs Upstream IQ4_XS
| Metric | Value | Interpretation |
|---|---|---|
| Mean KLD | 0.0371 ± 0.0015 | Low — distributions are very similar |
| Same top-p | 92.31% ± 0.17% | 92% of tokens agree on the most likely next token |
| Correlation | 99.25% | Confidence scores move almost perfectly in sync |
| Mean Δp | −0.026% ± 0.033% | No systematic bias (neither quant is consistently over/under-confident) |
| RMS Δp | 5.23% ± 0.13% | Root-mean-square per-token probability difference is ~5% |
Percentile breakdown of per-token KLD (all tokens):
| Percentile | KLD threshold | Meaning |
|---|---|---|
| 50% (median) | 0.013 | Half of all tokens have negligible KLD |
| 90% | 0.063 | 90% of tokens are very close |
| 95% | 0.107 | 95% with minor divergence |
| 99% | 0.369 | 1% show noticeable divergence |
| 99.9% | 2.08 | Rare outliers — 0.1% of tokens |
The distribution is strongly right-skewed: the two quants behave nearly identically on the vast majority of tokens, with meaningful divergence only in a thin tail.
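For readers unfamiliar with the metrics in the tables above, the following is a minimal sketch of how per-token KL divergence and top-token agreement can be computed from two sets of next-token logits. The logits are toy values invented for illustration; the actual numbers in this card come from llama-perplexity, not from this code.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats, where p is the reference distribution."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def same_top_token(p, q):
    """True when both distributions rank the same token first."""
    return p.index(max(p)) == q.index(max(q))


# Toy logits over a 3-token vocabulary (hypothetical values).
ref_logits = [2.0, 1.0, 0.1]    # reference quant
test_logits = [1.9, 1.1, 0.1]   # quant under test

p = softmax(ref_logits)
q = softmax(test_logits)
print("KLD:", kl_divergence(p, q), "same top token:", same_top_token(p, q))
```

Averaging `kl_divergence` over every evaluated position gives a mean KLD, and the fraction of positions where `same_top_token` is true corresponds to the agreement rate.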
HumanEval — Corrected Score
Initial results were corrupted by a bug in lm_eval's model="gguf" type, which calls .strip() on completions and so removes the 4-space indentation from the first line of every Python function body. This left more than half of the samples failing ast.parse() and nearly all of them failing their tests.
After correcting the indentation (fix_completion in rescore script) and re-running code_eval:
| Metric | Value | Notes |
|---|---|---|
| Preliminary corrected pass@1 | 90.67% (68/75) | ⚠️ NOT a standard score — 75/164 problems, non-standard eval pipeline |
| Bugged score (reference) | 2.67% (2/75) | .strip() destroyed function structure |
| Compile OK before fix | 34/75 (45.3%) | Logic was often correct, syntax broken |
| Compile OK after fix | 75/75 (100%) | Indentation was the only issue |
⛔ This is NOT a standard HumanEval pass@1 score.
- Only 75 of 164 problems were scored (problem #76 hung during generation)
- Uses create_test filter (test-generation evaluation pipeline), not standard pass@1 on code solutions
- Not comparable to the official OmniMerge-v4 benchmark (84.76% at Q6_K, all 164 problems, with --reasoning-format deepseek --reasoning-budget 8192)
- A full 164-problem run with standard pass@1 would give a definitive score
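The indentation repair described in this section can be sketched as follows. The rescore script itself is not reproduced in this card, so this `fix_completion` is a hypothetical reconstruction of the idea (re-indent a first line that lost its leading whitespace), and `compiles` is an illustrative helper.

```python
import ast


def fix_completion(completion: str, indent: str = "    ") -> str:
    """Re-indent the first line of a function body whose leading whitespace was stripped."""
    lines = completion.split("\n")
    if lines and lines[0] and not lines[0][0].isspace():
        lines[0] = indent + lines[0]
    return "\n".join(lines)


def compiles(prompt: str, completion: str) -> bool:
    """Check whether prompt + completion parses as valid Python."""
    try:
        ast.parse(prompt + completion)
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


# A HumanEval-style prompt and a body whose first line was hit by .strip().
prompt = "def add(a, b):\n"
stripped_body = "total = a + b\n    return total"

print("before fix:", compiles(prompt, stripped_body))
print("after fix: ", compiles(prompt, fix_completion(stripped_body)))
```

A body that still has its indentation passes through unchanged, so the repair is safe to apply to every sample.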
Position in Official Quant Ladder
| Quant | Size | Tier |
|---|---|---|
| F16 (source) | 50.11 GiB | Lossless |
| IQ4_XS (uniform, official) | 14.05 GiB | Best uniform 4-bit |
| IQ4_XS (mixed-bit, ours) | 12.76 GiB | Beats IQ4_XS in PPL at 9% smaller |
| Q3_K_L (official) | 13.36 GiB | Standard 3-bit upper tier |
My quant occupies a sweet spot: better perplexity than the larger uniform IQ4_XS, and a smaller file than even the 3-bit Q3_K_L.
Observations
- Mixed-bit wins at this size tier. At ~12.76 GiB, the optimizer can shift bits from low-impact FFN gates (iq3) to attention and output layers (iq5), producing lower perplexity than the uniform IQ4_XS release, which is about 10% larger.
- High behavioral agreement with uniform IQ4_XS. The 92% same-top-p and 0.037 KLD mean the two quants behave nearly identically per-token. The difference is a subtle quality uplift, not a different model.
- The HumanEval bug underscores a tooling gap. lm_eval's gguf model type strips whitespace, which is destructive for Python code. The local-completions model type doesn't have this issue. The preliminary 90.67% is indicative but not a final score.
- KLD here compares two quants, not quant-vs-F16. The absolute loss from the original model is unknown without loading the 50 GiB F16 reference. PPL (lower is better) is the more informative absolute metric.
Quantization Design
Why Mixed-Bit
The upstream ManniX-ITA GGUF ladder provides uniform quants — every tensor gets the same type. This quant instead uses 50 per-tensor regex rules guided by an importance matrix, allocating higher precision where it matters and compressing where it doesn't.
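To make the idea of per-tensor regex rules concrete, here is a minimal sketch of first-match-wins type assignment. The rules below are hypothetical and far simpler than the real recipe; only the tensor naming style (blk.N.*, output.weight) and the quant type names are taken from this card.

```python
import re

# Hypothetical rules for illustration only: (regex, quant type), first match wins.
RULES = [
    (r"_norm\.weight$", "f32"),                    # norms stay in full precision
    (r"^output\.weight$", "iq5_k"),                # protected output head
    (r"^blk\.[0-3]\.attn_.*\.weight$", "iq5_k"),   # early attention layers
    (r"^blk\.[45]\d\.ffn_gate\.weight$", "iq3_k"), # deeper FFN gates (layer range assumed)
]
FALLBACK = "q8_0"  # used when no rule matches


def assign_type(tensor_name, rules=RULES, fallback=FALLBACK):
    """Return the quant type of the first rule whose regex matches the tensor name."""
    for pattern, qtype in rules:
        if re.search(pattern, tensor_name):
            return qtype
    return fallback


for name in ("output.weight", "blk.2.attn_q.weight", "blk.2.attn_norm.weight",
             "blk.40.ffn_gate.weight", "blk.40.ffn_down.weight"):
    print(f"{name:28s} -> {assign_type(name)}")
```

Rule order matters in a scheme like this: the norm rule is listed first so that an attention norm is not caught by the broader early-attention pattern.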
Key Decision: Output Weight "Safety Patch"
The GGUF-Tool-Suite optimizer's KLD-guided allocation chose iq4_k for output.weight. I overrode this to iq5_k, a deliberate +152 MB investment in the model's final linear layer, through which every token's logits pass. Under-quantizing the output head creates systematic logit errors that compound across generations. The override follows other Qwen3.6 quants I have seen, which almost universally protect this tensor at q8_0 or higher.
Post-hoc note: Reference recipes at comparable BPW tiers suggest the optimizer would have preferred removing IQ3 from mid-layer FFN over upgrading output.weight. The override was a safety-first decision — the practical impact is small (PPL 6.86 speaks for itself), but in hindsight the optimizer's default allocation was likely just as good at this bit budget.
Mixed-Bit Strategy
| Where | Precision | Why |
|---|---|---|
| Norms, biases, SSM params | f32 (353 tensors) | Numerical stability |
| SSM alpha/beta (early layers) | q8_0 (58 tensors) | State dynamics need fidelity |
| SSM alpha/beta (deep layers) | iq6_k / q6_K (38 tensors) | State tracking accumulates |
| Early attention layers (0–3) | iq5_k | Anchor layers, high impact |
| Full-attention blocks (every 4th) | iq5_k for K/V/output | Every 4th layer gets full attention |
| Bulk FFN + attention | iq4_kt (257 tensors, 71%) | Standard 4-bit, importance-guided |
| Deeper FFN gates | iq3_k / iq3_kt (18 tensors) | Where compression hurts least |
Per-Tensor Type Distribution
| QTYPE | Count | Use Case |
|---|---|---|
| f32 | 353 | Norms, biases, SSM parameters |
| iq4_kt | 257 | Bulk FFN + attention weights |
| iq4_k | 56 | Selected FFN/attention |
| q8_0 | 58 | SSM alpha/beta weights |
| iq5_k | 38 | Critical attention, output/embedding |
| iq4_ks | 33 | Selected smaller FFN/attention |
| iq6_k | 18 | Deep SSM weights |
| q6_K | 20 | Mid-layer SSM weights |
| iq3_k | 7 | Deeper FFN gates |
| iq3_kt | 11 | Embedding, early FFN |
Hardware & Performance
VRAM & Context Limits
Tested on NVIDIA RTX 5070 Ti (16 GB, 16,302 MiB total) using q8_0 KV cache with flash attention.
Measured at 42k context (llama-sweep-bench, -c 42000):
| Component | Size |
|---|---|
| Model tensors (CUDA0) | 11,746 MiB |
| Compute buffer (CUDA0) | 505 MiB |
| KV cache (q8_0, 42,240 tok) | 1,552 MiB (self: 1,403 MiB) |
| Total GPU | ~13,803 MiB |
KV cache per token (q8_0): ~0.0332 MiB (34.0 KiB) self size, ~0.0367 MiB (37.6 KiB) with allocation overhead.
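The per-token cost follows from dividing the measured cache size by the 42,240-token allocation. The sketch below reports it in both MiB and KiB (1 MiB = 1024 KiB); the helper name is illustrative, and the inputs are the sweep-bench measurements from the table above.

```python
KV_TOKENS = 42_240     # allocated KV cache length from the sweep-bench run
KV_SELF_MIB = 1_403    # KV cache self size
KV_ALLOC_MIB = 1_552   # KV cache including allocation overhead


def per_token(cache_mib, tokens):
    """Return (MiB per token, KiB per token) for a cache of the given size."""
    mib = cache_mib / tokens
    return mib, mib * 1024


self_mib, self_kib = per_token(KV_SELF_MIB, KV_TOKENS)
alloc_mib, alloc_kib = per_token(KV_ALLOC_MIB, KV_TOKENS)
print(f"self:      {self_mib:.4f} MiB = {self_kib:.1f} KiB per token")
print(f"allocated: {alloc_mib:.4f} MiB = {alloc_kib:.1f} KiB per token")
```

This covers only the KV cache itself; the compute workspace discussed next grows separately and is not captured by a per-token figure.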
Practical max context: ~46k–48k — beyond this, the dynamic attention workspace (K×Q matrices, graph intermediates) competes with the remaining headroom and TG performance degrades sharply or OOM occurs. The naive linear extrapolation of KV cache alone suggests ~75k, but the real ceiling is set by the runtime compute workspace, which scales with context and is not pre-allocated.
| Context | Approx. total GPU VRAM | Notes |
|---|---|---|
| 8k | ~12,300 MiB | Comfortable, plenty of headroom |
| 16k | ~12,600 MiB | |
| 32k | ~13,200 MiB | |
| 42k | ~13,800 MiB | Confirmed by sweep-bench |
| ~48k | ~14,300 MiB | Practical max — attention workspace eats headroom |
| 64k+ | OOM / spill | Dynamic workspace exceeds available VRAM |
- Text generation (sweep-bench): ~40–47 t/s, varying with KV cache fill level
Measured with flash attention, RTX 5070 Ti, ngl=99, n_ubatch=512. PP speed ~225 t/s is for 512-token prompt batches at the model's full precision; the earlier ~1,750 t/s figure was from PPL eval with larger batches (n_batch=512) over long sequences, measuring sustained throughput rather than step latency.
TG starts at ~47 t/s with empty KV cache and gradually declines to ~40 t/s as context fills, with occasional dips to ~36 t/s at intermediate cache sizes due to compute graph reshuffling.
Usage
llama-server
Recommended (safe, plenty of headroom):
llama-server \
-m Qwen3.6-27B-Omnimerge-v4-IQ4_XS-12.6GiB.gguf \
-c 32768 -ngl 99 -t 12 -ub 512 \
--jinja --reasoning-budget 1024 \
-ctk q8_0 -ctv q8_0 -fa
For maximum context (16 GB GPU, try 60k first):
llama-server \
-m Qwen3.6-27B-Omnimerge-v4-IQ4_XS-12.6GiB.gguf \
-c 60000 -ngl 99 -t 12 -ub 512 -amb 256 \
--jinja --reasoning-budget 1024 \
-ctk q8_0 -ctv q8_0 -fa
Tighter still (aggressive VRAM use):
llama-server \
-m Qwen3.6-27B-Omnimerge-v4-IQ4_XS-12.6GiB.gguf \
-c 60000 -ngl 99 -t 12 -ub 512 -amb 256 --fit --fit-margin 256 \
--jinja --reasoning-budget 1024 \
-ctk q8_0 -ctv q8_0 -fa
- -ub 512 is essential: it keeps the physical batch size small, reducing dynamic attention workspace and allowing larger context before VRAM contention
- -amb 256 caps the attention K×Q workspace at 256 MiB, preventing blowup at large context (if too low, attention computes in multiple slower passes)
- --fit --fit-margin 128 lets the runtime use VRAM more aggressively (default safety margin is 1024 MiB); won't offload to CPU unless absolutely necessary
- 32k context is the safe sweet spot: fits comfortably on 12–16 GB GPUs
- 60k–64k is the practical ceiling on 16 GB; start at 60k and increase if there's headroom
- For 12 GB GPUs, start at -c 16384 and reduce if needed
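Once llama-server is running with one of the commands above, it can be queried over its OpenAI-compatible HTTP API. The following is a minimal sketch using only the Python standard library; it assumes the server listens on localhost:8080 (the address used in the lm_eval example in this card) and that the /v1/chat/completions route is available in your build. The helper names and the max_tokens value are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumed default; change if you pass --port


def build_chat_request(prompt, base_url=BASE_URL, max_tokens=256):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=120):
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Requires a running llama-server, so the call is left commented out:
# print(chat("Write a haiku about quantization."))
```

Keeping request construction separate from the network call makes it easy to inspect the payload before sending anything.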
lm_eval (reproducible benchmarks)
lm_eval --model gguf \
--model_args base_url=http://localhost:8080,model_args=-c,8192,-ngl,99,--reasoning-format,deepseek,--reasoning-budget,8192 \
--tasks humaneval,mbpp,gpqa_diamond_generative_n_shot \
--output_path results/
Quantization Pipeline
Toolchain: ik_llama.cpp (official fork) via Thireus custom build (build 4758, GCC 11.4.0, Linux/WSL)
Importance matrix: imatrix-Qwen3.6-27B-BF16.dat — 497 entries, 829 chunks, ubergarm calibration corpus v02
Recipe: 48 per-tensor regex rules at recipes/Qwen3.6-27B-Omnimerge-v4-12.6GiB-IQ4_XS-13GiB.recipe.txt
Pipeline: Generate recipe (quant_assign.py) → Quantize (llama-quantize) → Evaluate (PPL, KLD)
Compression
| Metric | Value |
|---|---|
| Source (F16) | 51,313 MiB (50.11 GiB) |
| Quant size | 13,054 MiB |
| Compression ratio | 3.93× |
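The compression ratio and the bits-per-weight figure quoted earlier can be reproduced with simple arithmetic. This sketch takes the F16 source as 50.11 GiB, as in the tables above, and assumes a nominal 27 billion parameters for the bpw estimate; the exact parameter count is not stated in this card.

```python
SOURCE_GIB = 50.11        # F16 source size used in the tables above
QUANT_MIB = 13_054        # quantized file size on disk
ASSUMED_PARAMS = 27e9     # assumption: nominal 27B parameters

source_mib = SOURCE_GIB * 1024
ratio = source_mib / QUANT_MIB

quant_bits = QUANT_MIB * 1024**2 * 8
bpw = quant_bits / ASSUMED_PARAMS

print(f"source: {source_mib:,.0f} MiB")
print(f"compression ratio: {ratio:.2f}x")
print(f"approx. bits per weight: {bpw:.2f}")
```

The bpw value is sensitive to the assumed parameter count, so treat it as a consistency check rather than an exact figure.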
Recipe Evolution
- Optimizer-generated → output.weight=iq4_k, using Qwen3.6-27B KLD calibration
- Manual override → output.weight=iq5_k (regretted "safety patch", +152 MB)
Reproducing
# Quantize
llama-quantize \
--imatrix imatrix-Qwen3.6-27B-BF16.dat \
--override-kv general.quantization_version=2 \
Qwen3.6-27B-Omnimerge-v4-F16.gguf \
Qwen3.6-27B-Omnimerge-v4-IQ4_XS-12.6GiB.gguf \
Q8_0 \
local_quant/recipes/Qwen3.6-27B-Omnimerge-v4-12.6GiB-IQ4_XS-13GiB.recipe.txt
# Evaluate PPL
llama-perplexity -m Qwen3.6-27B-Omnimerge-v4-IQ4_XS-12.6GiB.gguf \
-f wiki.test.raw -ngl 99 -c 512 -b 512 -fa
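# KLD vs reference: first save logits from the reference quant, then compare against them
# (reference.gguf and ref-logits.kld are placeholder names)
llama-perplexity -m reference.gguf \
-f wiki.test.raw --kl-divergence-base ref-logits.kld -ngl 99 -c 512
llama-perplexity -m Qwen3.6-27B-Omnimerge-v4-IQ4_XS-12.6GiB.gguf \
-f wiki.test.raw --kl-divergence-base ref-logits.kld --kl-divergence -ngl 99 -c 512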
Note: Q8_0 is the fallback type; actual quantization is determined by the recipe rules.
License & Credits
License: Apache-2.0 (inherited from Qwen/Qwen3.6-27B and OmniMerge-v4)
Source model: ManniX-ITA/Qwen3.6-27B-Omnimerge-v4 — DARE-TIES merge of Qwen3.6-27B with 3 fine-tunes, MLP-passthrough surgery.
Acknowledgements:
- ManniX-ITA — OmniMerge-v4 merge
- ubergarm — imatrix calibration corpus v02
- ggerganov/llama.cpp — upstream quantization framework
- ikawrakow/ik_llama.cpp — official fork with MTP, FlashMLA, advanced speculative decoding
- Thireus/ik_llama.cpp — custom build used for this quantization
- Thireus/GGUF-Tool-Suite — recipe generation and quantization orchestration tools
See Also
- Official OmniMerge-v4 GGUF release — full quant ladder
- Qwen3.6-27B base model — by Qwen team