Qwen3.6-27B-Omnimerge-v4 — IQ4_XS (Mixed-Bit, 12.6 GiB)

Runtime note: This quant was built and tested with ikawrakow's ik_llama.cpp fork. It has not been tested on the mainline ggml/llama.cpp. For best results and full feature support (mixed-bit quant loading), use ik_llama.cpp.

Base model: OmniMerge-v4 on Qwen3.6-27B
Size: 12.76 GiB (13.7 GB, 13,054 MiB on disk) — 4.06 bpw
VRAM: Fits 16 GB GPUs comfortably with 32k+ context
License: Apache-2.0

A custom importance-matrix-guided mixed-bit quantization of the OmniMerge-v4 merge. Achieves lower perplexity than the official uniform IQ4_XS release while being 1.3 GiB smaller — a clear win for targeted bit allocation at this size tier.


Results

Perplexity (wiki.test.raw, 580 chunks, n_ctx=512)

Model PPL Size Δ from official IQ4_XS
IQ4_XS (mixed-bit, ours) 6.864 ± 0.045 12.76 GiB −0.055
IQ4_XS (uniform, official) 6.919 ± 0.045 14.05 GiB
F16 (estimated) ~6.70 50.11 GiB
  • Both quants lose ~0.15–0.20 PPL from the F16 baseline.
  • The mixed-bit recipe recovers ~25% of that loss compared to uniform IQ4_XS, while using 9% less disk space

KLD vs Upstream IQ4_XS

Metric Value Interpretation
Mean KLD 0.0371 ± 0.0015 Low — distributions are very similar
Same top-p 92.31% ± 0.17% 92% of tokens agree on the most likely next token
Correlation 99.25% Confidence scores move almost perfectly in sync
Mean Δp −0.026% ± 0.033% No systematic bias (neither quant is consistently over/under-confident)
RMS Δp 5.23% ± 0.13% Per-token probability differences average ~5%

Percentile breakdown of disagreement (the 8% where top-p differs):

Percentile KLD threshold Meaning
50% (median) 0.013 Half of KLD mass is negligible
90% 0.063 90% of tokens are very close
95% 0.107 95% with minor divergence
99% 0.369 1% show noticeable divergence
99.9% 2.08 Rare outliers — 0.1% of tokens

The distribution is strongly right-skewed: the two quants behave nearly identically on the vast majority of tokens, with meaningful divergence only in a thin tail.

HumanEval — Corrected Score

Initial results were corrupted by a bug in lm_eval's model="gguf" type which calls .strip() on completions, removing the 4-space indentation from the first line of every Python function body. This caused nearly all samples to fail ast.parse().

After correcting the indentation (fix_completion in rescore script) and re-running code_eval:

Metric Value Notes
Preliminary corrected pass@1 90.67% (68/75) ⚠️ NOT a standard score — 75/164 problems, non-standard eval pipeline
Bugged score (reference) 2.67% (2/75) .strip() destroyed function structure
Compile OK before fix 34/75 (45.3%) Logic was often correct, syntax broken
Compile OK after fix 75/75 (100%) Indentation was the only issue

⛔ This is NOT a standard HumanEval pass@1 score.

  • Only 75 of 164 problems were scored (problem #76 hung during generation)
  • Uses create_test filter (test-generation evaluation pipeline), not standard pass@1 on code solutions
  • Not comparable to the official OmniMerge-v4 benchmark (84.76% at Q6_K, all 164 problems, with --reasoning-format deepseek --reasoning-budget 8192)
  • A full 164-problem run with standard pass@1 would give a definitive score

Position in Official Quant Ladder

Quant Size Tier
F16 (source) 50.11 GiB Lossless
IQ4_XS (uniform, official) 14.05 GiB Best uniform 4-bit
IQ4_XS (mixed-bit, ours) 12.76 GiB Beats IQ4_XS in PPL at 9% smaller
Q3_K_L (official) 13.36 GiB Standard 3-bit upper tier

My quant occupies a sweet spot: better perplexity than the larger IQ4_XS, smaller file than the smaller Q3_K_L.

Observations

  1. Mixed-bit wins at this size tier. At ~12.6 GiB, the optimizer can shift bits from low-impact FFN gates (iq3) to attention and output layers (iq5), producing measurably better quality than any uniform quantization at this or even 9% larger size.
  2. High behavioral agreement with uniform IQ4_XS. The 92% same-top-p and 0.037 KLD mean the two quants behave nearly identically per-token. The difference is a subtle quality uplift, not a different model.
  3. The HumanEval bug underscores a tooling gap. lm_eval's gguf model type strips whitespace, which is destructive for Python code. The local-completions model type doesn't have this issue. The preliminary 90.67% is indicative but not a final score.
  4. KLD here compares two quants, not quant-vs-F16. The absolute loss from the original model is unknown without loading the 50 GiB F16 reference. PPL (lower is better) is the more informative absolute metric.

Quantization Design

Why Mixed-Bit

The upstream ManniX-ITA GGUF ladder provides uniform quants — every tensor gets the same type. This quant instead uses 50 per-tensor regex rules guided by an importance matrix, allocating higher precision where it matters and compressing where it doesn't.

Key Decision: Output Weight "Safety Patch"

The GGUF-Tool-Suite optimizer's KLD-guided allocation chose iq4_k for output.weight. I overrode this to iq5_k — a deliberate +152 MB investment in the model's final linear layer, where every token's logits pass through. Under-quantizing the output head creates systematic logit errors that compound across generations. This attempted to mimic other Qwen3.6 quant attempts seen that tend to universally protect this tensor at q8_0 or higher.

Post-hoc note: Reference recipes at comparable BPW tiers suggest the optimizer would have preferred removing IQ3 from mid-layer FFN over upgrading output.weight. The override was a safety-first decision — the practical impact is small (PPL 6.86 speaks for itself), but in hindsight the optimizer's default allocation was likely just as good at this bit budget.

Mixed-Bit Strategy

Where Precision Why
Norms, biases, SSM params f32 (353 tensors) Numerical stability
SSM alpha/beta (early layers) q8_0 (58 tensors) State dynamics need fidelity
SSM alpha/beta (deep layers) iq6_k / q6_K (38 tensors) State tracking accumulates
Early attention layers (0–3) iq5_k Anchor layers, high impact
Full-attention blocks (every 4th) iq5_k for K/V/output Every 4th layer gets full attention
Bulk FFN + attention iq4_kt (257 tensors, 71%) Standard 4-bit, importance-guided
Deeper FFN gates iq3_k / iq3_kt (18 tensors) Where compression hurts least

Per-Tensor Type Distribution

QTYPE Count Use Case
f32 353 Norms, biases, SSM parameters
iq4_kt 257 Bulk FFN + attention weights
iq4_k 56 Selected FFN/attention
q8_0 58 SSM alpha/beta weights
iq5_k 38 Critical attention, output/embedding
iq4_ks 33 Selected smaller FFN/attention
iq6_k 18 Deep SSM weights
q6_K 20 Mid-layer SSM weights
iq3_k 7 Deeper FFN gates
iq3_kt 11 Embedding, early FFN

Hardware & Performance

VRAM & Context Limits

Tested on NVIDIA RTX 5070 Ti (16 GB, 16,302 MiB total) using q8_0 KV cache with flash attention.

Measured at 42k context (llama-sweep-bench, -c 42000):

Component Size
Model tensors (CUDA0) 11,746 MiB
Compute buffer (CUDA0) 505 MiB
KV cache (q8_0, 42,240 tok) 1,552 MiB (self: 1,403 MiB)
Total GPU ~13,803 MiB

KV cache per token (q8_0): 33.2 KB self size, ~36.8 KB with allocation overhead.

Practical max context: ~46k–48k — beyond this, the dynamic attention workspace (K×Q matrices, graph intermediates) competes with the remaining headroom and TG performance degrades sharply or OOM occurs. The naive linear extrapolation of KV cache alone suggests ~75k, but the real ceiling is set by the runtime compute workspace, which scales with context and is not pre-allocated.

Context Approx. total GPU VRAM Notes
8k ~12,300 MiB Comfortable, plenty of headroom
16k ~12,600 MiB
32k ~13,200 MiB
42k ~13,800 MiB Confirmed by sweep-bench
~48k ~14,300 MiB Practical max — attention workspace eats headroom
64k+ OOM / spill Dynamic workspace exceeds available VRAM
  • Text generation with sweep-bench ~40–47 t/s varies with KV cache fill level

Measured with flash attention, RTX 5070 Ti, ngl=99, n_ubatch=512. PP speed ~225 t/s is for 512-token prompt batches at the model's full precision; the earlier ~1,750 t/s figure was from PPL eval with larger batches (n_batch=512) over long sequences, measuring sustained throughput rather than step latency.

TG starts at ~47 t/s with empty KV cache and gradually declines to ~40 t/s as context fills, with occasional dips to ~36 t/s at intermediate cache sizes due to compute graph reshuffling.


Usage

llama-server

Recommended (safe, plenty of headroom):

llama-server \
  -m Qwen3.6-27B-Omnimerge-v4-IQ4_XS-12.6GiB.gguf \
  -c 32768 -ngl 99 -t 12 -ub 512 \
  --jinja --reasoning-budget 1024 \
  -ctk q8_0 -ctv q8_0 -fa

For maximum context (16 GB GPU, try 60k first):

llama-server \
  -m Qwen3.6-27B-Omnimerge-v4-IQ4_XS-12.6GiB.gguf \
  -c 60000 -ngl 99 -t 12 -ub 512 -amb 256 \
  --jinja --reasoning-budget 1024 \
  -ctk q8_0 -ctv q8_0 -fa

Tighter still (aggressive VRAM use):

llama-server \
  -m Qwen3.6-27B-Omnimerge-v4-IQ4_XS-12.6GiB.gguf \
  -c 60000 -ngl 99 -t 12 -ub 512 -amb 256 --fit --fit-margin 256 \
  --jinja --reasoning-budget 1024 \
  -ctk q8_0 -ctv q8_0 -fa
  • -ub 512 is essential — keeps the physical batch size small, reducing dynamic attention workspace and allowing larger context before VRAM contention
  • -amb 256 caps the attention K×Q workspace at 256 MiB, preventing blowup at large context (if too low, attention computes in multiple slower passes)
  • --fit --fit-margin 128 lets the runtime use VRAM more aggressively (default safety margin is 1024 MiB); won't offload to CPU unless absolutely necessary
  • 32k context is the safe sweet spot: fits comfortably on 12–16 GB GPUs
  • 60k–64k is the practical ceiling on 16 GB; start at 60k and increase if there's headroom
  • For 12 GB GPUs, start at -c 16384 and reduce if needed

lm_eval (reproducible benchmarks)

lm_eval --model gguf \
  --model_args base_url=http://localhost:8080,model_args=-c,8192,-ngl,99,--reasoning-format,deepseek,--reasoning-budget,8192 \
  --tasks humaneval,mbpp,gpqa_diamond_generative_n_shot \
  --output_path results/

Quantization Pipeline

Toolchain: ik_llama.cpp (official fork) via Thireus custom build (build 4758, GCC 11.4.0, Linux/WSL)
Importance matrix: imatrix-Qwen3.6-27B-BF16.dat — 497 entries, 829 chunks, ubergarm calibration corpus v02
Recipe: 48 per-tensor regex rules at recipes/Qwen3.6-27B-Omnimerge-v4-12.6GiB-IQ4_XS-13GiB.recipe.txt
Pipeline: Generate recipe (quant_assign.py) → Quantize (llama-quantize) → Evaluate (PPL, KLD)

Compression

Metric Value
Source (F16) 50,111 MiB
Quant size 13,054 MiB
Compression ratio 3.93×

Recipe Evolution

  1. Optimizer-generatedoutput.weight=iq4_k, using Qwen3.6-27B KLD calibration
  2. Manual overrideoutput.weight=iq5_k (regretted "safety patch", +152 MB)

Reproducing

# Quantize
llama-quantize \
  --imatrix imatrix-Qwen3.6-27B-BF16.dat \
  --override-kv general.quantization_version=2 \
  Qwen3.6-27B-Omnimerge-v4-F16.gguf \
  Qwen3.6-27B-Omnimerge-v4-IQ4_XS-12.6GiB.gguf \
  Q8_0 \
  local_quant/recipes/Qwen3.6-27B-Omnimerge-v4-12.6GiB-IQ4_XS-13GiB.recipe.txt

# Evaluate PPL
llama-perplexity -m Qwen3.6-27B-Omnimerge-v4-IQ4_XS-12.6GiB.gguf \
  -f wiki.test.raw -ngl 99 -c 512 -b 512 -fa

# KLD vs reference
llama-perplexity -m Qwen3.6-27B-Omnimerge-v4-IQ4_XS-12.6GiB.gguf \
  -f wiki.test.raw --kl-divergence -ngl 99 -c 512

Note: Q8_0 is the fallback type; actual quantization is determined by the recipe rules.


License & Credits

License: Apache-2.0 (inherited from Qwen/Qwen3.6-27B and OmniMerge-v4)

Source model: ManniX-ITA/Qwen3.6-27B-Omnimerge-v4 — DARE-TIES merge of Qwen3.6-27B with 3 fine-tunes, MLP-passthrough surgery.

Acknowledgements:


See Also

Downloads last month
16,822
GGUF
Model size
27B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for k0valik/Qwen3.6-27B-Omnimerge-v4-IQ4_XS-12.76GiB-GGUF

Quantized
(6)
this model

Collection including k0valik/Qwen3.6-27B-Omnimerge-v4-IQ4_XS-12.76GiB-GGUF