This work extends SWAN [baa.ai, 2026] and SAT [baa.ai, 2025]. No experimental results are reported; this is an algorithmic proposal paper.
Keywords: knowledge distillation, LLM compression, weight geometry, quantization readiness, SWAN, sensitivity analysis, kurtosis regularisation, spectral conditioning
Post-training quantization (PTQ) and knowledge distillation are the two most widely deployed strategies for reducing the cost of serving large language models. Yet they are almost always applied sequentially and independently: a model is distilled into a smaller student, and then the student is quantized. This sequential pipeline treats the student’s weight geometry as an incidental output of distillation—something to be corrected post-hoc if it causes quantization problems.
The SWAN framework [1] challenges this assumption at the PTQ level, demonstrating empirically that four lightweight, data-free metrics computed directly on weight tensors are strong predictors of actual quantization error: kurtosis (ρ=0.80), output sensitivity (ρ=0.69), SVD spectral concentration (ρ=0.40), and a direct reconstruction error proxy. Critically, these metrics are non-redundant (maximum inter-metric |ρ|=0.38), meaning they capture genuinely distinct aspects of a weight tensor’s fragility under precision reduction. SWAN uses this composite to drive per-tensor bit-width allocation without any calibration data.
SAT [2] takes the next step: rather than diagnosing pathological weight geometry after training, it prevents its emergence during pre-training. SAT adds kurtosis regularisation, spectral conditioning, and targeted quantization noise injection to the pre-training loop, producing models whose internal geometry is quantization-compatible by construction.
SAKD applies both insights to the distillation setting, where they have not previously appeared. The connection is natural but non-trivial. In distillation, the student is not trained from scratch under arbitrary gradient descent; it is trained to reproduce the teacher’s representations. This raises a question that neither SWAN nor SAT addresses directly: are the teacher’s high-sensitivity layers also the most important layers for the student to match? And: does distillation training, absent geometric constraints, lead the student to develop high-kurtosis or spectrally-concentrated weight matrices?
We argue that both questions have affirmative answers. The first follows from the empirical SWAN finding that sensitivity is primarily driven by weight geometry (outlier distributions, concentrated spectra) rather than layer position alone. The second follows from the fact that a student trained by feature-matching is rewarded for reproducing teacher representations—including any outlier structure in those representations—but receives no gradient signal discouraging pathological weight geometry in its own parameters.
These two observations motivate the three SAKD mechanisms described in this paper.
SWAN computes four per-tensor sensitivity scores, each normalised to [0, 1]:
Excess kurtosis (s_kurt): κ = E[(w − μ)⁴]/σ⁴ − 3, normalised as s_kurt = clip(κ/10, 0, 1). Kurtosis correlates at ρ=0.80 with actual 4-bit reconstruction error across the 2,347 tensors of Qwen3.5-397B—the strongest predictor of the four.
SVD spectral concentration (s_svd): the fraction of total singular-value energy in the top 10% of singular values, computed via randomised SVD (rank k=256). ρ=0.40 with reconstruction error.
Output noise amplification (s_amp): how much random perturbation (simulating quantization noise) is amplified through the linear transformation, estimated with random probe vectors rather than real activations. ρ=0.69. Log-scale normalisation is required to avoid saturation on smaller models.
Reconstruction error proxy (s_err): NRMSE of the weight tensor under simulated 4-bit round-to-nearest quantization with group size 128. The most direct measure; it contributes a constant offset at 397B scale due to normalisation saturation (a known v2 limitation).
The composite score is a weighted sum:

S = w_k · s_kurt + w_s · s_svd + w_a · s_amp + w_e · s_err

with empirically optimised weights (w_k=0.45, w_s=0.20, w_a=0.15, w_e=0.20 in SWAN v2). The composite achieves Pearson r=0.40 with reconstruction error, and the entire analysis completes in under 13 minutes on commodity hardware even for 400B+ parameter models.
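The four metrics and their composite can be sketched in NumPy. This is an illustrative reading of the definitions above, not the SWAN reference implementation; in particular it omits the cross-layer normalisation of the raw amplification values, and the helper names are ours.

```python
import numpy as np

def excess_kurtosis(w):
    """s_kurt: kappa = E[(w - mu)^4] / sigma^4 - 3, clipped to [0, 1] after /10."""
    w = w.ravel()
    kappa = np.mean((w - w.mean()) ** 4) / w.std() ** 4 - 3.0
    return float(np.clip(kappa / 10.0, 0.0, 1.0))

def svd_concentration(w):
    """s_svd: fraction of singular-value energy in the top 10% of singular values."""
    s = np.linalg.svd(w, compute_uv=False)
    k = max(1, int(0.10 * len(s)))
    return float((s[:k] ** 2).sum() / (s ** 2).sum())

def noise_amplification(w, n_probes=16, seed=0):
    """s_amp (unnormalised): mean ||W @ eps|| / ||eps|| over random probe vectors."""
    eps = np.random.default_rng(seed).standard_normal((w.shape[1], n_probes))
    return float(np.mean(np.linalg.norm(w @ eps, axis=0) /
                         np.linalg.norm(eps, axis=0)))

def rtn_error_proxy(w, bits=4, group=128):
    """s_err: NRMSE under simulated round-to-nearest quantization, group size 128."""
    flat = w.ravel()
    g = np.pad(flat, (0, (-len(flat)) % group)).reshape(-1, group)
    scale = np.abs(g).max(axis=1, keepdims=True) / (2 ** (bits - 1) - 1)
    scale[scale == 0] = 1.0
    err = (np.round(g / scale) * scale).ravel()[:len(flat)] - flat
    return float(np.sqrt(np.mean(err ** 2)) / (flat.std() + 1e-12))

def composite(w, wk=0.45, ws=0.20, wa=0.15, we=0.20):
    # Cross-layer [0, 1] normalisation of raw values is omitted in this sketch.
    return (wk * excess_kurtosis(w) + ws * svd_concentration(w)
            + wa * noise_amplification(w) + we * rtn_error_proxy(w))
```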
A key structural finding from SWAN is that sensitivity patterns are architecturally consistent: attention layers consistently receive 1.6–2.5× more bits than MLP/FFN layers across dense and MoE architectures. This suggests that the sensitivity landscape of a model is determined primarily by its weight geometry rather than by any particular input distribution—supporting the data-free approach.
SAT augments the standard pre-training objective with three geometry-controlling mechanisms. The composite loss is:

L_SAT = L_pretrain + λ_κ · L_KDS + λ_σ · L_SC

with TQNI entering through the forward pass rather than as an explicit loss term.
Kurtosis-Driven Stability (KDS). A one-sided penalty on excess kurtosis, scaled by the layer’s SWAN sensitivity score:

L_KDS = Σ_l S_l · max(0, κ(W_l) − κ_target)²
The one-sided design is critical: layers with kurtosis below the target are not penalised, preserving the natural expressiveness of well-behaved weight distributions. The SWAN-weighted coefficient ensures that higher-sensitivity layers receive proportionally stronger regularisation.
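As a check on the one-sided design, here is a minimal NumPy sketch of the per-layer penalty value S_l · max(0, κ − κ_target)² (the squared one-sided form matches the reconstruction above; SAT may use a different exponent, and the helper names are ours):

```python
import numpy as np

def excess_kurtosis(w):
    w = np.asarray(w).ravel()
    return float(np.mean((w - w.mean()) ** 4) / w.std() ** 4 - 3.0)

def kds_penalty(w, swan_score, kappa_target=2.0):
    # One-sided: zero whenever the layer's kurtosis is at or below the target.
    return swan_score * max(0.0, excess_kurtosis(w) - kappa_target) ** 2

rng = np.random.default_rng(0)
gaussian = rng.standard_normal(10_000)              # excess kurtosis ~ 0: no penalty
heavy = np.concatenate([gaussian, [50.0, -50.0]])   # two extreme outliers: penalised
```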
Spectral Conditioning (SC). Minimises the spectral concentration ratio to maintain distributed singular-value spectra:

L_SC = Σ_l σ_max(W_l)² / ||W_l||_F²
This implicitly maximises effective rank (the exponential of the singular-value entropy), making weight matrices more robust to the information loss inherent in quantization. It also has an independent benefit: bounding the spectral norm improves gradient stability during backpropagation.
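The concentration ratio and the effective rank it implicitly controls can be illustrated on two extreme matrices. This is a sketch: the effective-rank definition follows the exponential-entropy form stated above, and `spectral_stats` is a hypothetical helper.

```python
import numpy as np

def spectral_stats(w):
    s = np.linalg.svd(w, compute_uv=False)
    energy = s ** 2 / (s ** 2).sum()           # normalised singular-value energies
    concentration = float(energy[0])            # share of energy held by sigma_max
    eff_rank = float(np.exp(-(energy * np.log(energy + 1e-12)).sum()))
    return concentration, eff_rank

well_spread = np.eye(32)                        # all singular values equal
rank_one = np.outer(np.ones(32), np.ones(32))   # a single nonzero singular value
```

A well-spread spectrum gives effective rank near the full dimension; a rank-one matrix collapses it to 1, which is exactly the geometry the regulariser penalises.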
Targeted Quantization Noise Injection (TQNI). Applies quantization noise only to layers in the top k% of sensitivity scores (k=20 by default), concentrating the hardening effect of QAT without disrupting stable layers. The noise is calibrated to the target bit-width b:

W̃_l = W_l + ε_l,  ε_l ∼ U(−Δ_l/2, Δ_l/2),  Δ_l = (max(W_l) − min(W_l)) / (2^b − 1)
Applied only when S_l > θ_noise (the top-k% threshold). Gradients pass through via the Straight-Through Estimator (STE), identical to standard QAT practice.
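A hedged sketch of the forward-pass perturbation, assuming a uniform-noise model at the step size of a b-bit quantizer over the tensor’s range (SAT’s exact calibration may differ; `apply_tqni` and the score threshold handling are illustrative):

```python
import numpy as np

def tqni_noise(w, bits=4, seed=0):
    # Uniform noise at half-step scale of a b-bit quantizer over the tensor range.
    step = (w.max() - w.min()) / (2 ** bits - 1)
    rng = np.random.default_rng(seed)
    return w + rng.uniform(-step / 2, step / 2, size=w.shape)

def apply_tqni(weights, scores, theta=0.8, bits=4):
    # Perturb only layers whose sensitivity score exceeds the top-k% threshold.
    return {name: tqni_noise(w, bits) if scores[name] > theta else w
            for name, w in weights.items()}

weights = {"hot": np.linspace(-1.0, 1.0, 16).reshape(4, 4),
           "cold": np.linspace(-1.0, 1.0, 16).reshape(4, 4)}
scores = {"hot": 0.9, "cold": 0.1}   # hypothetical SWAN scores
```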
Dynamic Bit-Width Allocation (DBWA). Every D=1000 steps, SWAN metrics are recomputed over a small calibration batch and per-layer training precision is reassigned: bottom quartile at 8-bit, middle half at 12-bit, top quartile at 16-bit. At the default 25/50/25 split, this reduces average training precision to 12-bit (vs 16-bit BF16), a 25% reduction in parameter memory footprint with corresponding reductions in gradient and optimiser state.
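The 25/50/25 quartile split can be sketched directly; `dbwa_assign` is a hypothetical helper that reproduces the stated 12-bit average on an evenly spread sensitivity profile.

```python
import numpy as np

def dbwa_assign(scores):
    # Bottom quartile -> 8-bit, middle half -> 12-bit, top quartile -> 16-bit.
    q25, q75 = np.quantile(scores, [0.25, 0.75])
    return np.where(scores <= q25, 8, np.where(scores <= q75, 12, 16))

scores = np.arange(8) / 7.0        # toy 8-layer profile, evenly spread sensitivity
bits = dbwa_assign(scores)         # 2 layers at 8-bit, 4 at 12-bit, 2 at 16-bit
```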
Knowledge distillation [5] trains a student to match a teacher’s output distribution. Feature-based methods additionally match intermediate representations. TinyBERT [6] distils attention matrices and hidden states layer-by-layer. Patient KD [10] selects which teacher layers to supervise, but via architectural heuristics rather than empirically measured importance. MiniLLM [4] addresses the forward KL bias by using reverse KL. DistiLLM [7] uses a skew divergence formulation.
No existing distillation method uses weight-geometry metrics to weight supervision. This is the gap SAKD addresses. The closest prior work is AWQ [8], which uses activation magnitude to identify salient weight channels—a per-parameter signal rather than the per-tensor composite that SWAN provides. The connection to SAT is more direct: SAT demonstrates that weight geometry can be controlled during training; SAKD asks whether the same control is beneficial during distillation fine-tuning.
Before describing the SAKD framework, we state the two core motivating claims explicitly. These are presented as motivated hypotheses grounded in the SWAN/SAT findings rather than experimentally verified facts in this paper.
Standard feature-matching distillation assigns equal loss weight to every teacher-student layer pair. SWAN demonstrates that teacher layers differ substantially in their weight geometry: kurtosis scores span several orders of magnitude across tensors in the same model (Fig. 4 in the SWAN paper shows this clearly for Qwen3.5-397B). Layers with high SWAN scores have weight distributions that are more fragile to perturbation—whether that perturbation is quantization rounding or a student’s imperfect reproduction.
The hypothesis is therefore: layers with high SWAN sensitivity scores are those where the student’s representational error causes the largest degradation in the teacher’s functional behaviour, because these are the layers where small input perturbations are most strongly amplified toward the output. This is grounded in the SWAN output noise amplification metric (ρ=0.69 with reconstruction error), which directly measures how a layer transforms input perturbations into output deviations.
Testable prediction: Training a student with SWAN-weighted layer losses will produce lower output KL divergence from the teacher than uniform-weighted training, at matched gradient budget, because the weighted approach concentrates gradient signal at the layers that matter most for output quality.
When a student is trained by feature-matching, its parameters are optimised to reproduce teacher representations. If those representations contain high-kurtosis structure (as SWAN documents they frequently do), the student’s weight matrices may develop correspondingly outlier-prone distributions. Even if the teacher’s geometry is clean, the unconstrained optimisation of a feature-matching loss has no mechanism to prevent kurtosis growth—a problem SAT identifies specifically in unconstrained gradient descent.
The pathological geometry of student weights matters beyond quantization: SAT’s theoretical analysis connects spectral concentration to gradient instability during training. A student with concentrated singular-value spectra will have less stable training dynamics, which may explain the commonly observed instability of aggressive feature-matching distillation at low student capacity.
Testable prediction: Students trained with kurtosis and spectral regularisation will have lower SWAN scores than students trained without these constraints, at matched perplexity—meaning they will quantize better post-distillation without additional PTQ correction.
Let T denote the teacher with weight tensors {W_l^T} for l = 1..N, and S the student with weight tensors {W_l^S} for l = 1..M, where M ≤ N. Given a training set D_train, the SAKD pipeline proceeds in three phases.
Before any training, compute SWAN sensitivity scores over the teacher’s weight tensors. No calibration data is required. The analysis runs directly on the weight matrices:

S_l = w_k · s_kurt(W_l^T) + w_s · s_svd(W_l^T) + w_a · s_amp(W_l^T) + w_e · s_err(W_l^T)
Normalise to [0, 1] across layers. Store the sensitivity map {S_l} for use throughout training. The entire computation takes under 13 minutes for 400B+ parameter models.
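The cross-layer normalisation step admits a short min-max sketch, applied per raw metric over all layers (`normalise` is an illustrative helper):

```python
import numpy as np

def normalise(raw):
    # Min-max scale one metric's raw values over all layers into [0, 1].
    lo, hi = raw.min(), raw.max()
    return np.zeros_like(raw) if hi == lo else (raw - lo) / (hi - lo)

raw = np.array([0.2, 1.4, 0.8])    # toy raw metric values for three layers
out = normalise(raw)
```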
These weights (0.45/0.20/0.15/0.20) reflect the SWAN v2 empirical optimisation over Qwen3.5-397B. We recommend revalidating these weights when the target architecture differs substantially from Qwen-family dense or MoE models, using the correlation analysis described in the SWAN paper (§4.2 therein).
When M < N (the student is smaller than the teacher), a subset of teacher layers must be selected for direct supervision. SAKD prioritises the most sensitive teacher layers: the M teacher layers with the highest SWAN scores S_l are selected and supervised in their original depth order.
This heuristic is motivated by the SWAN finding that sensitivity patterns are architecturally consistent: attention layers systematically receive 1.6–2.5× more bits than MLP layers under SWAN allocation, reflecting higher sensitivity scores. A sensitivity-priority alignment therefore concentrates the student’s representational capacity on attention-heavy teacher layers, which SWAN identifies as the highest-risk layers under quantization.
Limitation: This alignment assumes that SWAN scores on teacher weight tensors predict the same ordering of layer importance from the student’s perspective. This is the central empirical hypothesis of SAKD and the primary target of the proposed experiments (§5).
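Under these assumptions, the alignment reduces to a top-M selection that preserves depth order. `align_layers` below is an illustrative helper, not SAKD reference code:

```python
def align_layers(teacher_scores, m):
    """Pick the m teacher layers with the highest SWAN scores, in depth order."""
    ranked = sorted(range(len(teacher_scores)),
                    key=lambda i: teacher_scores[i], reverse=True)
    return sorted(ranked[:m])      # restore depth order so the mapping is monotone

scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]   # toy 6-layer teacher sensitivity profile
```

Restoring depth order keeps the student-to-teacher mapping monotone, so layer i of the student never supervises a teacher layer shallower than the one supervised by layer i−1.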
The SAKD training objective is:

L_SAKD = β · L_logit + γ · L_feat + λ_κ · L_KDS + λ_σ · L_SC

with β and γ the logit-KD and feature-matching coefficients (Table 3), and the two regularisers applied to the student’s weights.
The feature-matching loss weights each layer pair by the teacher’s SWAN score:

L_feat = Σ_l α_l · F(P_l h_l^S, h_l^T)

where P_l is a learned linear projection mapping the student hidden dimension to the teacher hidden dimension, and F is a feature-matching criterion. We recommend the cosine similarity loss (1 − cosine(·, ·)), as it is scale-invariant and has shown robustness across distillation tasks in prior work.
The sensitivity weights use a softmax with temperature τ:

α_l = exp(S_l / τ) / Σ_j exp(S_j / τ)
Temperature τ controls concentration. At τ=1 weights are spread across layers; at τ=0.3 the top-sensitivity quartile receives approximately 4× the weight of the median layer. We recommend a progressive annealing schedule, linear in the training step t with total steps T:

τ(t) = τ_max − (τ_max − τ_min) · t/T
This schedule starts with broad supervision (useful when the student is far from the teacher everywhere) and progressively concentrates on high-sensitivity layers (more efficient once low-sensitivity layers are well-approximated). Setting τ→∞ recovers standard uniform-weight distillation, making SAKD a strict generalisation.
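The temperature-controlled weighting and a linear annealing schedule (the linear form is our assumption) can be sketched as:

```python
import numpy as np

def layer_weights(scores, tau):
    # Numerically stable softmax over SWAN scores with temperature tau.
    z = np.exp((scores - scores.max()) / tau)
    return z / z.sum()

def tau_schedule(step, total, tau_max=1.0, tau_min=0.3):
    # Linear anneal from tau_max at step 0 to tau_min at the final step.
    return tau_max - (tau_max - tau_min) * step / total

scores = np.array([0.1, 0.5, 0.9])
broad = layer_weights(scores, tau=1.0)   # early training: near-uniform supervision
sharp = layer_weights(scores, tau=0.1)   # late training: concentrated on top layer
```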
The logit-level term Llogit is standard reverse-KL divergence between teacher and student output distributions, following MiniLLM [4]. Reverse KL is preferred over forward KL as it avoids the student being penalised for zero-probability teacher tokens, a known problem in LLM distillation identified by Gu et al. (2024).
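Reverse KL from raw logits can be sketched as follows; this is a generic implementation of KL(q_S ‖ q_T), not MiniLLM’s exact training loss:

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)   # shift for stability
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def reverse_kl(student_logits, teacher_logits):
    # KL(q_S || q_T) = sum_v q_S(v) * (log q_S(v) - log q_T(v)), averaged over rows.
    log_qs = log_softmax(student_logits)
    log_qt = log_softmax(teacher_logits)
    return float((np.exp(log_qs) * (log_qs - log_qt)).sum(axis=-1).mean())

t = np.array([[2.0, 0.0, -2.0]])       # toy teacher logits for one position
s_same = t.copy()                      # perfectly matched student
s_diff = np.array([[0.0, 0.0, 2.0]])   # mismatched student
```

Because the expectation is taken under the student distribution, tokens the student assigns near-zero probability contribute almost nothing, which is the mode-seeking behaviour the text describes.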
The SAT-derived regularisation terms are applied to student weights during distillation training:
Prevents the student from developing outlier weight distributions in the process of matching teacher representations:

L_KDS = Σ_l S_l^S · max(0, κ(W_l^S) − κ_target)²
where S_l^S is the SWAN score of the student’s own weight tensors, recomputed at each DBWA checkpoint (every D=1000 steps). The one-sided max(0, ·) design is inherited from SAT: layers with kurtosis below the target are not penalised, preserving the natural expressiveness of well-behaved distributions.
The target ceiling κ_target ∈ [1.5, 2.5] is the same hyperparameter as in SAT. This range permits moderately heavy-tailed distributions while eliminating only the extreme outliers that cause quantization damage. SAT establishes (§4.1 therein) that kurtosis regularisation does not reduce function-space expressiveness; it merely biases optimisation toward quantization-friendly solutions within that space.
Minimises spectral concentration in student weight matrices:

L_SC = Σ_l σ_max(W_l^S)² / ||W_l^S||_F²
σ_max is approximated via power iteration (O(mn) per iteration per layer), amortised by running it every k=10 steps. This regulariser serves two purposes simultaneously: (1) it maintains distributed singular-value spectra for quantization robustness, and (2) it bounds the spectral norm of weight updates, improving gradient stability during distillation training—a benefit particularly relevant when the student is receiving mismatched supervision from a much larger teacher.
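Power iteration for σ_max is a few lines; this sketch runs a fixed number of iterations rather than checking convergence, and is not the amortised in-training variant:

```python
import numpy as np

def sigma_max(w, iters=50, seed=0):
    """Estimate the largest singular value of w by power iteration on w.T @ w."""
    v = np.random.default_rng(seed).standard_normal(w.shape[1])
    for _ in range(iters):
        u = w @ v            # one matvec: O(mn)
        v = w.T @ u          # second matvec: O(mn)
        v /= np.linalg.norm(v)
    return float(np.linalg.norm(w @ v))

w = np.diag([3.0, 1.0, 0.5])   # known spectrum: sigma_max = 3
```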
Key claim: Student geometry regularisation produces students that are jointly optimised for (a) knowledge transfer from the teacher and (b) quantization-readiness of their own weights. Without these regularisers, standard distillation only optimises (a), and may actively harm (b) by causing the student to reproduce pathological teacher representations in its own parameters.
TDNI is the distillation adaptation of SAT’s TQNI. For student layers aligned to teacher layers with S_l > θ_noise (the top-20% SWAN threshold), the student weight tensor is perturbed during the forward pass:

W̃_l^S = W_l^S + ε_l,  ε_l ∼ U(−Δ_l/2, Δ_l/2),  Δ_l = (max(W_l^S) − min(W_l^S)) / (2^b − 1)
where b is the target deployment bit-width (typically 4). Gradients pass through via STE.
The effect is to simultaneously train two objectives in a single forward pass: the distillation loss trains the student to approximate the teacher representation, while the noise injection trains the student’s parameters to be robust to the quantization noise that will be applied at deployment. This is more efficient than the standard pipeline of distillation followed by QAT fine-tuning, and it concentrates the hardening effect where SWAN indicates it is most needed.
TDNI differs from uniform QAT in the same way that TQNI differs from uniform QAT: it applies noise only to high-sensitivity layers, avoiding disruption of stable layers whose distillation training does not benefit from hardening. It differs from TQNI in that the sensitivity threshold is applied to the teacher’s SWAN scores (which layer of the teacher is being matched) rather than the student’s own scores, because the teacher’s sensitivity map is the stable, one-time-computed reference.
A question that any rigorous reviewer would raise: do the SAKD geometry regularisers interact constructively with AdamW’s second-moment normalisation? We address this explicitly.
AdamW normalises each gradient component by its running second moment, effectively adapting the learning rate per-parameter. This is a gradient-space operation. The SAKD regularisers (kurtosis, spectral) operate in weight-space: they add penalty terms to the loss function that produce additional gradient contributions. These contributions then pass through AdamW’s normalisation like any other gradient. There is no fundamental conflict.
However, a potential interaction arises: if kurtosis regularisation consistently reduces the magnitude of gradient components for outlier weights (by penalising high-|w| values), AdamW’s second-moment estimate for those components will trend downward over time, effectively increasing their adaptive learning rate. This could create a feedback loop that partially counteracts the regulariser. The SAT paper notes this interaction (§4.3 therein) and recommends scaling learning rates for lower-precision layers upward by the precision reduction ratio. In the SAKD context, we recommend monitoring kurtosis evolution during early training and scaling λκ upward if kurtosis growth is not suppressed.
Note on gradient whitening: A previous version of this framework proposed SWAN-inspired gradient whitening (WDG) as a fourth mechanism. This was removed because AdamW already performs a structural analogue of gradient whitening via second-moment normalisation. A WDG component would be meaningful only if the student were trained with a stateless optimiser such as SGD or Muon; in that case, SWAN-style per-layer gradient normalisation (as described in the SAT framework) could be directly applied to the distillation loss gradient.
Transparency note: No experimental results are reported in this paper. The following describes the experimental protocol we propose to validate SAKD. This framing mirrors the approach taken in the SAT paper, which similarly presents a theoretical framework and defers empirical validation to future work.
Three questions should drive the experimental design:
Q1: Does SWAN-weighted supervision outperform uniform-weight supervision? This tests the central hypothesis of §3.1. Evaluation: WikiText-103 perplexity and downstream benchmark scores (HellaSwag, MMLU, ARC-Challenge, GSM8K) for SAKD vs PKD-uniform at matched training budget, identical architecture and teacher.
Q2: Does SGR produce students with lower SWAN scores? This tests the central hypothesis of §3.2. Evaluation: SWAN composite scores of the trained student, and post-hoc PTQ quality (perplexity after uniform 4-bit RTN quantization) with vs without SGR regularisation. A student with lower SWAN scores should quantize better without additional PTQ correction.
Q3: Does TDNI improve post-distillation quantization quality without degrading distillation quality? Evaluation: distilled student perplexity with TDNI vs without, then post-distillation 4-bit PTQ perplexity. TDNI should improve the latter with minimal impact on the former.
| Teacher | Student | Compression Ratio | Family | Primary Purpose |
|---|---|---|---|---|
| Qwen3-8B | Qwen3-1.7B* | ~5× | Qwen | Test Qs 1 & 2 |
| Qwen3-8B | Qwen3-0.6B* | ~13× | Qwen | Stress-test (aggressive) |
| Llama-3.1-8B | Llama-3.2-1B | ~8× | Llama | Cross-architecture |
| Qwen3.5-397B (4-bit) | Qwen3-8B* | ~50× | MoE→dense | Test Q3 (TDNI) |
Table 1: Proposed model pairs for experimental validation. No results are reported; these are proposed experiments. * Student architecture uses standard Qwen/Llama architecture without SWAN-guided capacity allocation.
Baselines: (1) Standard output KD only. (2) Patient KD with uniform layer weights. (3) MiniLLM (reverse-KL, no feature matching). (4) SAKD without SGR (to isolate geometry regularisation contribution). (5) SAKD without TDNI.
The novel metric in SAKD evaluation is the SWAN audit: after training, run SWAN’s full pipeline (all four metrics) on the trained student. This produces a per-tensor sensitivity profile. The primary metric is the distribution of composite SWAN scores across student tensors. Secondary metric: perplexity after applying uniform 4-bit RTN quantization to the trained student (without any further PTQ calibration). This directly tests whether the student’s learned geometry is quantization-ready.
This metric does not exist in prior distillation benchmarks. Its introduction is itself a contribution: it reframes distillation evaluation as encompassing not only task performance but also deployment geometry.
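The quantization half of the audit can be sketched with uniform symmetric 4-bit RTN and per-tensor NRMSE; this is an illustrative stand-in for the full pipeline (the real audit would follow with a perplexity measurement), and `audit_nrmse` is a hypothetical helper:

```python
import numpy as np

def rtn(w, bits=4):
    """Uniform symmetric round-to-nearest quantization of a whole tensor."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    if scale == 0:
        return w.copy()
    q = np.clip(np.round(w / scale), -2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return q * scale

def audit_nrmse(tensors, bits=4):
    """Per-tensor NRMSE of the quantized weights vs the trained weights."""
    return {name: float(np.sqrt(np.mean((rtn(w, bits) - w) ** 2))
                        / (w.std() + 1e-12))
            for name, w in tensors.items()}
```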
We provide informal but rigorous theoretical motivation for the SAKD components. We deliberately avoid presenting results as formal theorems, given that the underlying assumptions (local linearity, bounded curvature, etc.) require empirical validation before they support theorem-level claims.
Consider a teacher layer l with weight W and output h = Wx. The output noise amplification metric s_amp measures the expected ratio ||W·ε|| / ||ε|| for random probe vectors ε. For isotropic random ε this expectation is approximately the root-mean-square singular value of W, and it is bounded above by the largest singular value (the operator norm). High s_amp therefore indicates that W amplifies perturbations strongly—i.e., that the layer has large Jacobian norm with respect to its inputs.
A layer with large Jacobian norm ∂h/∂x propagates representational errors from the student strongly forward through the network. If the student makes an error εl = hlS − hlT at layer l, the expected contribution to the final output divergence is proportional to the product of Jacobian norms of all subsequent layers. High-sensitivity layers (those near the end of the network or with intrinsically large Jacobian norms) therefore contribute disproportionately to output-level teacher-student divergence—motivating higher distillation loss weights.
The kurtosis metric skurt is a weight-distribution statistic with no direct Jacobian interpretation. Its SWAN correlation (ρ=0.80) reflects the empirical observation that layers with outlier weight distributions tend to have high effective condition numbers, which in turn amplifies the Frobenius-norm impact of quantization noise on outputs. This is consistent with the Hessian-based sensitivity analyses in GPTQ [3] and HAWQ [11], which find that layers with poor weight conditioning require more bits.
SAT establishes (§4.1) that kurtosis regularisation does not reduce model expressiveness because it targets the tails of weight distributions (the tiny fraction of weights responsible for outlier kurtosis) rather than the bulk distribution. The same argument applies in the distillation setting: the student is not prevented from learning to reproduce teacher representations; it is only prevented from doing so by developing weight matrices with extreme outliers.
There is one distillation-specific consideration: if the teacher’s hidden representations themselves have high kurtosis (as is common in transformer models trained without geometry control), feature-matching loss will push the student’s activations to match those high-kurtosis outputs. Kurtosis regularisation on student weights does not prevent this activation-level matching—it only constrains the weight distribution that produces those activations. A student can produce high-kurtosis activations from well-distributed weights if the input distribution is itself high-kurtosis, though the gradient dynamics become more stable.
SWAN and SAT together establish a position: weight geometry is not an incidental property of trained models but a controllable quantity with measurable consequences for deployment. SAKD extends this position to the distillation setting, arguing that a distilled student should be optimised simultaneously for task-performance alignment and geometric deployment-readiness.
This is a shift from the standard framing of distillation as purely a knowledge-transfer problem. In the standard framing, the student succeeds if it approximates the teacher’s task performance; geometric properties are a post-hoc concern. SAKD argues that geometric properties can be optimised during distillation at negligible cost (the SAT regularisers add <1% overhead relative to a forward pass), and that doing so can eliminate the need for a separate PTQ step—or at minimum, produce a student that responds better to PTQ when it is applied.
The SWAN paper identifies several open questions that SAKD directly addresses:
“Combining SWAN’s sensitivity map with calibration-based fine-tuning for a ‘coarse-to-fine’ quantization pipeline”: SAKD is a form of coarse-to-fine pipeline, but with distillation as the refinement stage rather than PTQ fine-tuning. The student’s reduced parameter count makes calibration-based methods more tractable after SAKD training.
“Reconstruction error proxy saturation at the 397B scale”: SAKD’s use of the composite score for loss weighting rather than bit-width allocation means that normalisation saturation is less damaging. A saturated s_err contributes a constant term to all layer weights equally, reverting toward the other three metrics rather than causing incorrect bit assignments.
“Adaptive reconstruction error normalisation across model scales”: When applying SAKD across different teacher sizes, per-architecture normalisation of the reconstruction error proxy should be applied, using the same adaptive scaling the SWAN paper recommends for v3 development.
We enumerate the primary limitations of the SAKD framework as presented:
Unvalidated core hypotheses. The two central claims of §3 are not experimentally verified in this paper. The framework’s value depends on these holding empirically. They are falsifiable and we have described experiments to test them.
SWAN score transferability assumption. We assume that teacher SWAN scores accurately proxy which teacher layers are most important for a student to match. This assumes that sensitivity to quantization noise and sensitivity to student representational error are correlated—a reasonable but unverified assumption. If teacher layers are sensitive to quantization specifically due to outlier activations (rather than weight geometry), the weight-based SWAN metrics may not translate.
Hyperparameter coupling. SAT’s hyperparameters (κ_target, λ_κ, λ_σ, θ_noise, D) were developed for pre-training from scratch. Their optimal values in the distillation fine-tuning setting—where the student starts from a pretrained checkpoint and is trained for far fewer steps—may differ substantially. Systematic hyperparameter exploration is needed.
Metric saturation on smaller models. SWAN’s output noise amplification saturates at 1.0 for all tensors in the 8B-scale model (v1 metrics). This means the composite score on smaller models is effectively three-metric rather than four-metric. The log-scale normalisation in v2 partially addresses this; we recommend applying v2 metrics and validating correlation statistics on any new architecture before relying on the composite for SAKD weighting.
No multi-teacher extension. SAKD is designed for single-teacher distillation. Extension to multi-teacher settings (where different teachers are used for different layer groups) would require a reconciliation of SWAN profiles across teacher architectures.
We have presented SAKD, a framework that applies the SWAN/SAT weight-geometry philosophy to the knowledge distillation setting. The framework makes three contributions: sensitivity-weighted feature-matching loss derived from data-free SWAN profiling of the teacher; student geometry regularisation using SAT-derived kurtosis and spectral conditioning; and Targeted Distillation Noise Injection, which co-trains knowledge transfer and quantization hardening in a single pass.
SAKD is grounded in real empirical findings: the SWAN validation of its four metrics against actual quantization error (up to ρ=0.80 for kurtosis, ρ=0.69 for output sensitivity), and SAT’s demonstration that weight geometry can be regulated during training without sacrificing expressiveness. The framework proposes two testable hypotheses—that SWAN-weighted supervision outperforms uniform supervision, and that SGR produces more quantization-ready students—with concrete evaluation protocols described.
The honest assessment is that SAKD makes a coherent and motivated proposal. Whether it works empirically is an open question that requires implementation. The framework’s value lies in treating distillation and deployment-readiness as jointly optimisable, rather than sequential concerns—an architectural choice that, if validated, would simplify the LLM compression pipeline while improving the quality of its outputs.
[1] baa.ai (2026). “SWAN: Data-Free Mixed-Precision Quantization for Large Language Models via Multi-Metric Sensitivity Analysis.” baa.ai Research Publication.
[2] baa.ai (2025). “Sensitivity-Aware Training (SAT): Using Statistical Weight Geometry to Guide LLM Training Dynamics.” baa.ai Research Publication, CC BY-NC-ND 4.0.
[3] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.” In ICLR, 2023.
[4] Y. Gu, L. Dong, F. Wei, and M. Huang. “MiniLLM: Knowledge Distillation of Large Language Models.” In ICLR, 2024.
[5] G. Hinton, O. Vinyals, and J. Dean. “Distilling the Knowledge in a Neural Network.” NIPS Deep Learning Workshop, 2015.
[6] X. Jiao et al. “TinyBERT: Distilling BERT for Natural Language Understanding.” EMNLP Findings, 2020.
[7] J. Ko et al. “DistiLLM: Towards Streamlined Distillation for Large Language Models.” In ICML, 2024.
[8] J. Lin et al. “AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration.” In MLSys, 2024.
[9] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. “Spectral Normalization for Generative Adversarial Networks.” In ICLR, 2018.
[10] S. Sun et al. “Patient Knowledge Distillation for BERT Model Compression.” In EMNLP, 2019.
[11] Z. Dong et al. “HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision.” In ICCV, 2019.
[12] M. S. Akhondzadeh et al. “KurTail: Kurtosis-based LLM Quantization.” Findings of EMNLP, 2025.
[13] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer. “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.” NeurIPS, 35, 30318–30332, 2022.
The following table reproduces the SWAN metric correlation statistics from the published SWAN paper [1], Table 2, computed on Qwen3.5-397B (2,347 weight tensors). These are the empirical values that ground SAKD’s use of SWAN scores as distillation weights.
| Metric | Spearman ρ | p-value | Pearson r | p-value | SAKD Weight |
|---|---|---|---|---|---|
| Excess kurtosis | 0.796 | <0.001 | 0.347 | <0.001 | 0.45 |
| Output noise amplification | 0.694 | <0.001 | 0.316 | <0.001 | 0.15 |
| SVD spectral concentration | 0.399 | <0.001 | 0.298 | <0.001 | 0.20 |
| Reconstruction error proxy | —* | —* | —* | —* | 0.20 |
Table 2: SWAN metric correlation with 4-bit reconstruction error (Qwen3.5-397B, 2,347 tensors). * Reconstruction error proxy saturates at 1.0 for most tensors at 397B scale, preventing reliable Spearman correlation measurement. SAKD inherits the v2 weight of 0.20 but recommends adaptive normalisation to prevent saturation (see §7.3).
| Parameter | Recommended | Description | Source |
|---|---|---|---|
| κ_target | 1.5–2.5 | Kurtosis ceiling; above this, the regulariser activates | SAT §6.2 |
| λ_κ | 1e-4 – 1e-3 | Global kurtosis regularisation coefficient | SAT §6.2 |
| λ_σ | 1e-4 | Global spectral conditioning coefficient | SAT §6.2 |
| τ_max, τ_min | 1.0, 0.3 | Temperature annealing range for loss weighting | SAKD §4.3 |
| θ_noise | Top 20% | TDNI activation threshold (SWAN score percentile) | SAT §3.3.2 (TQNI) |
| b (TDNI) | 4 | Target bit-width for noise calibration | SAKD §4.5 |
| D (DBWA interval) | 1000 steps | Steps between SWAN re-diagnostic checkpoints | SAT §3.4.2 |
| β, γ | 0.5, 0.3 | Logit-KD and feature-matching loss coefficients | SAKD §4.3 |
Table 3: SAKD hyperparameter reference. All SAT-inherited parameters retain their recommended ranges from the SAT paper. SAKD-specific parameters (τ, β, γ) are initial recommendations pending empirical tuning.
© 2026 baa.ai. All rights reserved. Licensed under CC BY-NC-ND 4.0.
Generated from SAKD research data. Last updated: February 2026.