SAKD: SWAN-Guided Knowledge Distillation — Applying Weight-Geometry Sensitivity Analysis to Teacher-Student Transfer

This work extends SWAN [baa.ai, 2026] and SAT [baa.ai, 2025]. No experimental results are reported; this is an algorithmic proposal paper.

Abstract. The SWAN framework demonstrates that four data-free weight-geometry metrics—excess kurtosis, SVD spectral concentration, output noise amplification, and reconstruction error proxy—are empirically predictive of quantization error, with kurtosis alone achieving Spearman ρ = 0.80 across 2,347 tensors in a 397B-parameter model. SAT extends this diagnostic into the training loop, embedding kurtosis regularisation, spectral conditioning, and targeted noise injection directly into pre-training to produce models whose weight geometry is quantization-ready by construction. We propose SAKD (SWAN-guided Knowledge Distillation), a framework that transfers these two ideas into the teacher-student distillation setting. SAKD makes three contributions. First, it uses SWAN’s multi-metric composite score, computed over teacher weights without any calibration data, as a principled weighting function for intermediate distillation losses—so that the student receives proportionally stronger supervision from teacher layers whose weight geometry indicates high quantization sensitivity. Second, it applies SAT-derived kurtosis and spectral regularisation to the student during distillation training, preventing the student from developing the same pathological weight distributions that unconstrained distillation tends to produce or amplify. Third, it introduces Targeted Distillation Noise Injection (TDNI), which co-trains distillation fidelity and quantization hardening in a single forward pass by injecting calibrated quantization noise into student layers aligned to high-SWAN-score teacher layers. Together these mechanisms produce students that are simultaneously better knowledge-transfer targets and better quantization-ready models—without additional training runs. This paper presents the theoretical framework, detailed algorithm, and a proposed experimental protocol. No empirical results are reported; this is an algorithmic proposal intended to motivate implementation.

Keywords: knowledge distillation, LLM compression, weight geometry, quantization readiness, SWAN, sensitivity analysis, kurtosis regularisation, spectral conditioning

1. Introduction

Post-training quantization (PTQ) and knowledge distillation are the two most widely deployed strategies for reducing the cost of serving large language models. Yet they are almost always applied sequentially and independently: a model is distilled into a smaller student, and then the student is quantized. This sequential pipeline treats the student’s weight geometry as an incidental output of distillation—something to be corrected post-hoc if it causes quantization problems.

The SWAN framework [1] challenges this assumption at the PTQ level, demonstrating empirically that four lightweight, data-free metrics computed directly on weight tensors are strong predictors of actual quantization error: kurtosis (ρ=0.80), output sensitivity (ρ=0.69), SVD spectral concentration (ρ=0.40), and a direct reconstruction error proxy. Critically, these metrics are non-redundant (maximum inter-metric |ρ|=0.38), meaning they capture genuinely distinct aspects of a weight tensor’s fragility under precision reduction. SWAN uses this composite to drive per-tensor bit-width allocation without any calibration data.

SAT [2] takes the next step: rather than diagnosing pathological weight geometry after training, it prevents its emergence during pre-training. SAT adds kurtosis regularisation, spectral conditioning, and targeted quantization noise injection to the pre-training loop, producing models whose internal geometry is quantization-compatible by construction.

SAKD applies both insights to the distillation setting, where they have not previously appeared. The connection is natural but non-trivial. In distillation, the student is not trained from scratch under arbitrary gradient descent; it is trained to reproduce the teacher’s representations. This raises a question that neither SWAN nor SAT addresses directly: are the teacher’s high-sensitivity layers also the most important layers for the student to match? And: does distillation training, absent geometric constraints, lead the student to develop high-kurtosis or spectrally-concentrated weight matrices?

We argue that both questions have affirmative answers. The first follows from the empirical SWAN finding that sensitivity is primarily driven by weight geometry (outlier distributions, concentrated spectra) rather than layer position alone. The second follows from the fact that a student trained by feature-matching is rewarded for reproducing teacher representations—including any outlier structure in those representations—but receives no gradient signal discouraging pathological weight geometry in its own parameters.

These two observations motivate the three SAKD mechanisms described in this paper.

2. Background

2.1 SWAN: Data-Free Sensitivity Metrics

Excess kurtosis (s_kurt): κ = E[(w−μ)⁴/σ⁴] − 3, normalised as s_kurt = clip(κ/10, 0, 1). Kurtosis ρ=0.80 with actual 4-bit reconstruction error across 2,347 tensors of Qwen3.5-397B—the strongest predictor of the four.

SVD spectral concentration (s_svd): fraction of total singular-value energy in the top 10% of singular values, computed via randomised SVD (rank k=256). ρ=0.40 with reconstruction error.

Output noise amplification (s_amp): how much random perturbation (simulating quantization noise) is amplified through the linear transformation, estimated with random probe vectors without real activations. ρ=0.69. Log-scale normalisation required to avoid saturation on smaller models.

Reconstruction error proxy (s_err): NRMSE of the weight tensor under simulated 4-bit round-to-nearest quantization with group size 128. The most direct measure; contributes a constant offset at 397B scale due to normalisation saturation (a known v2 limitation).

with empirically optimised weights (w_k=0.45, w_s=0.20, w_a=0.15, w_e=0.20 in SWAN v2). The composite achieves Pearson r=0.40 with reconstruction error, and the entire analysis completes in under 13 minutes on commodity hardware even for 400B+ parameter models.

A key structural finding from SWAN is that sensitivity patterns are architecturally consistent: attention layers consistently receive 1.6–2.5× more bits than MLP/FFN layers across dense and MoE architectures. This suggests that the sensitivity landscape of a model is determined primarily by its weight geometry rather than by any particular input distribution—supporting the data-free approach.

2.2 SAT: Embedding Sensitivity Control Into Training

SAT augments the standard pre-training objective with three geometry-controlling mechanisms. The composite loss is:

Kurtosis-Driven Stability (KDS). A one-sided penalty on excess kurtosis, scaled by the layer’s SWAN sensitivity score:

The one-sided design is critical: layers with kurtosis below the target are not penalised, preserving the natural expressiveness of well-behaved weight distributions. The SWAN-weighted coefficient ensures that higher-sensitivity layers receive proportionally stronger regularisation.

Spectral Conditioning (SC). Minimises the spectral concentration ratio to maintain distributed singular-value spectra:

This implicitly maximises effective rank (the exponential of the singular-value entropy), making weight matrices more robust to the information loss inherent in quantization. It also has an independent benefit: bounding the spectral norm improves gradient stability during backpropagation.

Targeted Quantization Noise Injection (TQNI). Applies quantization noise only to layers in the top-k% of sensitivity scores (k=20 as default), concentrating the hardening effect of QAT without disrupting stable layers. The noise is calibrated to the target bit-width b:

Applied only when S_l > θ_noise (the top-k% threshold). Gradients pass through via the Straight-Through Estimator (STE), identical to standard QAT practice.

Dynamic Bit-Width Allocation (DBWA). Every D=1000 steps, SWAN metrics are recomputed over a small calibration batch and per-layer training precision is reassigned: bottom quartile at 8-bit, middle half at 12-bit, top quartile at 16-bit. At the default 25/50/25 split, this reduces average training precision to 12-bit (vs 16-bit BF16), a 25% reduction in parameter memory footprint with corresponding reductions in gradient and optimiser state.

2.3 Knowledge Distillation: Relevant Prior Work

Knowledge distillation [5] trains a student to match a teacher’s output distribution. Feature-based methods additionally match intermediate representations. TinyBERT [6] distils attention matrices and hidden states layer-by-layer. Patient KD [10] selects which teacher layers to supervise, but via architectural heuristics rather than empirically measured importance. MiniLLM [4] addresses the forward KL bias by using reverse KL. DistiLLM [7] uses a skew divergence formulation.

No existing distillation method uses weight-geometry metrics to weight supervision. This is the gap SAKD addresses. The closest prior work is AWQ [8], which uses activation magnitude to identify salient weight channels—a per-parameter signal rather than the per-tensor composite that SWAN provides. The connection to SAT is more direct: SAT demonstrates that weight geometry can be controlled during training; SAKD asks whether the same control is beneficial during distillation fine-tuning.

3. Motivating Observations

Before describing the SAKD framework, we state the two core motivating claims explicitly. These are presented as motivated hypotheses grounded in the SWAN/SAT findings rather than experimentally verified facts in this paper.

3.1 Teacher SWAN Scores Predict Distillation Layer Importance

Standard feature-matching distillation assigns equal loss weight to every teacher-student layer pair. SWAN demonstrates that teacher layers differ substantially in their weight geometry: kurtosis scores span several orders of magnitude across tensors in the same model (Fig. 4 in the SWAN paper shows this clearly for Qwen3.5-397B). Layers with high SWAN scores have weight distributions that are more fragile to perturbation—whether that perturbation is quantization rounding or a student’s imperfect reproduction.

The hypothesis is therefore: layers with high SWAN sensitivity scores are those where the student’s representational error causes the largest degradation in the teacher’s functional behaviour, because these are the layers where small input perturbations are most strongly amplified toward the output. This is grounded in the SWAN output noise amplification metric (ρ=0.69 with reconstruction error), which directly measures how a layer transforms input perturbations into output deviations.

Testable prediction: Training a student with SWAN-weighted layer losses will produce lower output KL divergence from the teacher than uniform-weighted training, at matched gradient budget, because the weighted approach concentrates gradient signal at the layers that matter most for output quality.

3.2 Distillation Training Induces Pathological Student Weight Geometry

When a student is trained by feature-matching, its parameters are optimised to reproduce teacher representations. If those representations contain high-kurtosis structure (as SWAN documents they frequently do), the student’s weight matrices may develop correspondingly outlier-prone distributions. Even if the teacher’s geometry is clean, the unconstrained optimisation of a feature-matching loss has no mechanism to prevent kurtosis growth—a problem SAT identifies specifically in unconstrained gradient descent.

The pathological geometry of student weights matters beyond quantization: SAT’s theoretical analysis connects spectral concentration to gradient instability during training. A student with concentrated singular-value spectra will have less stable training dynamics, which may explain the commonly observed instability of aggressive feature-matching distillation at low student capacity.

Testable prediction: Students trained with kurtosis and spectral regularisation will have lower SWAN scores than students trained without these constraints, at matched perplexity—meaning they will quantize better post-distillation without additional PTQ correction.

4. The SAKD Framework

Let T denote the teacher with weight tensors {W_l^T} for l=1..N, and S the student with weight tensors {W_l^S} for l=1..M where M≤N. Given a training set D_train, the SAKD pipeline proceeds in three phases.

4.1 Phase 0: Teacher SWAN Profiling (Data-Free, One-Time)

Before any training, compute SWAN sensitivity scores over the teacher’s weight tensors. No calibration data is required. The analysis runs directly on the weight matrices:

Normalise to [0, 1] across layers. Store sensitivity map {S_l} for use throughout training. The entire computation takes <13 minutes for 400B+ parameter models.

These weights (0.45/0.20/0.15/0.20) reflect the SWAN v2 empirical optimisation over Qwen3.5-397B. We recommend revalidating these weights when the target architecture differs substantially from Qwen-family dense or MoE models, using the correlation analysis described in the SWAN paper (§4.2 therein).

4.2 Phase 1: Layer Alignment via Sensitivity Priority

When M < N (student is smaller than teacher), a subset of teacher layers must be selected for direct supervision. SAKD prioritises the most sensitive teacher layers:

This heuristic is motivated by the SWAN finding that sensitivity patterns are architecturally consistent: attention layers systematically receive 1.6–2.5× higher SWAN scores than MLP layers. A sensitivity-priority alignment therefore concentrates the student’s representational capacity on attention-heavy teacher layers, which SWAN identifies as the highest-risk layers under quantization.

Limitation: This alignment assumes that SWAN scores on teacher weight tensors predict the same ordering of layer importance from the student’s perspective. This is the central empirical hypothesis of SAKD and the primary target of the proposed experiments (§5).

4.3 Phase 2: Sensitivity-Weighted Distillation Loss (SWDL)

The feature-matching loss weights each layer pair by the teacher’s SWAN score:

where P_l is a learned linear projection mapping student hidden dimension to teacher hidden dimension, and F is a feature-matching criterion. We recommend cosine similarity loss (1 − cosine(·,·)) as it is scale-invariant and has shown robustness across distillation tasks in prior work.

Temperature τ controls concentration. At τ=1 weights are spread across layers; at τ=0.3 the top-sensitivity quartile receives approximately 4× the weight of the median layer. We recommend a progressive annealing schedule:

This schedule starts with broad supervision (useful when the student is far from the teacher everywhere) and progressively concentrates on high-sensitivity layers (more efficient once low-sensitivity layers are well-approximated). Setting τ→∞ recovers standard uniform-weight distillation, making SAKD a strict generalisation.

The logit-level term L_logit is standard reverse-KL divergence between teacher and student output distributions, following MiniLLM [4]. Reverse KL is preferred over forward KL as it avoids the student being penalised for zero-probability teacher tokens, a known problem in LLM distillation identified by Gu et al. (2024).

4.4 Phase 2: Student Geometry Regularisation (SGR)

The SAT-derived regularisation terms are applied to student weights during distillation training:

Student kurtosis regularisation

Prevents the student from developing outlier weight distributions in the process of matching teacher representations:

where S_l^S is the SWAN score of the student’s own weight tensors, recomputed at each DBWA checkpoint (every D=1000 steps). The one-sided max(0,·) design is inherited from SAT: layers with kurtosis below the target are not penalised, preserving the natural expressiveness of well-behaved distributions.

The target ceiling κ_target ∈ [1.5, 2.5] is the same hyperparameter as in SAT. This range permits moderately heavy-tailed distributions while eliminating only the extreme outliers that cause quantization damage. SAT establishes (§4.1 therein) that kurtosis regularisation does not reduce function-space expressiveness, only preferences quantization-friendly solutions within that space.

Student spectral conditioning

σ_max is approximated via power iteration (O(mn) per layer), amortised by running every k=10 steps. This regulariser simultaneously serves two purposes: (1) maintains distributed singular-value spectra for quantization robustness, and (2) bounds the spectral norm of weight updates, improving gradient stability during distillation training—a benefit particularly relevant when the student is receiving mismatched supervision from a much larger teacher.

Key claim: Student geometry regularisation produces students that are jointly optimised for (a) knowledge transfer from the teacher and (b) quantization-readiness of their own weights. Without these regularisers, standard distillation only optimises (a), and may actively harm (b) by causing the student to reproduce pathological teacher representations in its own parameters.

4.5 Targeted Distillation Noise Injection (TDNI)

TDNI is the distillation adaptation of SAT’s TQNI. For student layers aligned to teacher layers with S_l > θ_noise (top-20% SWAN threshold), the student weight tensor is perturbed during the forward pass:

where b is the target deployment bit-width (typically 4). Gradients pass through via STE.

The effect is to simultaneously train two objectives in a single forward pass: the distillation loss trains the student to approximate the teacher representation, while the noise injection trains the student’s parameters to be robust to the quantization noise that will be applied at deployment. This is more efficient than the standard pipeline of distillation followed by QAT fine-tuning, and it concentrates the hardening effect where SWAN indicates it is most needed.

TDNI differs from uniform QAT in the same way that TQNI differs from uniform QAT: it applies noise only to high-sensitivity layers, avoiding disruption of stable layers whose distillation training does not benefit from hardening. It differs from TQNI in that the sensitivity threshold is applied to the teacher’s SWAN scores (which layer of the teacher is being matched) rather than the student’s own scores, because the teacher’s sensitivity map is the stable, one-time-computed reference.

4.6 Interaction with the AdamW Optimiser

A question that any rigorous reviewer would raise: do the SAKD geometry regularisers interact constructively with AdamW’s second-moment normalisation? We address this explicitly.

AdamW normalises each gradient component by its running second moment, effectively adapting the learning rate per-parameter. This is a gradient-space operation. The SAKD regularisers (kurtosis, spectral) operate in weight-space: they add penalty terms to the loss function that produce additional gradient contributions. These contributions then pass through AdamW’s normalisation like any other gradient. There is no fundamental conflict.

However, a potential interaction arises: if kurtosis regularisation consistently reduces the magnitude of gradient components for outlier weights (by penalising high-|w| values), AdamW’s second-moment estimate for those components will trend downward over time, effectively increasing their adaptive learning rate. This could create a feedback loop that partially counteracts the regulariser. The SAT paper notes this interaction (§4.3 therein) and recommends scaling learning rates for lower-precision layers upward by the precision reduction ratio. In the SAKD context, we recommend monitoring kurtosis evolution during early training and scaling λ_κ upward if kurtosis growth is not suppressed.

Note on gradient whitening: A previous version of this framework proposed SWAN-inspired gradient whitening (WDG) as a fourth mechanism. This was removed because AdamW already performs a structural analogue of gradient whitening via second-moment normalisation. A WDG component would be meaningful only if the student were trained with a stateless optimiser such as SGD or Muon; in that case, SWAN-style per-layer gradient normalisation (as described in the SAT framework) could be directly applied to the distillation loss gradient.

4.7 Algorithm Summary

SAKD Training Pipeline

Phase 0 (~13 min, one-time):
Compute SWAN sensitivity map {S_l} over teacher weight tensors.
No calibration data required. Store composite scores and layer ranking.

Phase 1 (minutes):
Determine teacher-student layer alignment by sensitivity priority.
Initialise learned projection matrices P_l.
Optionally initialise student weights by copying teacher weights from
lowest-sensitivity layers, rescaling to match target dimension.

Phase 2 (training):
Train student on D_train with L_SAKD objective.
Anneal τ from 1.0 to 0.3.
Apply kurtosis + spectral regularisation to student weights.
Apply TDNI to layers with teacher SWAN score S_l > θ_noise.
Every D=1000 steps:
Recompute student SWAN scores.
Update per-layer regularisation coefficients.
Optional: reduce learning rate for stable (low-S_l) layers.

Post-training:
Student SWAN scores serve directly as input to standard SWAN PTQ
if further quantization is desired — eliminating one full PTQ
analysis step from the deployment pipeline.

5. Proposed Experimental Protocol

Transparency note: No experimental results are reported in this paper. The following describes the experimental protocol we propose to validate SAKD. This framing mirrors the approach taken in the SAT paper, which similarly presents a theoretical framework and defers empirical validation to future work.

5.1 Core Evaluation Questions

Q1: Does SWAN-weighted supervision outperform uniform-weight supervision? This tests the central hypothesis of §3.1. Evaluation: WikiText-103 perplexity and downstream benchmark scores (HellaSwag, MMLU, ARC-Challenge, GSM8K) for SAKD vs PKD-uniform at matched training budget, identical architecture and teacher.

Q2: Does SGR produce students with lower SWAN scores? This tests the central hypothesis of §3.2. Evaluation: SWAN composite scores of the trained student, and post-hoc PTQ quality (perplexity after uniform 4-bit RTN quantization) with vs without SGR regularisation. A student with lower SWAN scores should quantize better without additional PTQ correction.

Q3: Does TDNI improve post-distillation quantization quality without degrading distillation quality? Evaluation: distilled student perplexity with TDNI vs without, then post-distillation 4-bit PTQ perplexity. TDNI should improve the latter with minimal impact on the former.

5.2 Model Pairs and Baselines

Table 1: Proposed model pairs for experimental validation. No results are reported; these are proposed experiments. * Student architecture uses standard Qwen/Llama architecture without SWAN-guided capacity allocation.

Teacher	Student	Compression Ratio	Family	Primary Purpose
Qwen3-8B	Qwen3-1.7B*	~5×	Qwen	Test Qs 1 & 2
Qwen3-8B	Qwen3-0.6B*	~13×	Qwen	Stress-test (aggressive)
Llama-3.1-8B	Llama-3.2-1B	~8×	Llama	Cross-architecture
Qwen3.5-397B (4-bit)	Qwen3-8B*	~50×	MoE→dense	Test Q3 (TDNI)

Baselines: (1) Standard output KD only. (2) Patient KD with uniform layer weights. (3) MiniLLM (reverse-KL, no feature matching). (4) SAKD without SGR (to isolate geometry regularisation contribution). (5) SAKD without TDNI.

5.3 Metric for Hypothesis 2: SWAN Post-Distillation Audit

The novel metric in SAKD evaluation is the SWAN audit: after training, run SWAN’s full pipeline (all four metrics) on the trained student. This produces a per-tensor sensitivity profile. The primary metric is the distribution of composite SWAN scores across student tensors. Secondary metric: perplexity after applying uniform 4-bit RTN quantization to the trained student (without any further PTQ calibration). This directly tests whether the student’s learned geometry is quantization-ready.

This metric does not exist in prior distillation benchmarks. Its introduction is itself a contribution: it reframes distillation evaluation as encompassing not only task performance but also deployment geometry.

6. Theoretical Motivation

We provide informal but rigorous theoretical motivation for the SAKD components. We deliberately avoid presenting results as formal theorems, given that the underlying assumptions (local linearity, bounded curvature, etc.) require empirical validation before they support theorem-level claims.

6.1 Why SWAN Scores Approximate Layer-Wise Jacobian Norms

Consider a teacher layer l with weight W and output h = Wx. The output noise amplification metric s_amp measures the expected ratio ||W·ε|| / ||ε|| for random probe vectors ε. In expectation over random ε with unit variance, this is the expected operator norm of W, which is bounded above and below by the largest and average singular values of W respectively. High s_amp therefore indicates that W amplifies perturbations strongly—i.e., that the layer has large Jacobian norm with respect to its inputs.

A layer with large Jacobian norm ∂h/∂x propagates representational errors from the student strongly forward through the network. If the student makes an error ε_l = h_l^S − h_l^T at layer l, the expected contribution to the final output divergence is proportional to the product of Jacobian norms of all subsequent layers. High-sensitivity layers (those near the end of the network or with intrinsically large Jacobian norms) therefore contribute disproportionately to output-level teacher-student divergence—motivating higher distillation loss weights.

The kurtosis metric s_kurt is a weight-distribution statistic with no direct Jacobian interpretation. Its SWAN correlation (ρ=0.80) reflects the empirical observation that layers with outlier weight distributions tend to have high effective condition numbers, which in turn amplifies the Frobenius-norm impact of quantization noise on outputs. This is consistent with the Hessian-based sensitivity analyses in GPTQ [3] and HAWQ [11], which find that layers with poor weight conditioning require more bits.

6.2 Why Kurtosis Regularisation Does Not Harm Distillation

SAT establishes (§4.1) that kurtosis regularisation does not reduce model expressiveness because it targets the tails of weight distributions (the tiny fraction of weights responsible for outlier kurtosis) rather than the bulk distribution. The same argument applies in the distillation setting: the student is not prevented from learning to reproduce teacher representations; it is only prevented from doing so by developing weight matrices with extreme outliers.

There is one distillation-specific consideration: if the teacher’s hidden representations themselves have high kurtosis (as is common in transformer models trained without geometry control), feature-matching loss will push the student’s activations to match those high-kurtosis outputs. Kurtosis regularisation on student weights does not prevent this activation-level matching—it only constrains the weight distribution that produces those activations. A student can produce high-kurtosis activations from well-distributed weights if the input distribution is itself high-kurtosis, though the gradient dynamics become more stable.

7. Discussion

7.1 The Broader Argument: Geometry as a First-Class Training Objective

SWAN and SAT together establish a position: weight geometry is not an incidental property of trained models but a controllable quantity with measurable consequences for deployment. SAKD extends this position to the distillation setting, arguing that a distilled student should be optimised simultaneously for task-performance alignment and geometric deployment-readiness.

This is a shift from the standard framing of distillation as purely a knowledge-transfer problem. In the standard framing, the student succeeds if it approximates the teacher’s task performance; geometric properties are a post-hoc concern. SAKD argues that geometric properties can be optimised during distillation at negligible cost (the SAT regularisers add <1% overhead relative to a forward pass), and that doing so can eliminate the need for a separate PTQ step—or at minimum, produce a student that responds better to PTQ when it is applied.

7.2 Connection to the SWAN Paper’s Open Questions

“Combining SWAN’s sensitivity map with calibration-based fine-tuning for a ‘coarse-to-fine’ quantization pipeline”: SAKD is a form of coarse-to-fine pipeline, but with distillation as the refinement stage rather than PTQ fine-tuning. The student’s reduced parameter count makes calibration-based methods more tractable after SAKD training.

“Reconstruction error proxy saturation at the 397B scale”: SAKD’s use of the composite score for loss weighting rather than bit-width allocation means that normalisation saturation is less damaging. A saturated s_err contributes a constant term to all layer weights equally, reverting toward the other three metrics rather than causing incorrect bit assignments.

“Adaptive reconstruction error normalisation across model scales”: When applying SAKD across different teacher sizes, per-architecture normalisation of the reconstruction error proxy should be applied, using the same adaptive scaling the SWAN paper recommends for v3 development.

7.3 Limitations

Unvalidated core hypotheses. The two central claims of §3 are not experimentally verified in this paper. The framework’s value depends on these holding empirically. They are falsifiable and we have described experiments to test them.

SWAN score transferability assumption. We assume that teacher SWAN scores accurately proxy which teacher layers are most important for a student to match. This assumes that sensitivity to quantization noise and sensitivity to student representational error are correlated—a reasonable but unverified assumption. If teacher layers are sensitive to quantization specifically due to outlier activations (rather than weight geometry), the weight-based SWAN metrics may not translate.

Hyperparameter coupling. SAT’s hyperparameters (κ_target, λ_κ, λ_σ, θ_noise, D) were developed for pre-training from scratch. Their optimal values in the distillation fine-tuning setting—where the student starts from a pretrained checkpoint and is trained for far fewer steps—may differ substantially. Systematic hyperparameter exploration is needed.

Metric saturation on smaller models. SWAN’s output noise amplification saturates at 1.0 for all tensors in the 8B-scale model (v1 metrics). This means the composite score on smaller models is effectively three-metric rather than four-metric. The log-scale normalisation in v2 partially addresses this; we recommend applying v2 metrics and validating correlation statistics on any new architecture before relying on the composite for SAKD weighting.

No multi-teacher extension. SAKD is designed for single-teacher distillation. Extension to multi-teacher settings (where different teachers are used for different layer groups) would require a reconciliation of SWAN profiles across teacher architectures.

8. Conclusion

We have presented SAKD, a framework that applies the SWAN/SAT weight-geometry philosophy to the knowledge distillation setting. The framework makes three contributions: sensitivity-weighted feature-matching loss derived from data-free SWAN profiling of the teacher; student geometry regularisation using SAT-derived kurtosis and spectral conditioning; and Targeted Distillation Noise Injection, which co-trains knowledge transfer and quantization hardening in a single pass.

SAKD is grounded in real empirical findings: the SWAN validation of its four metrics against actual quantization error (up to ρ=0.80 for kurtosis, ρ=0.69 for output sensitivity), and SAT’s demonstration that weight geometry can be regulated during training without sacrificing expressiveness. The framework proposes two testable hypotheses—that SWAN-weighted supervision outperforms uniform supervision, and that SGR produces more quantization-ready students—with concrete evaluation protocols described.

The honest assessment is that SAKD makes a coherent and motivated proposal. Whether it works empirically is an open question that requires implementation. The framework’s value lies in treating distillation and deployment-readiness as jointly optimisable, rather than sequential concerns—an architectural choice that, if validated, would simplify the LLM compression pipeline while improving the quality of its outputs.

References

[1] baa.ai (2026). “SWAN: Data-Free Mixed-Precision Quantization for Large Language Models via Multi-Metric Sensitivity Analysis.” baa.ai Research Publication.

[2] baa.ai (2025). “Sensitivity-Aware Training (SAT): Using Statistical Weight Geometry to Guide LLM Training Dynamics.” baa.ai Research Publication, CC BY-NC-ND 4.0.

[3] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.” In ICLR, 2023.

[4] Y. Gu, L. Dong, F. Wei, and M. Huang. “MiniLLM: Knowledge Distillation of Large Language Models.” In ICLR, 2024.

[5] G. Hinton, O. Vinyals, and J. Dean. “Distilling the Knowledge in a Neural Network.” NIPS Deep Learning Workshop, 2015.

[6] X. Jiao et al. “TinyBERT: Distilling BERT for Natural Language Understanding.” EMNLP Findings, 2020.

[7] J. Ko et al. “DistiLLM: Towards Streamlined Distillation for Large Language Models.” In ICML, 2024.

[8] J. Lin et al. “AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration.” In MLSys, 2024.

[9] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. “Spectral Normalization for Generative Adversarial Networks.” In ICLR, 2018.

[10] S. Sun et al. “Patient Knowledge Distillation for BERT Model Compression.” In EMNLP, 2019.

[11] Z. Dong et al. “HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision.” In ICCV, 2019.

[12] M. S. Akhondzadeh et al. “KurTail: Kurtosis-based LLM Quantization.” Findings of EMNLP, 2025.

[13] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer. “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.” NeurIPS, 35, 30318–30332, 2022.

Appendix A. SWAN Metric Summary (From Source)

The following table reproduces the SWAN metric correlation statistics from the published SWAN paper [1], Table 2, computed on Qwen3.5-397B (2,347 weight tensors). These are the empirical values that ground SAKD’s use of SWAN scores as distillation weights.

Table 2: SWAN metric correlation with 4-bit reconstruction error (Qwen3.5-397B, 2,347 tensors). * Reconstruction error proxy saturates at 1.0 for most tensors at 397B scale, preventing reliable Spearman correlation measurement. SAKD inherits the v2 weight of 0.20 but recommends adaptive normalisation to prevent saturation (see §7.3).

Appendix B. Hyperparameter Reference

Table 3: SAKD hyperparameter reference. All SAT-inherited parameters retain their recommended ranges from the SAT paper. SAKD-specific parameters (τ, β, γ) are initial recommendations pending empirical tuning.

Metric	Spearman ρ	p-value	Pearson r	p-value	SAKD Weight
Excess kurtosis	0.796	<0.001	0.347	<0.001	0.45
Output noise amplification	0.694	<0.001	0.316	<0.001	0.15
SVD spectral concentration	0.399	<0.001	0.298	<0.001	0.20
Reconstruction error proxy	—*	—*	—*	—*	0.20

Parameter	Recommended	Description	Source
κ_target	1.5–2.5	Kurtosis ceiling; above this, the regulariser activates	SAT §6.2
λ_κ	1e-4 – 1e-3	Global kurtosis regularisation coefficient	SAT §6.2
λ_σ	1e-4	Global spectral conditioning coefficient	SAT §6.2
τ_max, τ_min	1.0, 0.3	Temperature annealing range for loss weighting	SAKD §4.3
θ_noise	Top 20%	TDNI activation threshold (SWAN score percentile)	SAT §3.3.2 (TQNI)
b (TDNI)	4	Target bit-width for noise calibration	SAKD §4.5
D (DBWA interval)	1000 steps	Steps between SWAN re-diagnostic checkpoints	SAT §3.4.2
β, γ	0.5, 0.3	Logit-KD and feature-matching loss coefficients	SAKD §4.3