Lightweight Fine-tuning by Replacing FFN with Spline-KAN in BERT: Two-stage Training and Comparison with BitFit
DOI: https://doi.org/10.62517/jike.202604105
Author(s)
Mingyang Song
Affiliation(s)
Internet of Things Engineering, International School, Beijing University of Posts and Telecommunications, Beijing, China
Abstract
This paper proposes replacing the feed-forward network (FFN) sublayer in BERT with a spline-based Kolmogorov–Arnold Network (Spline-KAN) and designs a two-stage training procedure for efficient fine-tuning under a strict parameter budget. The procedure first warms up the KAN modules (unfreezing only the KAN and the classification head), then applies a BitFit-style step that trains only biases and spline control points, leaving an extremely small set of trainable parameters. Controlled experiments were conducted on the eprstmt subset of FewCLUE with identical random seeds and training settings (main configuration: G = 16, intermediate = 512, a single 48 GB vGPU). Under this main configuration, the two-stage KAN method significantly outperforms BitFit-only on the validation set (mean improvement ≈ 21.25 percentage points, paired t(4) = 5.54, p = 0.0052, Cohen's d ≈ 2.48) and also significantly outperforms the full-parameter baseline under the same configuration (mean improvement ≈ 8.38 percentage points, paired t(4) = 3.16, p = 0.0341, Cohen's d ≈ 1.41). The experiments show that, within a trainable-parameter budget of roughly 0.2M, structural replacement combined with staged training can yield substantial accuracy gains, offering a practical path for deploying pretrained Transformers in resource-constrained scenarios.
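As a concrete illustration of the procedure summarized above, the following PyTorch sketch shows (i) a simplified Spline-KAN layer built from a uniform B-spline basis with learnable control points plus a SiLU residual path, (ii) swapping it into the FFN projections of a BERT encoder, and (iii) the two-stage parameter selection (stage 1: KAN modules and classification head; stage 2: BitFit-style biases plus spline control points). The names (SplineKANLayer, replace_ffn_with_kan, set_stage), the shared tanh-squashed grid, and the reliance on the standard Hugging Face BertForSequenceClassification module layout are illustrative assumptions, not the paper's released implementation.

# Illustrative sketch only: module/function names and the grid handling below are assumptions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class SplineKANLayer(nn.Module):
    """KAN-style layer: a learnable B-spline per (input, output) edge plus a SiLU residual path."""

    def __init__(self, in_features, out_features, grid_size=16, spline_order=3):
        super().__init__()
        self.spline_order = spline_order
        h = 2.0 / grid_size  # uniform knot spacing on [-1, 1]
        knots = torch.arange(-spline_order, grid_size + spline_order + 1,
                             dtype=torch.float32) * h - 1.0
        self.register_buffer("grid", knots)  # (grid_size + 2*spline_order + 1,)
        # Spline control points: the parameters that stage 2 keeps trainable.
        self.spline_weight = nn.Parameter(
            0.01 * torch.randn(out_features, in_features, grid_size + spline_order))
        self.base_weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.base_weight, a=math.sqrt(5))

    def b_splines(self, x):
        """Cox–de Boor recursion: (..., in_features) -> (..., in_features, grid_size + order)."""
        g, k = self.grid, self.spline_order
        x = x.unsqueeze(-1)
        bases = ((x >= g[:-1]) & (x < g[1:])).to(x.dtype)  # degree-0 indicator functions
        for d in range(1, k + 1):
            left = (x - g[:-(d + 1)]) / (g[d:-1] - g[:-(d + 1)]) * bases[..., :-1]
            right = (g[d + 1:] - x) / (g[d + 1:] - g[1:-d]) * bases[..., 1:]
            bases = left + right
        return bases

    def forward(self, x):
        base = F.silu(x) @ self.base_weight.t()
        # tanh keeps inputs inside the knot range; full KAN implementations adapt the grid instead.
        spline = torch.einsum("...id,oid->...o",
                              self.b_splines(torch.tanh(x)), self.spline_weight)
        return base + spline


def replace_ffn_with_kan(model, intermediate=512, grid_size=16):
    """Swap both FFN projections of every encoder layer for Spline-KAN layers
    (assumes the standard Hugging Face BertForSequenceClassification layout)."""
    hidden = model.config.hidden_size
    for layer in model.bert.encoder.layer:
        layer.intermediate.dense = SplineKANLayer(hidden, intermediate, grid_size)
        layer.intermediate.intermediate_act_fn = nn.Identity()  # the KAN layer is already nonlinear
        layer.output.dense = SplineKANLayer(intermediate, hidden, grid_size)


def set_stage(model, stage):
    """Stage 1: unfreeze only KAN modules and the classification head.
    Stage 2: BitFit-style -- unfreeze only biases and spline control points."""
    for name, p in model.named_parameters():
        if stage == 1:
            p.requires_grad = ("spline_weight" in name or "base_weight" in name
                               or name.startswith("classifier"))
        else:
            p.requires_grad = name.endswith(".bias") or "spline_weight" in name


# Hypothetical usage:
#   model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)
#   replace_ffn_with_kan(model, intermediate=512, grid_size=16)
#   set_stage(model, 1)   # warm-up stage: KAN + classifier head
#   ...train...
#   set_stage(model, 2)   # BitFit-style stage: biases + spline control points
#   ...train...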
Keywords
Parameter-Efficient Fine-Tuning; Spline-KAN; BitFit; BERT; Two-stage Training
References
[1] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 4171–4186.
[2] Houlsby, N., Giurgiu, A., Jastrzębski, S., Morrone, B., de Laroussilhe, Q., Gesmundo, A., Attariyan, M., & Gelly, S. (2019). Parameter-Efficient Transfer Learning for NLP. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), Proceedings of Machine Learning Research, Vol. 97, pp. 2790–2799. Preprint: arXiv:1902.00751.
[3] Ben-Zaken, E., Ravfogel, S., & Goldberg, Y. (2021). BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models. arXiv preprint arXiv:2106.10199.
[4] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685.
[5] Liu, Z., Wang, Y., Vaidya, S., Ruehle, F., Halverson, J., Soljačić, M., Hou, T. Y., & Tegmark, M. (2024). KAN: Kolmogorov–Arnold Networks. arXiv preprint arXiv:2404.19756.
[6] Kolmogorov, A. N. (1957). On the representation of continuous functions of many variables by superpositions of continuous functions of one variable and addition. Doklady Akademii Nauk SSSR (Russian original). English translation: Kolmogorov, A. N. (1963). On the representation of continuous functions of many variables by superpositions of continuous functions of one variable and addition. American Mathematical Society Translations, Series 2, Vol. 28, pp. 55–59.
[7] Arnold, V. I. (1959). On functions of three variables. (Constructive supplement to Kolmogorov’s theorem.) English translation commonly cited as: Arnold, V. I. (1963). On functions of three variables. American Mathematical Society Translations, Series 2, Vol. 28, pp. 1–16.
[8] Zhang, X., & Zhao, Y. (2023). On the Efficiency of Feed-Forward Networks in Transformers. Journal of Machine Learning Research (JMLR), 25(100), 1–15.
[9] Radford, A., & Narasimhan, K. (2023). Scaling Laws for Neural Networks and Their Application to Transformers. Proceedings of the 2023 Conference on Neural Information Processing Systems (NeurIPS 2023).
[10] Xu, L., & Zhang, W. (2023). Efficient Fine-Tuning of Pretrained Models for NLP Tasks. Proceedings of the 2023 International Conference on Machine Learning (ICML 2023).
[11] Li, H., & Chen, J. (2024). Low-Rank Adaptation for Efficient Fine-Tuning of Pretrained Transformers. Journal of Artificial Intelligence Research (JAIR), 61, 1–18.
[12] Wang, S., & Yang, Y. (2024). Spline-Based Models for High-Dimensional Approximation. Proceedings of the 2024 Conference on Machine Learning and Artificial Intelligence (MLAI 2024).
[13] Yang, H., & Lu, X. (2025). Efficient Parameter Control for Transformers in Resource-Constrained Scenarios. Proceedings of the 2025 International Conference on Learning Representations (ICLR 2025).