DDK : Dynamic structure pruning based on differentiable search and recursive knowledge distillation for BERT
Copyright © 2024 Elsevier Ltd. All rights reserved.
Large-scale pre-trained models such as BERT have demonstrated outstanding performance in Natural Language Processing (NLP). Nevertheless, the large number of parameters in these models increases the demand for hardware storage and computational resources, posing a challenge for practical deployment. In this article, we propose a combined method of model pruning and knowledge distillation to compress and accelerate large-scale pre-trained language models. Specifically, we introduce DDK, a dynamic structure pruning method based on differentiable search and recursive knowledge distillation that automatically prunes the BERT model. We define the search space for network pruning as all feed-forward layer channels and self-attention heads at each layer of the network, and use differentiable methods to determine their optimal number. Additionally, we design a recursive knowledge distillation method that employs adaptive weighting to extract the most important features from multiple intermediate layers of the teacher model and fuses them to supervise the learning of the student network. Our experimental results on the GLUE benchmark dataset and ablation analysis demonstrate that our proposed method outperforms other advanced methods in terms of average performance.
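The abstract names two mechanisms: differentiable (relaxed) gates that decide how many attention heads and feed-forward channels to keep, and adaptive weighting that fuses intermediate teacher-layer features into a distillation target. The paper's exact formulation is not given in this record; purely as an illustration, here is a minimal NumPy sketch under assumed forms — sigmoid-relaxed keep-probabilities per head and a softmax over learned layer-importance logits. All function names and shapes are hypothetical, not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def head_keep_probs(gate_logits, temperature=1.0):
    """Differentiable keep-probability for each attention head.

    A hard keep/drop decision is not differentiable, so a common
    relaxation is a temperature-scaled sigmoid over learnable logits;
    the expected number of kept heads can then be penalized directly.
    """
    return sigmoid(gate_logits / temperature)

def adaptive_layer_weights(importance_logits):
    """Softmax weights over teacher layers (the 'adaptive weighting')."""
    z = importance_logits - importance_logits.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def fused_teacher_feature(teacher_feats, importance_logits):
    """Weighted fusion of per-layer teacher features, shape (layers, dim)."""
    w = adaptive_layer_weights(importance_logits)
    return (w[:, None] * teacher_feats).sum(axis=0)

def distill_loss(student_feat, teacher_feats, importance_logits):
    """MSE between a student feature and the fused teacher feature."""
    target = fused_teacher_feature(teacher_feats, importance_logits)
    return float(np.mean((student_feat - target) ** 2))
```

In practice such gates would be trained jointly with the task and distillation losses (e.g. with a straight-through or Gumbel-softmax estimator) and then thresholded to obtain the final pruned architecture; this sketch only shows the differentiable relaxation itself.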
Media type: E-Article
Year of publication: 2024
Published: 2024
Contained in: Complete record - volume:173
Contained in: Neural networks : the official journal of the International Neural Network Society - 173(2024), 15 March, page 106164
Language: English
Contributors: Zhang, Zhou [Author]
Topics: Differentiable methods
Notes: Date Completed 26.03.2024; Date Revised 26.03.2024; published: Print-Electronic; Citation Status MEDLINE
DOI: 10.1016/j.neunet.2024.106164
PPN (catalog ID): NLM368574156
LEADER 01000caa a22002652 4500
001 NLM368574156
003 DE-627
005 20240326235440.0
007 cr uuu---uuuuu
008 240218s2024 xx |||||o 00| ||eng c
024 7  |a 10.1016/j.neunet.2024.106164 |2 doi
028 52 |a pubmed24n1348.xml
035    |a (DE-627)NLM368574156
035    |a (NLM)38367353
035    |a (PII)S0893-6080(24)00088-1
040    |a DE-627 |b ger |c DE-627 |e rakwb
041    |a eng
100 1  |a Zhang, Zhou |e verfasserin |4 aut
245 10 |a DDK |b Dynamic structure pruning based on differentiable search and recursive knowledge distillation for BERT
264  1 |c 2024
336    |a Text |b txt |2 rdacontent
337    |a Computermedien |b c |2 rdamedia
338    |a Online-Ressource |b cr |2 rdacarrier
500    |a Date Completed 26.03.2024
500    |a Date Revised 26.03.2024
500    |a published: Print-Electronic
500    |a Citation Status MEDLINE
520    |a Copyright © 2024 Elsevier Ltd. All rights reserved.
520    |a Large-scale pre-trained models, such as BERT, have demonstrated outstanding performance in Natural Language Processing (NLP). Nevertheless, the high number of parameters in these models has increased the demand for hardware storage and computational resources while posing a challenge for their practical deployment. In this article, we propose a combined method of model pruning and knowledge distillation to compress and accelerate large-scale pre-trained language models. Specifically, we introduce a dynamic structure pruning method based on differentiable search and recursive knowledge distillation to automatically prune the BERT model, named DDK. We define the search space for network pruning as all feed-forward layer channels and self-attention heads at each layer of the network, and utilize differentiable methods to determine their optimal number. Additionally, we design a recursive knowledge distillation method that employs adaptive weighting to extract the most important features from multiple intermediate layers of the teacher model and fuse them to supervise the student network learning. Our experimental results on the GLUE benchmark dataset and ablation analysis demonstrate that our proposed method outperforms other advanced methods in terms of average performance.
650  4 |a Journal Article
650  4 |a Differentiable methods
650  4 |a Knowledge distillation
650  4 |a Model compression
650  4 |a Network pruning
650  4 |a Pre-trained models
700 1  |a Lu, Yang |e verfasserin |4 aut
700 1  |a Wang, Tengfei |e verfasserin |4 aut
700 1  |a Wei, Xing |e verfasserin |4 aut
700 1  |a Wei, Zhen |e verfasserin |4 aut
773 08 |i Enthalten in |t Neural networks : the official journal of the International Neural Network Society |d 1996 |g 173(2024) vom: 15. März, Seite 106164 |w (DE-627)NLM087746824 |x 1879-2782 |7 nnns
773 18 |g volume:173 |g year:2024 |g day:15 |g month:03 |g pages:106164
856 40 |u http://dx.doi.org/10.1016/j.neunet.2024.106164 |3 Volltext
912    |a GBV_USEFLAG_A
912    |a GBV_NLM
951    |a AR
952    |d 173 |j 2024 |b 15 |c 03 |h 106164