DDK : Dynamic structure pruning based on differentiable search and recursive knowledge distillation for BERT
Copyright © 2024 Elsevier Ltd. All rights reserved.
Large-scale pre-trained models such as BERT have demonstrated outstanding performance in Natural Language Processing (NLP). Nevertheless, the large number of parameters in these models increases the demand for hardware storage and computational resources, posing a challenge for practical deployment. In this article, we propose a combined method of model pruning and knowledge distillation to compress and accelerate large-scale pre-trained language models. Specifically, we introduce DDK, a dynamic structure pruning method based on differentiable search and recursive knowledge distillation that automatically prunes the BERT model. We define the search space for network pruning as all feed-forward layer channels and self-attention heads at each layer of the network, and use differentiable methods to determine their optimal number. Additionally, we design a recursive knowledge distillation method that employs adaptive weighting to extract the most important features from multiple intermediate layers of the teacher model and fuses them to supervise the learning of the student network. Our experimental results on the GLUE benchmark dataset and ablation analysis demonstrate that our proposed method outperforms other advanced methods in terms of average performance.
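The abstract names two mechanisms: differentiable (relaxed) gates that decide how many attention heads and feed-forward channels to keep, and adaptive weighting that fuses intermediate teacher-layer features into a distillation target. The paper's exact formulation is not given in this record; purely as an illustration, here is a minimal NumPy sketch under assumed forms — sigmoid-relaxed keep-probabilities per head and a softmax over learned layer-importance logits. All function names and shapes are hypothetical, not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def head_keep_probs(gate_logits, temperature=1.0):
    """Differentiable keep-probability for each attention head.

    A hard keep/drop decision is not differentiable, so a common
    relaxation is a temperature-scaled sigmoid over learnable logits;
    the expected number of kept heads can then be penalized directly.
    """
    return sigmoid(gate_logits / temperature)

def adaptive_layer_weights(importance_logits):
    """Softmax weights over teacher layers (the 'adaptive weighting')."""
    z = importance_logits - importance_logits.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def fused_teacher_feature(teacher_feats, importance_logits):
    """Weighted fusion of per-layer teacher features, shape (layers, dim)."""
    w = adaptive_layer_weights(importance_logits)
    return (w[:, None] * teacher_feats).sum(axis=0)

def distill_loss(student_feat, teacher_feats, importance_logits):
    """MSE between a student feature and the fused teacher feature."""
    target = fused_teacher_feature(teacher_feats, importance_logits)
    return float(np.mean((student_feat - target) ** 2))
```

In practice such gates would be trained jointly with the task and distillation losses (e.g. with a straight-through or Gumbel-softmax estimator) and then thresholded to obtain the final pruned architecture; this sketch only shows the differentiable relaxation itself.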
Media type: E-Article
Year of publication: 2024
Published: 2024
Contained in: Complete record - volume:173
Contained in: Neural networks : the official journal of the International Neural Network Society - 173(2024), 15 March, page 106164
Language: English
Contributors: Zhang, Zhou [Author]
Topics: Differentiable methods
Notes: Date Completed 26.03.2024; Date Revised 26.03.2024; published: Print-Electronic; Citation Status MEDLINE
DOI: 10.1016/j.neunet.2024.106164
PPN (catalog ID): NLM368574156
LEADER 01000caa a22002652 4500
001 NLM368574156
003 DE-627
005 20240326235440.0
007 cr uuu---uuuuu
008 240218s2024 xx |||||o 00| ||eng c
024 7  |a 10.1016/j.neunet.2024.106164 |2 doi
028 52 |a pubmed24n1348.xml
035    |a (DE-627)NLM368574156
035    |a (NLM)38367353
035    |a (PII)S0893-6080(24)00088-1
040    |a DE-627 |b ger |c DE-627 |e rakwb
041    |a eng
100 1  |a Zhang, Zhou |e verfasserin |4 aut
245 10 |a DDK |b Dynamic structure pruning based on differentiable search and recursive knowledge distillation for BERT
264  1 |c 2024
336    |a Text |b txt |2 rdacontent
337    |a Computermedien |b c |2 rdamedia
338    |a Online-Ressource |b cr |2 rdacarrier
500    |a Date Completed 26.03.2024
500    |a Date Revised 26.03.2024
500    |a published: Print-Electronic
500    |a Citation Status MEDLINE
520    |a Copyright © 2024 Elsevier Ltd. All rights reserved.
520    |a Large-scale pre-trained models, such as BERT, have demonstrated outstanding performance in Natural Language Processing (NLP). Nevertheless, the high number of parameters in these models has increased the demand for hardware storage and computational resources while posing a challenge for their practical deployment. In this article, we propose a combined method of model pruning and knowledge distillation to compress and accelerate large-scale pre-trained language models. Specifically, we introduce a dynamic structure pruning method based on differentiable search and recursive knowledge distillation to automatically prune the BERT model, named DDK. We define the search space for network pruning as all feed-forward layer channels and self-attention heads at each layer of the network, and utilize differentiable methods to determine their optimal number. Additionally, we design a recursive knowledge distillation method that employs adaptive weighting to extract the most important features from multiple intermediate layers of the teacher model and fuse them to supervise the student network learning. Our experimental results on the GLUE benchmark dataset and ablation analysis demonstrate that our proposed method outperforms other advanced methods in terms of average performance.
650  4 |a Journal Article
650  4 |a Differentiable methods
650  4 |a Knowledge distillation
650  4 |a Model compression
650  4 |a Network pruning
650  4 |a Pre-trained models
700 1  |a Lu, Yang |e verfasserin |4 aut
700 1  |a Wang, Tengfei |e verfasserin |4 aut
700 1  |a Wei, Xing |e verfasserin |4 aut
700 1  |a Wei, Zhen |e verfasserin |4 aut
773 08 |i Enthalten in |t Neural networks : the official journal of the International Neural Network Society |d 1996 |g 173(2024) vom: 15. März, Seite 106164 |w (DE-627)NLM087746824 |x 1879-2782 |7 nnns
773 18 |g volume:173 |g year:2024 |g day:15 |g month:03 |g pages:106164
856 40 |u http://dx.doi.org/10.1016/j.neunet.2024.106164 |3 Volltext
912    |a GBV_USEFLAG_A
912    |a GBV_NLM
951    |a AR
952    |d 173 |j 2024 |b 15 |c 03 |h 106164