TY - JOUR
T1 - Low-Area and Low-Power VLSI Architectures for Long Short-Term Memory Networks
AU - Alhartomi, Mohammed A.
AU - Khan, Mohd. Tasleem
AU - Alzahrani, Saeed
AU - Alzahmi, Ahmed
AU - Shaik, Rafi Ahamed
AU - Hazarika, Jinti
AU - Alsulami, Ruwaybih
AU - Alotaibi, Abdulaziz
AU - Al-Harthi, Meshal
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023/12
Y1 - 2023/12
N2 - Long short-term memory (LSTM) networks are widely used in sequential learning tasks, including speech recognition. Their significance in real-world applications has created demand for cost-effective and power-efficient designs. This paper introduces LSTM architectures based on distributed arithmetic (DA) that use circulant and block-circulant matrix-vector multiplications (MVMs) for network compression. A quantized-weight-oriented approach is adopted for training the circulant and block-circulant matrices. By formulating fixed-point circulant/block-circulant MVMs, we explore the impact of kernel size on accuracy. Our DA-based approach employs shared full and partial add-store/store-add methods followed by a select unit to realize an MVM; it is then coupled with a multi-partial strategy to reduce complexity for larger kernel sizes. Further complexity reduction is achieved by optimizing the decoders of multiple select units. Pipelining the add-store stage improves speed at the cost of a few pipeline registers. Field-programmable gate array results demonstrate the superiority of the proposed architectures based on the partial store-add method, delivering reductions of 98.71% in DSP slices, 33.59% in slice look-up tables, 13.43% in flip-flops, and 29.76% in power compared with the state of the art.
AB - Long short-term memory (LSTM) networks are widely used in sequential learning tasks, including speech recognition. Their significance in real-world applications has created demand for cost-effective and power-efficient designs. This paper introduces LSTM architectures based on distributed arithmetic (DA) that use circulant and block-circulant matrix-vector multiplications (MVMs) for network compression. A quantized-weight-oriented approach is adopted for training the circulant and block-circulant matrices. By formulating fixed-point circulant/block-circulant MVMs, we explore the impact of kernel size on accuracy. Our DA-based approach employs shared full and partial add-store/store-add methods followed by a select unit to realize an MVM; it is then coupled with a multi-partial strategy to reduce complexity for larger kernel sizes. Further complexity reduction is achieved by optimizing the decoders of multiple select units. Pipelining the add-store stage improves speed at the cost of a few pipeline registers. Field-programmable gate array results demonstrate the superiority of the proposed architectures based on the partial store-add method, delivering reductions of 98.71% in DSP slices, 33.59% in slice look-up tables, 13.43% in flip-flops, and 29.76% in power compared with the state of the art.
KW - Inner products
KW - long short-term memory
KW - matrix-vector multiplication
KW - recurrent neural networks
KW - VLSI
UR - http://www.scopus.com/inward/record.url?scp=85177033170&partnerID=8YFLogxK
U2 - 10.1109/JETCAS.2023.3330428
DO - 10.1109/JETCAS.2023.3330428
M3 - Article
AN - SCOPUS:85177033170
SN - 2156-3357
VL - 13
SP - 1000
EP - 1014
JO - IEEE Journal on Emerging and Selected Topics in Circuits and Systems
JF - IEEE Journal on Emerging and Selected Topics in Circuits and Systems
IS - 4
ER -