Abstract
Long short-term memory (LSTM) networks are extensively used in sequential learning tasks, including speech recognition. Their significance in real-world applications has created demand for cost-effective and power-efficient designs. This paper introduces LSTM architectures based on distributed arithmetic (DA), utilizing circulant and block-circulant matrix-vector multiplications (MVMs) for network compression. A quantized-weight approach to training circulant and block-circulant matrices is considered. By formulating fixed-point circulant/block-circulant MVMs, we explore the impact of kernel size on accuracy. Our DA-based approach employs shared full and partial add-store/store-add methods followed by a select unit to realize an MVM, and is coupled with a multi-partial strategy to reduce complexity for larger kernel sizes. Further complexity reduction is achieved by optimizing the decoders of multiple select units. Pipelining the add-store stage enhances speed at the expense of a few pipeline registers. Field-programmable gate array (FPGA) implementation results demonstrate the superiority of the proposed architectures based on the partial store-add method, delivering reductions of 98.71% in DSP slices, 33.59% in slice look-up tables, 13.43% in flip-flops, and 29.76% in power compared with the state of the art.
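To make the compression idea concrete, the sketch below is a minimal NumPy illustration of a block-circulant MVM of the kind described in the abstract: each k × k block of the weight matrix is a circulant matrix fully defined by a single length-k vector, so each block-vector product reduces to a circular convolution and weight storage shrinks by a factor of k. The function names (`circulant_mvm`, `block_circulant_mvm`) and the floating-point arithmetic are illustrative assumptions only; the paper's fixed-point, DA-based hardware realization (add-store/store-add with select units) is not reproduced here.

```python
import numpy as np

def circulant_mvm(c, x):
    """Multiply a k x k circulant matrix, defined by its first column c,
    with a length-k vector x (direct circular convolution)."""
    k = len(c)
    y = np.zeros(k)
    for i in range(k):
        for j in range(k):
            # Entry (i, j) of the circulant matrix equals c[(i - j) mod k].
            y[i] += c[(i - j) % k] * x[j]
    return y

def block_circulant_mvm(blocks, x, k):
    """Compute y = W x for an (m x n) block-circulant W.
    'blocks' has shape (m//k, n//k, k): one defining vector per k x k block,
    so only m*n/k weights are stored instead of m*n."""
    p, q, _ = blocks.shape
    y = np.zeros(p * k)
    for bi in range(p):
        for bj in range(q):
            y[bi * k:(bi + 1) * k] += circulant_mvm(blocks[bi, bj],
                                                    x[bj * k:(bj + 1) * k])
    return y

# Example: a 4 x 8 weight matrix compressed with kernel (block) size k = 4,
# i.e. 8 stored weights instead of 32.
k = 4
blocks = np.arange(1.0, 9.0).reshape(1, 2, k)
x = np.ones(8)
print(block_circulant_mvm(blocks, x, k))  # -> [36. 36. 36. 36.]
```

Larger kernel sizes k give stronger compression but, as the abstract notes, their effect on accuracy has to be evaluated, which motivates the fixed-point formulation studied in the paper.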
| Field | Value |
|---|---|
| Original language | English |
| Pages (from-to) | 1000-1014 |
| Number of pages | 15 |
| Journal | IEEE Journal on Emerging and Selected Topics in Circuits and Systems |
| Volume | 13 |
| Issue number | 4 |
| Early online date | 6 Nov 2023 |
| DOIs | |
| Publication status | Published - Dec 2023 |
Keywords
- Inner products
- long short-term memory
- matrix-vector multiplication
- recurrent neural networks
- VLSI
ASJC Scopus subject areas
- Electrical and Electronic Engineering