Abstract
Long short-term memory (LSTM) networks have addressed the shortcomings of recurrent neural networks, such as vanishing gradients and the inability to form connections across discontinuous parts of a sequence. However, implementations of state-of-the-art LSTM networks face a computational bottleneck: multiple high-order matrix-vector multiplications (MVMs). This article presents a generalized approach to accelerating the circulant MVM (C-MVM), which therefore applies to many neural networks. The proposed scheme is a novel low-complexity distributed arithmetic (DA) architecture for optimizing C-MVMs. Unlike conventional offset binary coding-based DA (OBC-DA), it separates the generation and selection of partial products, so only one partial product generator (PPG) and several partial product selectors (PPSs) are required. The complexity of the PPSs is reduced by sharing minterms across Boolean expressions. Fine-grained pipelining is employed to achieve a critical path of approximately one adder delay. Implementation results show that, for a 512 × 512 LSTM layer, the proposed design occupies 74.54% less core area, consumes 68.66% less core power, offers 2.61 times more throughput, and provides 3.89 times more hardware efficiency than the best existing design.
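As a plain illustration of the structure the abstract exploits, the sketch below computes a circulant MVM in software: every row of the matrix is a cyclic shift of a single coefficient vector, so all outputs draw on the same pool of products. This is only an assumed, minimal reference model of C-MVM, not the paper's DA hardware architecture; the function name `circulant_mvm` and the example values are hypothetical.

```python
import numpy as np

def circulant_mvm(c, x):
    """Circulant matrix-vector product y = C @ x, where row i of C is the
    first row c cyclically shifted right by i positions.

    Because every row reuses the same N coefficients, all N outputs can be
    built from one shared set of partial products -- the structural property
    that a DA-based design (one PPG feeding several PPSs, per the abstract)
    can exploit in hardware. This loop is just a software reference, not the
    proposed architecture.
    """
    n = len(c)
    y = np.empty(n, dtype=np.result_type(c, x))
    for i in range(n):
        # Row i of the circulant matrix: c rotated right by i positions.
        row = np.roll(c, i)
        y[i] = row @ x
    return y

# Hypothetical 4 x 4 example.
c = np.array([1, 2, 3, 4])
x = np.array([1, 0, -1, 2])
print(circulant_mvm(c, x))
```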
| Original language | English |
| --- | --- |
| Pages (from-to) | 329-338 |
| Number of pages | 10 |
| Journal | IEEE Transactions on Very Large Scale Integration (VLSI) Systems |
| Volume | 28 |
| Issue number | 2 |
| DOIs | |
| Publication status | Published - Feb 2020 |
Keywords
- Distributed arithmetic (DA)
- long short-term memory (LSTM) networks
- offset binary coding (OBC)
- Logic gates
- Hardware
- Table lookup
- Memory management
- Acceleration
- Complexity theory
ASJC Scopus subject areas
- Software
- Hardware and Architecture
- Electrical and Electronic Engineering