2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT) (2010)
Vienna, Austria
Sept. 11, 2010 to Sept. 15, 2010
ISBN: 978-1-5090-5032-1
pp: 273-283
Srihari Cadambi, NEC Laboratories America, Inc., 4 Independence Way, Princeton, NJ 08540, USA
Abhinandan Majumdar, NEC Laboratories America, Inc., 4 Independence Way, Princeton, NJ 08540, USA
Michela Becchi, NEC Laboratories America, Inc., 4 Independence Way, Princeton, NJ 08540, USA
Srimat Chakradhar, NEC Laboratories America, Inc., 4 Independence Way, Princeton, NJ 08540, USA
Hans Peter Graf, NEC Laboratories America, Inc., 4 Independence Way, Princeton, NJ 08540, USA
ABSTRACT
For learning and classification workloads that operate on large amounts of unstructured data under stringent performance constraints, general-purpose processor performance scales poorly with data size. In this paper, we present a programmable accelerator for this workload domain. To architect the accelerator, we profile five representative workloads and find that their computationally intensive portions can be formulated as matrix or vector operations that generate large amounts of intermediate data, which are then reduced by a secondary operation such as array ranking, finding the max/min, or aggregation. The proposed accelerator, called MAPLE, has hundreds of simple processing elements (PEs) laid out in a two-dimensional grid, with two key features. First, it uses in-memory processing, where on-chip memory blocks perform the secondary reduction operations. By doing so, the intermediate data are processed dynamically and never stored or sent off-chip. Second, MAPLE uses banked off-chip memory and organizes its PEs into independent groups, each with its own off-chip memory bank. Together, these two features allow MAPLE to scale its performance with data size. This paper describes the MAPLE architecture, explores its design space with a simulator, and illustrates how to automatically map application kernels to the hardware. We also implement a 512-PE FPGA prototype of MAPLE and find that it is 1.5–10x faster than a 2.5 GHz quad-core Xeon processor despite running at a modest 125 MHz.
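The compute pattern the abstract describes can be illustrated in software. The following is a hypothetical sketch (not the authors' code or MAPLE's programming interface): a primary matrix-vector operation whose per-row results are consumed immediately by a streaming reduction (here, a running max), so the intermediate array is never materialized, and rows are partitioned across independent groups, mirroring MAPLE's PE groups with private off-chip banks. All function and variable names are illustrative.

```python
# Hypothetical sketch of MAPLE's map-then-reduce pattern (illustrative only).

def dot(row, vec):
    """Primary operation: one PE computes a dot product."""
    return sum(a * b for a, b in zip(row, vec))

def maple_style_max(matrix, vec, num_groups=4):
    """Partition rows across independent groups (each group owns a
    memory bank in hardware); each group keeps only a running max,
    so per-row intermediate results are reduced on the fly and
    never stored."""
    group_maxes = []
    for g in range(num_groups):
        rows = matrix[g::num_groups]           # this group's share of rows
        best = float("-inf")
        for row in rows:                       # stream: reduce immediately
            best = max(best, dot(row, vec))
        group_maxes.append(best)
    return max(group_maxes)                    # final cross-group reduction

matrix = [[1, 2], [3, 4], [5, 6], [0, -1]]
vec = [2, 1]
print(maple_style_max(matrix, vec))  # -> 16, the max over all dot products
```

The key property is that memory traffic is proportional to the inputs, not to the intermediate data: only one scalar per group survives the reduction, which is what lets the hardware scale with data size.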
INDEX TERMS
machine learning, Accelerator-based systems, parallel computing, heterogeneous computing
CITATION
Srihari Cadambi, Abhinandan Majumdar, Michela Becchi, Srimat Chakradhar, Hans Peter Graf, "A programmable parallel accelerator for learning and classification," 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 273-283, 2010.