Issue No. 02 - April-June (2018 vol. 4)
ISSN: 2332-7766
pp: 99-112
Igor Loi, DEI, University of Bologna, Bologna, Emilia Romagna, Italy
Alessandro Capotondi, DEI, University of Bologna, Bologna, Emilia Romagna, Italy
Davide Rossi, DEI, University of Bologna, Bologna, Emilia Romagna, Italy
Andrea Marongiu, DEI, University of Bologna, Bologna, Emilia Romagna, Italy
Luca Benini, DEI, University of Bologna, Bologna, Emilia Romagna, Italy
ABSTRACT
High performance and extreme energy efficiency are strong requirements for a fast-growing number of edge-node Internet of Things (IoT) applications. While traditional Ultra-Low-Power designs rely on single-core micro-controllers (MCUs), a new generation of architectures leveraging fully programmable, tightly-coupled clusters of near-threshold processors is emerging, combining the performance gain of parallel execution over multiple cores with the energy efficiency of low-voltage operation. In this work, we tackle one of the most critical energy-efficiency bottlenecks for these architectures: the instruction memory hierarchy. Exploiting the instruction locality typical of data-parallel applications, we explore two different shared instruction cache architectures, both based on energy-efficient latch-based memory banks: one leveraging a crossbar between processors and single-port banks (SP), and one leveraging banks with multiple read ports (MP). We evaluate the proposed architectures on a set of signal processing applications with different executable sizes and working sets. The results show that the shared cache architectures can efficiently execute a much wider set of applications (including those featuring a large memory footprint and irregular access patterns) with a much smaller area and much better energy efficiency than the private cache. The multi-port cache is suitable for sizes up to a few kB, improving performance by up to 40 percent, energy efficiency by up to 20 percent, and energy $\times$ area efficiency by up to 30 percent with respect to the private cache. The single-port solution is more suitable for larger cache sizes (up to 16 kB), providing up to 20 percent better energy $\times$ area efficiency than the multi-port cache, and up to 30 percent better energy efficiency than the private cache.
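The difference between the SP and MP organizations can be illustrated with a toy contention model. This is only a sketch under stated assumptions, not the authors' design: the line-interleaved bank mapping, the 16-byte line size, the merging of same-line requests into one access, and the function names are all hypothetical choices made for the example.

```python
from collections import defaultdict

LINE_BYTES = 16  # assumed cache-line size, not taken from the paper


def bank_of(addr, num_banks, line_bytes=LINE_BYTES):
    """Map a fetch address to a bank by interleaving cache lines (assumed policy)."""
    return (addr // line_bytes) % num_banks


def fetch_cycles_single_port(addrs, num_banks, line_bytes=LINE_BYTES):
    """Cycles needed to serve one fetch per core with single-port banks.

    Assumptions: requests to the same line are merged into one bank access,
    while distinct lines that map to the same bank are served one per cycle.
    """
    lines_per_bank = defaultdict(set)
    for addr in addrs:
        line = addr // line_bytes
        lines_per_bank[bank_of(addr, num_banks, line_bytes)].add(line)
    # The busiest bank determines how long the slowest core waits.
    return max((len(lines) for lines in lines_per_bank.values()), default=0)


def fetch_cycles_multi_port(addrs):
    """Cycles when each core has its own read port: no bank contention (assumed)."""
    return 1 if addrs else 0


# Four cores fetching the same instruction (typical data-parallel lockstep).
lockstep = [0x100, 0x100, 0x100, 0x100]
# Four cores fetching distinct lines that all fall in bank 0 of 4.
conflicting = [0x100, 0x140, 0x180, 0x1C0]

sp_lockstep = fetch_cycles_single_port(lockstep, num_banks=4)
sp_conflict = fetch_cycles_single_port(conflicting, num_banks=4)
mp_conflict = fetch_cycles_multi_port(conflicting)
```

Under these assumptions, cores that fetch the same line cost the SP cache a single cycle, while divergent fetches that collide on one bank are serialized. The MP model never serializes, which in real hardware is paid for with the area and energy of the extra read ports.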
INDEX TERMS
Memory management, Program processors, Standards, Random access memory, Microprocessors, Libraries
CITATION

I. Loi, A. Capotondi, D. Rossi, A. Marongiu and L. Benini, "The Quest for Energy-Efficient I\$ Design in Ultra-Low-Power Clustered Many-Cores," in IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 2, pp. 99-112, 2018.
doi:10.1109/TMSCS.2017.2769046