THE FIRST INTERNATIONAL WORKSHOP
COMPUTER ARCHITECTURE FOR
Portland, OR, USA June 14, 2015
In conjunction with the 42nd International Symposium on Computer Architecture (ISCA-42)
Abstracts and Speaker Bios
Amir Khosrowshahi is CTO and co-founder of Nervana Systems. Amir has an AB in physics and math, AM in physics from Harvard, and a PhD in computational neuroscience from Berkeley. His research work applied unsupervised learning algorithms to large-scale neural recordings from visual cortex. His work experience includes trading derivatives at Goldman, Sachs, working at startups like Tellme Networks, Zappos, and what became Google Sheets, as well as working on neuromorphic processors and analog VLSI sensors at Qualcomm Research.
Computer Architectures for Deep Learning
Deep learning is a branch of machine learning that has achieved state-of-the-art results in many domains including images, speech, and text. Nervana Systems is a startup providing deep learning as a cloud platform to a range of verticals including medical imaging, oil and gas, genomics, finance, and e-commerce. Our core technology is a distributed processor architecture for deep learning. I will share some of our experiences optimizing deep learning from algorithms down to hardware to build a compelling service that is scalable and high performance.
Eric S. Chung is a Researcher in Microsoft Research NExT. His research focuses on the intersection of computer architecture and reconfigurable computing with FPGAs, and on exploring disruptive uses of specialized hardware in high-value applications such as machine learning. He is a principal developer and contributor to the Catapult project at Microsoft, which uses a fabric of FPGAs to accelerate cloud services at scale in the datacenter. Eric received his PhD in electrical and computer engineering from Carnegie Mellon University and a BS in EECS from UC Berkeley.
Accelerating Deep Convolutional Neural Networks Using Specialized Hardware in the Datacenter
Recent breakthroughs in the development of multi-layer convolutional neural networks have led to state-of-the-art improvements in the accuracy of non-trivial recognition tasks such as large-category image classification and automatic speech recognition. These many-layered neural networks are large, complex, and require substantial computing resources to train and evaluate. Unfortunately, these demands come at an inopportune moment due to the recent slowing of gains in commodity processor performance.
Hardware specialization in the form of GPGPUs, FPGAs, and ASICs offers a promising path towards major leaps in processing capability while achieving high energy efficiency. At Microsoft, an effort is underway to accelerate Deep Convolutional Neural Networks (CNNs) using servers in the datacenter augmented with FPGAs. Initial efforts to implement a single-node CNN accelerator on a mid-range FPGA show significant promise, resulting in respectable performance relative to prior FPGA designs, multithreaded CPU implementations and high-end GPGPUs, at a fraction of the power. In the future, combining multiple FPGAs over a low-latency communication fabric offers further opportunity to train and evaluate models of unprecedented size and quality.
Vinayak Gokhale is a Ph.D. student at e-Lab, working under the direction of Dr. Eugenio Culurciello at Purdue University. He received Bachelor’s and Master’s degrees, both in Electrical and Computer Engineering, in 2011 and 2014 respectively from Purdue University. Vinayak has worked on the design of custom architectures for accelerating convolutional neural networks for the better part of his Ph.D. and had an invited paper at the Embedded Vision Workshop (part of the Computer Vision and Pattern Recognition Conference), 2014.
A Hardware Accelerator for Convolutional Neural Networks
Machine learning is making its way into more and more products each day. A large part of the population unknowingly uses machine learning on a daily basis. As the number of products incorporating these algorithms grows, there is a push to develop custom hardware to accelerate these algorithms and decrease the power consumed in their processing. Several factors make the design of custom architectures specific to machine learning attractive. One is the large number of computations performed on the input data. Others include memory access patterns, the size of intermediate data produced and the inability of general purpose hardware to fully exploit the inherent parallelism in many of these algorithms. In my talk, I will focus on an architecture targeting one specific algorithm – convolutional neural networks (CNNs). CNNs have gained traction over the past two years and a large part of the industry has already incorporated CNNs in products and services. I will discuss in detail the challenges of designing a hardware accelerator for CNNs and the approach we took to tackle those challenges.
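To give a sense of the computational load that motivates such accelerators, the following is a minimal sketch of a direct (stride-1, no padding) convolutional layer in plain Python that also counts multiply-accumulate (MAC) operations. The function names and the layer shapes are illustrative assumptions; this is not the accelerator design discussed in the talk.

```python
def conv2d_layer(inputs, kernels):
    """Direct convolution, stride 1, no padding (illustrative sketch).

    inputs:  list of C_in feature maps, each an H x W list of lists
    kernels: list of C_out filters, each a C_in x K x K nested list
    Returns (outputs, mac_count).
    """
    c_in = len(inputs)
    h, w = len(inputs[0]), len(inputs[0][0])
    k = len(kernels[0][0])
    out_h, out_w = h - k + 1, w - k + 1
    macs = 0
    outputs = []
    for filt in kernels:                      # one output map per filter
        fmap = [[0.0] * out_w for _ in range(out_h)]
        for y in range(out_h):
            for x in range(out_w):
                acc = 0.0
                for c in range(c_in):         # sum over input channels
                    for dy in range(k):
                        for dx in range(k):
                            acc += inputs[c][y + dy][x + dx] * filt[c][dy][dx]
                            macs += 1
                fmap[y][x] = acc
        outputs.append(fmap)
    return outputs, macs


def conv_mac_count(c_in, c_out, h, w, k):
    """Closed-form MAC count for the same layer shape."""
    return c_out * (h - k + 1) * (w - k + 1) * c_in * k * k
```

Every output pixel is independent of the others, so the three outer loops can in principle run in parallel; the closed-form count shows how quickly the work grows with channel count and kernel size.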
Paul Burchard is a Managing Director at Goldman Sachs Group. Prior to working at Goldman, he invented techniques for manufacturing integrated circuit patterns much smaller than the diffraction limit in research at UCLA, and designed software for ASML. He designed his first parallel computing chip in the 1980s while obtaining his Ph.D. in mathematics from the University of Chicago.
Hardware Acceleration for Communication-Intensive Algorithms
A broad class of algorithms used in machine learning, scientific computing, and other areas can benefit from a high degree of connectivity between parallel computing cores, and furthermore, from the ability to perform in parallel a small amount of computation on the communicated data. We take inspiration from the brain, which is a highly connected architecture where the synapses perform some computation on data coming into the neuron. The need for better connectivity in parallel computing has long been understood, but has been held back from the mainstream by costly, specialized hardware and difficult programming interfaces. We propose a new general-purpose connectionist hardware architecture that should be possible to manufacture at reasonable cost due to increasing acceptance of 3D chip manufacturing techniques, and can be naturally programmed via simple SIMD “synapse” coprocessors attached to each core.
Jonathan Pearce is a research scientist in the Accelerator Architecture Lab at Intel Corporation. His research interests include heterogeneous compute architectures and specialized programmable hardware designs. Before joining Intel Labs, he was a parallel compute architect for Intel integrated graphics, where his responsibilities included debug of the world’s first OpenCL 2.0 platform for the Broadwell processor and definition of new architectural features for increased compute efficiency on future Intel integrated graphics processors. Jonathan has worked previously on coherent fabric performance for the 1st generation Core i7/i5/i3, called Nehalem. He also worked briefly on simulation and microarchitecture definition of a throughput computing core targeting graphics processing. His first significant contribution to Intel was tuning aggressiveness of the Prescott hardware prefetcher using a genetic algorithm. Jonathan earned both M.S. and B.S. degrees in electrical and computer engineering from Carnegie Mellon University.
You Have No (Predictive) Power Here, SPEC!
Many machine learning applications are compute bound (or should be, with proper software), yet the traditional benchmarks for parallel computing are a poor proxy for performance on ML applications. We desire a benchmark for machine learning to stimulate work in computer architecture in this area and to provide a ranking of computer systems that predicts their suitability for machine learning applications. We break down the machine learning landscape into compute patterns relevant to computer architects and explore these compute patterns in terms of characteristic bottlenecks. We show that the characteristic bottlenecks of ML applications depend upon data-dependent operations and sequential fractions of execution that are poorly represented by benchmarks available from SPEC. Even SPEC subtests which exercise the relevant compute patterns fail to do so in a way relevant to machine learning. We conclude with the aspects of a focused machine learning benchmark suite: large and very large input data, not bit-accurate, a goodness metric incorporating both accuracy and performance, and a modular approach to kernel implementation to incorporate new developments in ML best practices and algorithms.
Scott Beamer is a Computer Architecture PhD candidate at UC Berkeley advised by Krste Asanović and David Patterson. He is currently investigating how to accelerate graph algorithms through software optimization and hardware specialization. In the past, he looked into how best to use monolithically integrated silicon photonics to create memory interconnects. He received a B.S. in Electrical Engineering and Computer Science and an M.S. in Computer Science, both from UC Berkeley.
Graph Processing Bottlenecks of an Ivy Bridge Server
Graph processing is experiencing a surge of interest, as applications in social network analysis have grown in importance, alongside new applications in recognition and the sciences. Graph algorithms are notoriously difficult to execute efficiently, so there has been considerable recent effort in improving the performance of processing large graphs for these important applications. In this work, we focus on the performance of a shared-memory multiprocessor node executing optimized graph algorithms. We analyze the performance of three high-performance graph processing codebases, each using a different parallel runtime, and we measure results for these graph libraries using five different graph kernels and a variety of large input graphs. We use microbenchmarks and hardware performance counters to analyze the bottlenecks these optimized codes experience when executed on a modern Intel Ivy Bridge server. From our analysis, we derive insights that contradict some prior conventional wisdom about graph processing workload characteristics. Contrary to the notion that graph algorithms have a random memory access pattern, we find these well-tuned parallel graph codes exhibit substantial locality and thus experience a moderately high hit rate in the last-level cache (LLC). These well-tuned graph codes also frequently struggle to fully utilize the off-chip memory system and are thus limited by memory latency, not memory bandwidth. We find the reorder buffer size to be the biggest limiter of memory throughput, as it is not large enough to hold enough instructions to expose the relatively rare LLC-missing loads early. In this context (superscalar out-of-order multicore), we find multithreading to have only modest room for performance improvement on graph codes. Additionally, we find that different input graph sizes and topologies can lead to very different conclusions for algorithms and architectures, so it is important to consider a range of input graphs.
Given our observations of simultaneous low compute and bandwidth utilization, we find there is substantial room for a different processor architecture to improve performance without requiring a new memory system. In our talk, we discuss our empirical results and make recommendations for future work in both hardware and software to improve graph algorithm performance.
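One way to see how a core can be latency-bound while leaving memory bandwidth unused is Little's law: sustained throughput equals the number of requests in flight divided by the latency of each request. The sketch below is an illustration only; the function names and all numbers in the usage note (64-byte cache lines, 80 ns latency, 10 outstanding misses) are hypothetical placeholders, not measurements from this work.

```python
def sustained_bandwidth_gbs(outstanding_misses, line_bytes, latency_ns):
    """Little's law: throughput = concurrency / latency.

    bytes per nanosecond is numerically equal to GB/s (10**9 bytes/s).
    """
    return outstanding_misses * line_bytes / latency_ns


def misses_needed(target_gbs, line_bytes, latency_ns):
    """Outstanding cache-line misses required to sustain target_gbs."""
    return target_gbs * latency_ns / line_bytes
```

With the placeholder values, `sustained_bandwidth_gbs(10, 64, 80)` gives 8 GB/s per core. If the instruction window cannot expose more misses at once, the core cannot draw more bandwidth no matter how much the memory system could supply.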
James E. Smith is Professor Emeritus with the Department of Electrical and Computer Engineering at the University of Wisconsin-Madison. He received his PhD from the University of Illinois in 1976. He then joined the faculty of the University of Wisconsin-Madison, teaching and conducting research first in fault-tolerant computing, then in computer architecture. He has been involved in a number of computer research and development projects both as a faculty member at Wisconsin and in industry.
Prof. Smith has made a number of significant contributions to the development of superscalar processors. These contributions include basic mechanisms for dynamic branch prediction and implementing precise traps. He has also studied vector processor architectures and worked on the development of innovative microarchitecture paradigms. He received the 1999 ACM/IEEE Eckert-Mauchly Award for these contributions. Currently, he is studying computational neuroscience at home along the Clark Fork near Missoula, Montana.
Biologically Plausible Spiking Neural Networks
This talk describes a distributed cognitive architecture and implementations based on models that employ neuron-like operation. Two main topics are covered.
The first is a time abstraction which supports communication and computation using actual spike timing relationships. This abstraction provides a system design model that performs sequences of neural operations in abstract steps, with actual timing details being hidden. Meanwhile, each modeled neuron operates on inputs, and produces outputs, using its own local frame of reference based on actual times. The second is a biologically plausible spiking neuron model, in which each neuron operates within its own temporal frame of reference. Multiple synaptic paths connect pairs of neurons, with the paths exhibiting a range of delays. Weights are established via a form of conventional spike time dependent plasticity. The resulting neuron yields aligned compound spike responses when an evaluation spike pattern matches the training pattern(s). An abstract version of this model is suitable for machine learning applications.
Giacomo Indiveri is a Professor at the Faculty of Science of the University of Zurich, Switzerland. He obtained an M.Sc. degree in Electrical Engineering and a Ph.D. degree in Computer Science from the University of Genoa, Italy. Indiveri was a post-doctoral research fellow in the Division of Biology at the California Institute of Technology (Caltech) and at the Institute of Neuroinformatics of the University of Zurich and ETH Zurich, where he attained the Habilitation in Neuromorphic Engineering in 2006. He is an ERC fellow and an IEEE Senior member. His current research interests lie in the study of real and artificial neural processing systems, and in the hardware implementation of neuromorphic cognitive systems, using full custom analog and digital VLSI technology.
Neuromorphic circuits and for building autonomous cognitive systems
Neuromorphic computing aims to reproduce the principles of neural computation by emulating as faithfully as possible the detailed biophysics of the nervous system in hardware. Examples of neuromorphic computing systems include full custom mixed-signal analog VLSI devices that implement spiking neural networks. In this presentation I will describe neuromorphic electronic circuits for directly emulating the properties of neurons and synapses, and show how they can be configured to implement real-time compact neural processing systems. I will present multi-core architectures with analog circuits for the synapse and neural dynamics and asynchronous digital routing circuits for transmission of spikes and configuration of different types of network architectures, such as convolutional nets or deep networks. I will show examples of networks and systems that implement on-chip on-line learning for pattern recognition and classification tasks. I will discuss the possible applications and tasks that can take best advantage of such technology and present an outlook for future scaled autonomous neuromorphic systems.
Yiran Chen received his Ph.D. from Purdue University and is now an associate professor in the Electrical and Computer Engineering department at the University of Pittsburgh. He has authored more than 200 technical publications, received 86 U.S. patents, and serves as associate editor of several ACM and IEEE journals and transactions. He has received several best paper awards and nominations, and many other professional awards including the NSF CAREER award and the ACM SIGDA outstanding new faculty award, and was an invitee to the 2013 U.S. Frontiers of Engineering Symposium of the National Academy of Engineering. He now co-directs the Evolutionary Intelligence Laboratory (www.ei-lab.org), working on nonvolatile memory, neuromorphic computation, and mobile systems.
Hardware Acceleration for Neuromorphic Computing: An Evolving View
The rapid growth in the computing capacity of modern microprocessors has enabled the wide adoption of machine learning and neural network models. The ever-increasing demand for performance, combined with concerns about power budgets, has motivated recent research on hardware acceleration for these learning algorithms. A wide spectrum of hardware platforms has been extensively studied, from conventional heterogeneous computing systems to emerging nanoscale systems. In this talk, we will review the ongoing efforts at Evolutionary Intelligence Laboratory (www.ei-lab.org) on hardware acceleration for neuromorphic computing and machine learning. Realizations on various platforms such as FPGA, on-chip heterogeneous processors, and memristor-based ASIC designs will be explored. An evolving view of the accelerator designs for learning algorithms will also be presented.
Mikko Lipasti is the Reed Professor of Electrical and Computer Engineering at the University of Wisconsin-Madison. He has published over 100 peer-reviewed papers in journals, conferences, and workshops, has advised 17 Ph.D. students to completion, and has recently co-founded a venture-funded technology startup to commercialize his research in neuromorphic computing. Before joining Wisconsin, he worked for six years in server system development at IBM Corporation. He was named an IEEE Fellow (class of 2013) “for contributions to the microarchitecture and design of high-performance microprocessors and computer systems.” His primary research interests include high-performance, low-power, and reliable processor cores; networks-on-chip for many-core processors; and fundamentally new, biologically-inspired models of computation.
Mimicking the Self-Organizing Properties of the Visual Cortex
The ‘blank slate’ hypothesis states that much of the structure, connectivity, and functionality of the mammalian cortex emerges in response to stimuli received early in development, leading to specialized cortical structures tailored for various sensory modalities. By implication, the emergence of these cortical structures must be governed by some as-yet-undiscovered set of principles for self-organization that are implemented in the learning rule that the cortex employs early in its development. In this talk, I will describe some initial work towards understanding these principles and learning rules in the context of map-seeking circuits, a biologically-inspired mathematical model of invariant object recognition in the visual cortex. Map-seeking circuits (MSC) implement a hierarchy of image transforms (maps) that provide scale, rotation, and translation invariance so that a single pre-loaded invariant memory of an object can be matched against transformed versions of the object in the visual input stream. MSC performs an efficient search of a large repertoire of pre-programmed maps, which are manually organized into orthogonal layers. After iterating to convergence, MSC returns a vector that identifies both the matched object (a standard classification result) as well as the set of maps used to determine the match. We recently demonstrated a self-organizing variant of MSC (SO-MSC), which initially contains no maps, no hierarchy, and no invariant object memories, but only a simple set of self-organizing principles that are used to adapt its structure in response to visual stimuli. SO-MSC acquires new memories in an unsupervised manner simply through exposure to objects in its visual input stream, and learns invariant transforms automatically by observing frame-to-frame differences in persistent objects, recording these as new maps, and organizing these maps into orthogonal layers based on their mathematical properties. 
The resulting SO-MSC generalizes transforms learned from one object and robustly recognizes test images with high degrees of variation in scale, rotation, translation, illumination, and perspective.
Shai Fine is a Principal Engineer at the Advanced Analytics group in Intel, focusing on Machine Learning, Business Intelligence, and Big Data. Prior to Intel, Shai worked for the IBM Research Lab in Haifa, managing the Analytics research department. Shai received his Ph.D. in computer science in 1999 from the Hebrew University in Israel, and conducted his postdoctoral research at the Human Language Technologies department in IBM’s T.J. Watson Research Center in New York. Shai has published over 30 papers in refereed journals and conference proceedings, and co-invented 10 patents in various domains of Advanced Analytics.
Machine Learning Building Blocks
These are exciting times for the Machine Learning realm – Big Data Analytics is attracting more interest than ever before. This, in turn, creates a flood of innovative ideas, problems and tasks to handle, and new modelling techniques, algorithms, methodologies and tools are springing up like mushrooms after rain. This also poses some challenges for technologists who strive to keep pace with the explosion of new algorithmic and modelling toolsets, and to provide relevant and competitive solutions. The goal of this work is to help close this gap. To this end, we will introduce the concept of Machine Learning Building Blocks, which is a finite set of elements that can be mapped into hardware and software primitives and patterns. We will provide some intuition for the definition of the basic building blocks, and specific examples for the mapping to commonly used algorithms and modeling techniques, data characteristics, and usage scenarios. Finally, we will demonstrate the implications of the building blocks concept, for example in designing representative workloads that designers, developers and vendors can profile and analyze.
Chunkun Bo is a Ph.D. student in Dept. of Computer Science, University of Virginia. He received his Bachelor’s degree and Master’s degree from the Harbin Institute of Technology and University of Science and Technology of China respectively. His current research focuses on heterogeneous computing.
String Kernel Testing Acceleration using the Micron Automata Processor
String kernel (SK) is a widely used technique in the Machine Learning field, particularly for biological sequence analysis. Instead of comparing whole sequences, string kernel methods count the occurrences of representative short subsequences, called K-mers. In the string kernel model, a mapping function projects the input strings to a higher-dimensional (number of K-mers) feature space. High-performance processing is required in the testing phase of SK, especially for large databases. Feature vector computation is the current performance bottleneck of the SK testing stage and involves a lot of pattern-matching work. A direct computation is computationally expensive even for small values of K, because the dimension of the feature space grows exponentially with K. To speed up the testing phase of SK, we propose a hardware acceleration solution based on Micron's Automata Processor. Micron's Automata Processor (AP) is an efficient and scalable semiconductor architecture for parallel automata processing. It is a hardware implementation of non-deterministic finite automata and is capable of matching complex patterns in parallel. This architecture has shown outstanding performance in association rule mining and DNA motif search [Wang et al., IPDPS’15; Roy et al., IPDPS’14]. Our proposed method stores the K-mers of interest on the AP, and these K-mers are matched against the input sequence in parallel. We can also implement different mapping rules on the AP, such as exact match, mismatch and gappy match. This method works especially well for genomics for the following two reasons. First, genome data has a small alphabet size Σ (A, T, G, C). The total number of possible K-mers equals Σ^K, far fewer than for data with a larger alphabet. Second, the AP can efficiently match many permutations of a given pattern, which helps to deal with mutations that are of great importance in many genomic contexts.
Performance results show great potential for the proposed method to compute feature vectors and accelerate the SK testing phase. It achieves 8x, 25x, 139x, 418x, 1438x and 3978x speedups over PatMaN (a rapid CPU alignment tool) for mismatch = 0, 1, 2, 3, 4, 5 respectively. The speedup increases exponentially as more mismatches are allowed.
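For readers unfamiliar with the feature-vector computation described above, the following is a minimal CPU reference sketch in Python of K-mer counting with an optional mismatch budget. It is not the Automata Processor implementation; the function names and the brute-force matching loop are assumptions chosen for clarity, and the loop over all K-mers is exactly the pattern-matching work the abstract proposes to run in parallel in hardware.

```python
from itertools import product

ALPHABET = "ACGT"  # genomic alphabet, so there are 4**k possible K-mers


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def kmer_features(sequence, k, max_mismatch=0, kmers=None):
    """Map a sequence to K-mer counts (the string kernel feature vector).

    A window counts towards a K-mer if it is within max_mismatch
    substitutions of it; max_mismatch=0 is the exact-match rule.
    """
    if kmers is None:
        kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = {kmer: 0 for kmer in kmers}
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        for kmer in kmers:  # brute force: every K-mer against every window
            if hamming(window, kmer) <= max_mismatch:
                counts[kmer] += 1
    return counts


def string_kernel(seq_a, seq_b, k, max_mismatch=0):
    """Kernel value: inner product of the two feature vectors."""
    fa = kmer_features(seq_a, k, max_mismatch)
    fb = kmer_features(seq_b, k, max_mismatch)
    return sum(fa[kmer] * fb[kmer] for kmer in fa)
```

The cost of this sketch is proportional to the sequence length times 4^K times K, which illustrates why direct computation becomes expensive even for small K.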
Ran Ginosar received his BSc from the Technion and his PhD from Princeton University. His PhD research focused on shared-memory multiprocessors. He conducted research at AT&T Bell Laboratories in 1982-1983, and joined the Technion faculty in 1983. He was a visiting Associate Professor with the University of Utah in 1989-1990, and a visiting faculty member with Intel Research Labs in 1997-1999. His research interests include VLSI architecture, many-core computers, big data machine learning accelerators, asynchronous logic and synchronization, networks on chip and biologic implant chips. He has co-founded several companies in various areas of VLSI systems.
Accelerators for Machine Learning of Big Data
We address “big data” where a single machine learning application needs to readily access one exabyte. A 100 MW data center would require an hour to access the entire data set once (due to power constraints). Machine learning tasks call for sparse matrix operations such as matrix-vector and matrix-matrix multiplications on floating-point, integer or binary data. “Big data machine learning” means that every item in the exabyte data set needs to be correlated with a large portion of all other data items of the same exabyte data set (a super-linear operation). With only 100 MW, such computation would take days merely to access and move the data around, even before counting the required CPU power. We consider 3-D chips integrating massive storage with massively parallel accelerators in order to obtain a 10-100x improvement in energy requirements for data access and processing of big data machine learning tasks. Associative Processors (AP) combine data storage and processing in each memory cell, and function simultaneously as both a massively parallel computing array and a memory. Using resistive CAM enables an AP with a few hundred million processing units on a single silicon die. We also study GP-SIMD, a more conventional parallel array combined with massive memory and general processors on-die. A resistive GP-SIMD achieves two orders of magnitude improvement in density compared to CMOS-based architectures.
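The one-hour figure above implies an energy budget per byte accessed. The sketch below works that out as a back-of-the-envelope check under two simplifying assumptions that are not stated in the abstract: that the full 100 MW is spent on data access, and that one exabyte is 10^18 bytes. The function names are illustrative.

```python
EXABYTE = 10**18  # bytes, decimal definition (assumption)


def energy_per_byte_joules(power_watts, seconds, bytes_accessed):
    """Energy available per byte if all power goes to data access."""
    return power_watts * seconds / bytes_accessed


def access_time_seconds(power_watts, joules_per_byte, bytes_accessed):
    """Time to touch every byte once at a fixed power budget."""
    return joules_per_byte * bytes_accessed / power_watts


# 100 MW for one hour over one exabyte: roughly 0.36 microjoules per byte.
baseline = energy_per_byte_joules(100e6, 3600, EXABYTE)

# At the same power, a 10x energy improvement shortens one pass to about 6 minutes.
improved_pass = access_time_seconds(100e6, baseline / 10, EXABYTE)
```

Under these assumptions, the 10-100x energy improvement targeted by the 3-D designs translates directly into a 10-100x shorter pass over the data at the same power, or the same pass time at a fraction of the power.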