One of the most ubiquitous technological advancements making its way into the devices we use every day is autonomy. Autonomous technology uses artificial intelligence (AI) and machine learning (ML) algorithms to carry out core functions without human intervention. As the adoption of ML becomes more widespread, more businesses are using ML models to support mission-critical operational processes. This increasing reliance on ML has created a need for real-time capabilities that improve accuracy and reliability and shorten the feedback loop.
Previously, these computations were processed in the cloud rather than on-device, because the AI/ML models required to complete such tasks were too large, costly and computationally hungry to run locally. Instead, the technology relied on cloud computing, outsourcing data tasks to remote servers via the internet. While this was an adequate solution when IoT technology was in its infancy, it was never infallible: although cloud computing has proven to be a transformational tool for storing and processing data, it comes with performance and bandwidth limitations that are poorly suited to autonomy at scale, which demands near-instantaneous reactions with minimal lag. To date, certain technologies have been limited by the parameters of cloud computing.
The need for new processing units
The central processing units (CPUs) commonly used in traditional computing devices are not well-suited for AI workloads due to two main issues:
Latency in data fetching: AI workloads involve large amounts of data, and the cache memory in a CPU is too small to store all of it. As a result, the processor must constantly fetch data from dynamic random access memory (DRAM), which creates a significant bottleneck. While newer multicore CPU designs with multithreading capabilities can alleviate this issue to some extent, they are not sufficient on their own.
Latency in instruction fetching: In addition to the large volume of data, AI workloads require many repetitive matrix-vector operations. CPUs typically use single instruction, multiple data (SIMD) architectures, which means they must frequently fetch operational instructions from memory to apply to the same dataset. The latest generation of AI processors aims to address these challenges through two approaches: (i) expanding the multicore design to allow thousands of threads to run concurrently, thereby addressing the latency in data fetching, or (ii) building processors with thousands of logic blocks, each preprogrammed to perform a specific matrix-vector operation, thereby addressing the latency in instruction fetching. A minimal sketch of such a matrix-vector workload follows below.
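To make that workload concrete, here is a minimal sketch (not tied to any particular chip) of the kind of matrix-vector multiply that dominates neural-network inference. Every output element repeats the same multiply-accumulate pattern, which is exactly what the thousands of concurrent threads or preprogrammed logic blocks described above are meant to parallelize.

```python
import numpy as np

# One fully connected neural-network layer is just a matrix-vector product:
# every output element repeats the same multiply-accumulate pattern.
def dense_layer(weights: np.ndarray, activations: np.ndarray) -> np.ndarray:
    out = np.zeros(weights.shape[0])
    for row in range(weights.shape[0]):        # thousands of rows in practice
        for col in range(weights.shape[1]):    # same instruction, different data
            out[row] += weights[row, col] * activations[col]
    return out

weights = np.random.randn(256, 256).astype(np.float32)
activations = np.random.randn(256).astype(np.float32)

# A CPU works through these multiply-accumulates largely sequentially, refetching
# data and instructions; AI processors spread them across many parallel units.
result = dense_layer(weights, activations)
assert np.allclose(result, weights @ activations, atol=1e-3)
```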
First introduced in the 1980s, field programmable gate arrays (FPGAs) offered the benefit of being reprogrammable, which enabled them to gain traction across diverse sectors such as telecommunications, automotive, industrial, and consumer applications. In AI workloads, FPGAs address the latency associated with instruction fetching: they consist of tens of thousands of logic blocks, each preprogrammed to carry out a specific matrix-vector operation. On the flip side, FPGAs are expensive, have large footprints, and are time-consuming to program.
Graphics processing units (GPUs) were initially developed in the 1990s to improve the speed of image processing for display devices. They have thousands of cores that enable efficient multithreading, which helps to reduce data fetching latency in AI workloads. GPUs are effective for tasks such as computer vision, where the same operations must be applied to many pixels. However, they have high power requirements and are not suitable for all types of edge applications.
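As a rough illustration of that data parallelism, the sketch below applies one identical operation, a luma-style grayscale conversion, to every pixel of an image at once. NumPy vectorization on a CPU is used here purely as a stand-in for what a GPU does by mapping each pixel onto its own thread.

```python
import numpy as np

# Synthetic 1080p RGB image: ~2 million pixels, each needing the same arithmetic.
image = np.random.randint(0, 256, size=(1080, 1920, 3), dtype=np.uint8)

# Identical per-pixel operation (weighted sum of R, G, B channels). On a GPU each
# pixel naturally maps to its own thread; vectorization stands in for that here.
rgb_weights = np.array([0.299, 0.587, 0.114])
grayscale = (image.astype(np.float32) @ rgb_weights).astype(np.uint8)

print(grayscale.shape)  # (1080, 1920): one output value per input pixel
```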
Specialized chips, known as AI chips, are often used in data centers for training algorithms or making inferences. Although certain AI/ML processor architectures are more energy-efficient than GPUs, they often work only with specific algorithms or rely on uncommon data types, such as 4- and 2-bit integers or binarized neural networks. As a result, they lack the versatility to be deployed capital-efficiently in data centers. Further, training algorithms requires significantly more computing power than making individual inferences, and batch-mode processing for inference can cause latency issues. The requirements for AI processing at the network edge, such as in robotics, Internet of Things (IoT) devices, smartphones, and wearables, can vary greatly, and in cases like the automotive industry, it is not feasible to send certain types of work to the cloud due to latency concerns.
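To show what those uncommon data types give up, here is a minimal, hypothetical sketch of the kind of aggressive quantization such architectures rely on: binarizing weights to ±1 and mapping them to 4-bit integers. Actual schemes vary by vendor; this only illustrates how much numerical range is sacrificed.

```python
import numpy as np

weights = np.random.randn(256).astype(np.float32)

# Binarized weights: only the sign survives, scaled by the mean magnitude
# (a common construction in binarized-neural-network literature).
scale = np.abs(weights).mean()
binary_weights = np.sign(weights) * scale

# 4-bit integer quantization: only 16 representable levels for the whole range.
levels = 2 ** 4
w_min, w_max = weights.min(), weights.max()
step = (w_max - w_min) / (levels - 1)
int4_codes = np.round((weights - w_min) / step).astype(np.int8)   # codes 0..15
int4_weights = int4_codes * step + w_min

for name, approx in [("binary", binary_weights), ("int4", int4_weights)]:
    err = np.abs(weights - approx).mean()
    print(f"{name}: mean absolute reconstruction error = {err:.4f}")
```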
Lastly, application-specific integrated circuits (ASICs) are integrated circuits tailored to specific applications. Because an entire ASIC is dedicated to a narrow set of instructions, ASICs are much faster than GPUs; however, they do not offer as much flexibility as GPUs or FPGAs in handling a wide range of applications. As a consequence, ASICs are increasingly gaining traction for AI workloads in the cloud at large companies like Amazon and Google. However, it is less likely that ASICs will find traction in edge computing because of the fragmented nature of applications and use cases.
The departure from single-threaded compute, combined with the large volume of raw data generated today (which makes continuous transfer impractical), led to the emergence of edge computing, an extension of cloud computing that addresses many of these shortcomings. The development of semiconductor manufacturing processes for ultra-small circuits (7nm and below) that pack more transistors onto a single chip allows faster processing speeds and higher levels of integration. This leads to significant improvements in performance, as well as reduced power consumption, enabling broader adoption of this technology for a wide range of edge applications.
Edge computing places resources closer to the end user or the device itself (at the “edge” of a network) rather than in a cloud data center that oversees data processing for a large physical area. Because this technology sits closer to the user and/or the device and doesn’t require transferring large amounts of data to a remote server, edge-powered chips increase processing speed, reduce lag and ensure better data privacy. Additionally, since edge AI chips are physically smaller, they’re more affordable to produce and consume less power. As an added bonus, they also produce less heat, which is why fewer of our electronics get hot to the touch with extended use. AI/ML accelerators designed for use at the edge tend to have very low power consumption but are often specialized for specific applications such as audio processing, visual processing, object recognition, or collision avoidance. Today, this specialized focus can make it difficult for startups to achieve the sales volume necessary for success, due to market fragmentation.
Supporting mission-critical operational processes at the edge
Arguably the most important advantage of edge AI chips in helping the technology reach its full potential is significantly faster operation and decision-making. Nearly every application in use today requires a near-instantaneous response, whether to deliver better performance for an improved user experience or to execute mission-critical reflex maneuvers that directly impact human safety. Even in non-critical applications, the growing number of connected devices and equipment coming online is creating bandwidth bottlenecks that limit deployment, as current telecommunications networks may not have sufficient capacity to handle the data volume and velocity these devices generate.
For example, from an industrial perspective, an automated manufacturing facility is expected to generate 4 petabytes of data every day. Even at the fastest (and in practice unattainable) 5G speed of 10 Gbps, it would take well over a month to transfer a single day’s worth of data to the cloud. Additionally, at a rate of $0.40 per GB over 5G, the cost of transferring all this data could reach as much as $1.6 million per day. And unsurprisingly, the autonomous vehicle industry will rely on the fastest, most efficient edge AI chips to ensure the quickest possible response times in a constantly changing roadway environment, situations that can quite literally mean life and death for drivers and pedestrians alike.
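The back-of-the-envelope arithmetic behind those figures is easy to check; the sketch below reproduces it under the assumptions stated above (4 PB per day, a sustained 10 Gbps link, and $0.40 per GB).

```python
# Back-of-the-envelope check of the factory example above.
data_per_day_bytes = 4e15          # 4 petabytes generated per day
link_speed_bps = 10e9              # 10 Gbps, the (optimistic) peak 5G rate
cost_per_gb_usd = 0.40             # assumed 5G transfer price

transfer_seconds = data_per_day_bytes * 8 / link_speed_bps
transfer_days = transfer_seconds / 86_400
cost_usd = (data_per_day_bytes / 1e9) * cost_per_gb_usd

print(f"Time to upload one day of data: {transfer_days:.0f} days")  # ~37 days
print(f"Cost to upload one day of data: ${cost_usd:,.0f}")          # ~$1,600,000
```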
Investing in edge AI
Nearly every industry is now impacted by IoT technology, and edge computing advancements represent a roughly $30 billion market. The AI chip industry alone is predicted to grow to more than $91 billion by 2025, up from $6 billion in 2018. Companies are racing to create the fastest, most efficient chips on the market, and only those operating with the highest levels of market and customer focus will see success.
As companies are increasingly faced with decisions about investing in new hardware for edge computing, staying nimble is key to a successful strategy. Given the rapid pace of innovation in the hardware landscape, companies seek to make decisions that provide both short-term flexibility, such as the ability to deploy many different types of machine learning models on a given chip, and long-term flexibility, such as the ability to future-proof by easily switching between hardware types as they become available. Such strategies typically include a mix of highly specialized processors and more general-purpose processors like GPUs, software- and hardware-based edge computing to leverage the flexibility of software, and a combination of edge and cloud deployments to gain the benefits of both computing strategies.
One startup setting out to simplify this choice across short- and long-term horizons and compute- and power-constrained environments, by getting an entirely new processor architecture off the ground, is Quadric. Quadric is a licensable processor intellectual property (IP) company commercializing a fully programmable architecture for on-device ML inference. The company built a cutting-edge processor instruction set with a highly parallel architecture that efficiently executes both machine learning “graph code” and conventional C/C++ signal processing code, providing fast and efficient processing of complex algorithms. Only one tool chain is required for scalar, vector, and matrix computations, which are modelessly intermixed and executed on a single pipeline. Memory bandwidth is optimized by a single unified compilation stack, which helps deliver significant power savings.
Quadric takes a software-first approach to its edge AI chips, creating an architecture that controls data flow and enables all software and AI processing to run on a single programmable core. This eliminates the need for other ancillary processing and software elements and blends the best of current processing methods to create a single, optimized general purpose neural processing unit (GPNPU).
The company recently announced its new Chimera™ GPNPU, a licensable IP processor core for advanced custom silicon chips used in a vast array of AI and ML applications. It is specifically tailored to accelerate neural network-based computations and is intended to be integrated into a variety of systems, including embedded devices, edge devices, and data center servers. The Chimera GPNPU is built on a scalable, modular architecture that allows the performance level to be customized to meet the specific needs of different applications.
One of the key features of the Chimera GPNPU is its support for high-precision arithmetic in addition to the conventional 8-bit precision integer support offered by most NPUs. It is capable of performing calculations with up to 16-bit precision, which is essential for ensuring the accuracy and reliability of neural network-based computations, as well as performing many DSP computations. The Chimera GPNPU supports a wide range of neural network architectures, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory (LSTM) networks. As a fully C++ programmable architecture, a Chimera GPNPU can run any machine learning algorithm with any machine learning operator, offering the ultimate in flexible high-performance futureproofing.
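As a rough illustration of why the extra precision matters (a generic uniform-quantization comparison, not Quadric’s actual number formats), the sketch below quantizes the same weights at 8-bit and 16-bit integer precision and compares the worst-case reconstruction error.

```python
import numpy as np

def quantize(values: np.ndarray, bits: int) -> np.ndarray:
    """Uniformly quantize to signed integers of the given width, then dequantize."""
    max_code = 2 ** (bits - 1) - 1
    scale = np.abs(values).max() / max_code
    codes = np.clip(np.round(values / scale), -max_code, max_code)
    return codes * scale

weights = np.random.randn(4096).astype(np.float32)

for bits in (8, 16):
    error = np.abs(weights - quantize(weights, bits)).max()
    print(f"int{bits}: worst-case reconstruction error = {error:.6f}")
# 16-bit codes offer 256x finer resolution, so the error is roughly 256x smaller.
```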