In this article, Alastair Lowe provides an overview of the background and technical requirements of AI accelerators – data processors that are specifically designed to perform hardware-accelerated AI tasks.
Artificial Intelligence (AI) and Machine Learning (ML) are booming technology areas that already impact our daily lives in many ways. For example, AI is already used in diverse applications such as facial and number plate recognition software, self-organising communication networks, speech recognition, navigation, self-driving vehicles, and even visual artistry and music. Increasing AI workloads in these and other areas have led to a huge increase in interest in hardware acceleration of AI-related tasks.
Processors that are specifically designed to perform AI-related tasks are not new. Back in 1989, for instance, Intel launched a new analog processor, the ETANN 80170NX, that was designed to implement a neural network. Later, all-digital and hybrid analog/digital processors were also used for AI and ML tasks. In the case of all-digital designs, general-purpose digital signal processors, Central Processing Units (CPUs) and even Graphics Processing Units (GPUs) have all been used to perform AI calculations.
Modern AI tasks, in particular those that implement Neural Networks such as Convolutional Neural Networks (CNNs) and Deep Neural Networks (DNNs), rely on large numbers of calculations performed in parallel. Processing throughput and efficiency can therefore be improved by exploiting this parallelism. GPUs in particular, which are used in PCs to render 3D scenes for gaming and 3D modelling, can also be well-suited to some AI tasks, because GPUs are designed to perform a large number of complex calculations in parallel, more quickly than a typical CPU. In recent years, however, the cost of advanced GPUs has been high, their availability has been constrained, and they tend to have very high power consumption. These issues mean that GPUs are often not well-suited to certain scenarios, such as mobile or data-centre oriented applications. Furthermore, because GPUs are not optimised specifically for AI tasks, their speed and efficiency may fall short of what a dedicated design could achieve. Alternatives such as CPUs can be equally sub-optimal: their general-purpose nature means they are often neither efficient nor fast at AI tasks.
Hence, there is currently an increased interest in processors that are efficient or fast, or both, when carrying out AI tasks. Processors designed specifically for efficient AI or ML processing are often referred to as AI accelerators. In certain AI tasks, however, such as those based on Neural Networks, there is a two-stage process in which each stage has quite different processing requirements. The two stages are called training and inference.
In a typical Neural Network, the training stage uses training data to train an AI algorithm to produce a trained AI model. The AI model can then be deployed to perform inference on new data and to provide classification or predictions.
Training an AI model can be a computationally intensive process. In one example, an item of training data (such as an image of a particular object) is provided to a deep neural network, and the network produces an output indicating whether or not it recognises the object. If the network gives the wrong answer, e.g. it fails to recognise an object it is being trained to recognise, the error is fed back to the network and a large number of weights may be updated. This guess-and-correct cycle is repeated across many different training examples until the network's output is reliable enough to be put to use.
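The guess-and-correct cycle described above can be sketched in miniature. The following toy example (entirely illustrative, not taken from any real accelerator workload) trains a single artificial neuron by gradient descent to learn the logical OR function: a forward pass makes a "guess", the error is fed back, and the weights are nudged accordingly:

```python
import math
import random

random.seed(0)

# Toy training set: learn the logical OR of two binary inputs.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

w = [random.uniform(-1, 1) for _ in range(2)]  # two weights, randomly initialised
b = 0.0                                        # bias
lr = 0.5                                       # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for epoch in range(1000):                      # many passes over the training set
    for (x1, x2), target in data:
        y = sigmoid(w[0] * x1 + w[1] * x2 + b) # forward pass: the network's "guess"
        error = y - target                     # how wrong the guess was
        # Feed the error back: nudge each weight to reduce it (gradient descent).
        w[0] -= lr * error * x1
        w[1] -= lr * error * x2
        b    -= lr * error

# After training, the rounded outputs match the OR truth table.
preds = [round(sigmoid(w[0] * x1 + w[1] * x2 + b)) for (x1, x2), _ in data]
print(preds)  # → [0, 1, 1, 1]
```

A real deep network applies the same principle at vastly greater scale: millions or billions of weights updated in parallel, which is precisely the workload that training accelerators target.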
A typical deep neural network might have billions of items of training data, and the model may be trained on the entire training dataset tens or even hundreds of times. The larger the training dataset, the more accurate the resulting model may be, at the expense of increased training time. Therefore, there may be an enormous number of parallel calculations and repetitions to perform during a typical training process. In one particular example, training a Chinese speech recognition model required four terabytes of training data and 20 exaflops of computing power. Such capability is unlikely to be available in many scenarios due to the large capital expenditure required. As a result, training of AI models is often restricted to data-centre platforms such as Amazon Web Services (AWS) and Microsoft's Azure, which can provide the required computational power. Other players in the training hardware space include Google with their Tensor Processing Units (TPUs), Nvidia with its A100 Tensor Cores and, more recently, ARM with its Neoverse cores.
Once the AI model has been trained, though, the trained model may be used many times over for inference on real-world data. AI inference has far lower processing requirements and can therefore be employed in a wider range of scenarios. In a mobile phone, for example, AI inference using a trained model can provide voice recognition for a digital assistant, or locally process photos to recognise faces and other objects. The reduced processing requirements also lead to different hardware requirements. Whereas processors for training may include networking functionality so that many can work together, processors for inference may not need such hardware, as they are typically used individually.
Precision of calculations is also an important factor in training and inference, though perhaps in a counter-intuitive manner. Often, low-precision calculations are acceptable, and so an AI processor may have explicit support for such calculations in order to increase speed or efficiency. In one example, 16-bit floating-point values in the bfloat16 format may be used, with 8 bits for the exponent and 8 bits of significand precision (counting the implicit leading bit), in contrast with the more conventional 32-bit floating-point representation, which uses 8 bits for the exponent and a much higher 24 bits for the significand. Thus, a considerable amount of precision can be sacrificed for speed and efficiency, with little effect on the efficacy of a model's output. This can be the case both for training, where huge numbers of calculations may be performed, and for inference, where, for example in an edge scenario, there may be strict constraints on processing power and efficiency. There is also often an emphasis on supporting low-precision integer arithmetic, such as INT8 or even INT4, for further efficiency at the cost of even lower precision.
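Because bfloat16 shares its exponent width with float32, a bfloat16 value is in effect just the top 16 bits of the corresponding 32-bit pattern, which is one reason conversion is cheap in hardware. A minimal sketch of that relationship, using simple truncation (real hardware may instead round to nearest):

```python
import struct

def float32_to_bfloat16_bits(x: float) -> int:
    """Reinterpret x as its 32-bit IEEE-754 pattern, then keep the top 16 bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16  # drops the low 16 mantissa bits (truncation)

def bfloat16_to_float32(bits: int) -> float:
    """Pad the 16-bit bfloat16 pattern back to 32 bits, low mantissa bits zeroed."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

x = 3.14159265
b = float32_to_bfloat16_bits(x)
print(bfloat16_to_float32(b))  # → 3.140625
```

Round-tripping pi through bfloat16 loses the fine mantissa detail (3.14159… becomes 3.140625) while the dynamic range, set by the unchanged 8-bit exponent, is fully preserved; this is exactly the trade-off the article describes.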
AI training processors and inference processors can thus be quite different, even if fundamentally they perform similar calculations, such as the tensor operations at the heart of Neural Networks. As the processing requirements for inference processors tend to be much lower than those for training, they may also be much lower in cost. Hence, there are many more players in the inference hardware space providing cores or ASICs for accelerated deployments. These include Intel, Google, Nvidia and ARM, as well as Cerebras Systems, Graphcore, and others. Xilinx even offers AI engines for FPGAs. In addition, general-purpose chip designers may include AI acceleration functionality in their latest designs; for example, Qualcomm's Snapdragon mobile processors include a Qualcomm AI Engine.
Clearly, then, there is considerable interest in the AI and ML space, and thus considerable interest in providing hardware for accelerating these burgeoning technologies. As AI is expected to become even more prevalent in many areas of technology, this interest is only expected to increase.
In his next article, Alastair will explore specific hardware features of example processors for both training and inference.