As a social networking giant, Facebook serves 2.7 billion users per month across a range of applications and services. With the AI boom of recent years, Facebook has begun replacing much of the general-purpose computing hardware in its data centers with dedicated hardware that offers better performance, power consumption, and efficiency.
Yesterday (March 14, US time), Facebook unveiled Zion, its "next-generation" hardware platform for AI model training. It also introduced custom ASIC chips for two other kinds of computation: Kings Canyon for AI inference and Mount Shasta for video transcoding. These new designs target AI training, AI inference, and video transcoding, workloads that are not only heavy enough to justify dedicated hardware but also among Facebook's fastest-growing services.
From Today's AI Hardware to Next-Generation AI Hardware
Facebook has long deployed large-scale AI models across its business, making more than 100 trillion predictions and more than 6 billion language translations per day. Facebook's image recognition models, which identify and classify content, are trained on more than 3.5 billion images. These AI-powered services help users communicate better day to day, while also giving them a unique, personalized experience.
Facebook's in-house AI platform, FBLearner, manages most of Facebook's current AI model pipelines. FBLearner includes tools for feature storage, training workflow management, inference engine management, and other parts of the problem. In addition, Facebook designs its own hardware based on the Open Compute Project (OCP); used together with FBLearner, it lets Facebook developers quickly deploy models at scale.
Having addressed the pressing problem of computing scale, Facebook has continued its research and development, with the ultimate goal of a future-proof, robust hardware design that is not only vendor-agnostic but also disaggregated, maximizing Facebook's operational efficiency. Facebook's answer is its next-generation training and inference hardware platform, which Leiphone (WeChat public account: Leiphone) AI Technology Review briefly introduces below.
AI Training with Zion
Zion is Facebook's next-generation large-capacity unified training platform, designed to efficiently shoulder ever-larger computing loads. Zion's design considers how to efficiently handle different neural network models such as CNNs, LSTMs, and sparse neural networks. The platform provides high-speed internal interconnects with high memory capacity, high bandwidth, and flexibility, delivering powerful compute for Facebook's key internal workloads.
Zion's design uses Facebook's new vendor-agnostic OCP Accelerator Module (OAM). With OAM, Facebook can buy hardware from many different vendors, such as AMD, Habana, Graphcore, Intel, and Nvidia. As long as vendors build their hardware to the Open Compute Project (OCP) standard, OAM not only helps them innovate faster, but also lets Facebook freely mix different hardware platforms and different servers in the same rack; scaling across servers requires only a single rack-level network switch. Even as Facebook's AI training loads continue to grow and become more complex, the Zion platform can scale to handle them.
Specifically, Facebook's Zion system consists of three parts: an eight-socket CPU server, the OCP accelerator modules, and a platform motherboard that holds eight OCP accelerator modules.
Left: a modular server motherboard, each holding two CPUs; right: four motherboards with eight CPUs form an eight-socket server
Left: an OCP accelerator module; middle: eight OCP accelerator modules mounted on a platform motherboard; right: the assembled platform with eight accelerator chips
Diagram of module connections in the Zion platform
The Zion platform's design decouples the system's memory, compute, and network components so that each can be scaled independently. The eight-socket CPU platform provides a very large DDR memory pool to serve workloads with high memory-capacity requirements, such as the embedding tables of sparse neural networks. Denser CNNs, or the parts of sparse networks that are more sensitive to bandwidth and compute, are accelerated mainly by the OCP accelerator modules attached to each CPU.
The system has two classes of high-speed interconnect: one connects all CPUs to each other, and the other connects all accelerators to each other. Because accelerators have high memory bandwidth but low memory capacity, Facebook's engineers use the total memory capacity efficiently by partitioning the model and its memory: more frequently accessed data is kept in accelerator memory, while less frequently accessed data stays in the CPUs' DDR memory. Computation and communication across all CPUs and accelerators are balanced and executed efficiently over the high- and low-speed interconnects.
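The partitioning idea can be sketched as a simple greedy placement policy: put the hottest tables in scarce accelerator memory and spill the rest to the CPUs' DDR pool. This is an illustrative sketch only, not Facebook's implementation; the table names, sizes, and access rates are invented for the example.

```python
# Sketch of frequency-based placement of embedding tables across
# accelerator memory (fast, small) and CPU DDR (slower, large).
# All table names and numbers below are invented for illustration.

def place_tables(tables, accel_capacity_gb):
    """tables: list of (name, size_gb, accesses_per_sec) tuples."""
    # Hottest tables first, so accelerator memory serves the most traffic.
    ranked = sorted(tables, key=lambda t: t[2], reverse=True)
    placement, used = {}, 0.0
    for name, size, _ in ranked:
        if used + size <= accel_capacity_gb:
            placement[name] = "accelerator_mem"
            used += size
        else:
            placement[name] = "cpu_ddr"   # spill colder tables to DDR
    return placement

tables = [
    ("user_ids",  24.0, 9e6),  # very hot, large
    ("ad_ids",     8.0, 5e6),  # hot
    ("page_ids",  40.0, 2e5),  # cold, very large
    ("topic_ids",  2.0, 1e6),  # warm, small
]
print(place_tables(tables, accel_capacity_gb=32.0))
```

A real system would also account for bandwidth contention and repartition as access patterns shift, but the capacity-versus-heat trade-off is the core of the idea.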
AI Inference with Kings Canyon
Alongside the growing AI training load, the AI inference load is also increasing rapidly. For its next-generation design, Facebook is working with Esperanto, Habana, Intel, Marvell, Qualcomm, and other companies to develop dedicated ASIC chips that are easy to scale and deploy. Kings Canyon chips support both INT8 (8-bit integer) computation, which favors inference speed, and FP16 (half-precision floating point) computation, which favors higher accuracy.
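The speed-versus-accuracy trade-off between INT8 and FP16 comes from quantization: weights and activations are mapped to 8-bit integers, which is faster and cheaper but loses precision. The following minimal sketch shows symmetric INT8 quantization; it is illustrative only (real inference stacks calibrate scales per tensor or per channel, which Kings Canyon's actual pipeline is not described here).

```python
# Minimal sketch of symmetric INT8 quantization, the kind of precision
# trade-off an INT8 inference mode exploits. Illustrative only.

def quantize_int8(values):
    """Map floats to int8 range [-127, 127] with one symmetric scale."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.02, -0.51, 0.33, 1.27, -1.27]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
# Each recovered value differs from the original by at most scale/2,
# so a smaller dynamic range means a smaller quantization error.
print(q, [round(a, 3) for a in approx])
```

FP16 keeps far more precision per value, which is why it is the choice when accuracy matters more than throughput.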
Kings Canyon chips are mounted on M.2 boards; six Kings Canyon modules are installed on each Glacier Point v2 carrier card; and finally, two Glacier Point v2 carrier cards plus two single-socket servers form a complete Yosemite server.
Facebook's video transcoding ASIC, Mount Shasta, uses the same arrangement.
Judging from the illustrations and descriptions Facebook has provided, only the Zion AI training platform appears to be in production use; physical hardware for the Kings Canyon inference chip, the Mount Shasta video transcoding chip, and related systems has not yet been shown. But Facebook is confident in the designs. It plans to publish all the designs and specifications through OCP to enable broader collaboration, and it will continue working with its current partners to improve the systems' software and hardware.
For more details, see Facebook's official introduction: https://code.fb.com/data-center-engineering/accelerating-infrastructure/