Yesterday, Facebook announced on its code.fb.com website Zion, its next-generation hardware platform for AI training; Kings Canyon, a new custom chip for AI inference; and Mount Shasta, a custom ASIC for video transcoding.
According to the announcement, Facebook's infrastructure serves more than 2.7 billion people every month across its family of apps and services. Engineers have designed and built efficient systems to scale this infrastructure, but as workloads grow, general-purpose processors alone can no longer keep up.
Building efficient infrastructure requires co-designing hardware around the workloads it will run. To this end, Facebook has been working with partners to develop solutions for AI inference, AI training, and video transcoding. With transistor growth slowing dramatically, dedicated accelerators and whole-system-level solutions are needed to improve performance, power, and efficiency.
AI workloads run throughout Facebook's infrastructure, making its services more relevant and improving the user experience: they help people interact with each other daily and provide unique, personalized services. By deploying AI models at scale, Facebook serves 200 trillion predictions and over 6 billion language translations per day, and uses more than 3.5 billion public images to build and train AI models that better identify and tag content.
Most AI pipelines at Facebook are managed through the FBLearner platform, which includes tools for each part of the problem, such as feature storage, training workflow management, and an inference engine. Pairing FBLearner with hardware designs Facebook has released through the Open Compute Project (OCP) lets it deploy models efficiently at scale. Starting from this stable foundation, Facebook focused on creating vendor-agnostic, unified hardware designs, adhering to the principle of disaggregated design to maximize efficiency, and ultimately introduced its next generation of hardware for training and inference workloads.
AI Training System Zion
Zion is Facebook's next-generation large-memory unified training platform, designed to efficiently handle a range of neural networks including CNNs, LSTMs, and SparseNN. The Zion platform provides high-capacity, high-bandwidth memory, flexible high-speed interconnects, and powerful compute for key workloads.
Zion uses Facebook's new OCP Accelerator Module (OAM), and partners such as AMD, Habana, Graphcore, and NVIDIA can develop their own solutions against the common OCP specification. Zion's architecture uses top-of-rack switches to scale from a single platform to multiple servers within a rack, so the platform can grow as the scale and complexity of Facebook's AI training increases.
Zion system is divided into three parts:
8-socket server
8-accelerator platform
OCP Accelerator Module
Zion decouples the system's memory, compute, and network-intensive components, allowing each to scale independently. The system provides eight NUMA CPU sockets and a large DDR memory pool for memory-capacity-intensive components such as the embedding tables of SparseNN. For memory-bandwidth-intensive and compute-intensive workloads, such as CNNs or the dense parts of SparseNN, each CPU socket connects to an OCP accelerator module.
The Zion system has two high-speed fabrics: a coherent fabric connecting all CPUs and a fabric connecting all accelerators. Because accelerators have high memory bandwidth but low memory capacity, Facebook makes effective use of the aggregate memory capacity by partitioning the model so that more frequently accessed data resides on the accelerators and less frequently accessed data resides in DDR memory with the CPUs. Computation and communication are balanced across all CPUs and accelerators, which are connected efficiently over both fabrics.
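This hot/cold placement can be sketched as a greedy assignment of embedding tables to accelerator memory. The sketch below is illustrative only: the table names, access counts, sizes, and capacity are invented, not Facebook's, and it assumes per-table access statistics are already known.

```python
# Sketch: place frequently accessed embedding tables in accelerator
# memory (small but fast) and the rest in host DDR (large but slower).
# All names and numbers here are illustrative.

def partition_tables(access_counts, hbm_capacity_gb):
    """Greedily assign the hottest tables to accelerator memory.

    access_counts:   dict of table name -> (accesses, size_in_gb)
    hbm_capacity_gb: accelerator memory budget in GB
    Returns (on_accelerator, on_host_ddr) name lists.
    """
    # Consider the most frequently accessed tables first.
    ranked = sorted(access_counts.items(),
                    key=lambda kv: kv[1][0], reverse=True)
    on_accel, on_host, used = [], [], 0.0
    for name, (_, size_gb) in ranked:
        if used + size_gb <= hbm_capacity_gb:
            on_accel.append(name)
            used += size_gb
        else:
            on_host.append(name)  # cold (or too large): stays in DDR
    return on_accel, on_host

tables = {"user_ids": (9_000_000, 12.0),
          "page_ids": (4_000_000, 20.0),
          "ad_ids":   (500_000, 6.0)}
accel, host = partition_tables(tables, hbm_capacity_gb=16.0)
```

With these invented numbers, only the hottest table fits in the 16 GB accelerator budget; the rest fall back to host DDR.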
Inference with Kings Canyon
After a model is trained, it must be deployed to production to process live data and respond to user requests, a step called inference. Inference workloads are increasing dramatically, mirroring the growth in training, and the standard CPU servers currently in use cannot scale to meet the demand.
Facebook is working with partners such as Esperanto, Intel, Marvell, and Qualcomm to develop inference ASICs that can be deployed and scaled across its infrastructure. These chips provide INT8 low-precision arithmetic for the bulk of the workload to achieve the desired performance, while also supporting FP16 half-precision arithmetic where higher accuracy is needed.
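A minimal sketch of the kind of low-precision arithmetic involved, here symmetric per-tensor INT8 quantization. The scale rule and values are illustrative and not tied to any particular chip:

```python
# Sketch of symmetric INT8 quantization: floats are mapped to the
# int8 range [-128, 127] via a single per-tensor scale, traded off
# against FP16/FP32 where more accuracy is needed.

def quantize_int8(values):
    """Map floats to int8 with a per-tensor scale; return (ints, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from int8 values and the scale."""
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.02, 1.0]     # illustrative weights
q, s = quantize_int8(weights)
approx = dequantize(q, s)
```

Values near the maximum magnitude round-trip almost exactly; very small values lose relatively more precision, which is the usual INT8 trade-off.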
The complete inference server solution is divided into four parts, all of which take advantage of existing building blocks already published to OCP. Reusing existing components speeds up development and reduces risk through commonality. The four main components of the design are:
Kings Canyon inference M.2 module
Twin Lakes Single-socket Server
Glacier Point V2 carrier card
Yosemite V2 rack
At the system level, each server consists of M.2 Kings Canyon accelerators and Glacier Point V2 carrier cards connected to Twin Lakes servers. Two sets of these components are installed in an updated Yosemite V2 chassis and connected to the top-of-rack switch through a multi-host NIC. The updated Yosemite sled is an iteration of the Yosemite V2 sled that routes additional PCIe lanes from the Twin Lakes hosts to the NIC for higher network bandwidth. Each Kings Canyon module contains an ASIC, associated memory, and other supporting components; the CPU host communicates with the accelerator modules over PCIe. Glacier Point V2 includes an integrated PCIe switch that lets the server access all modules simultaneously.
Deep learning models are memory-intensive workloads. The SparseNN model, for example, has very large embedding tables that occupy many gigabytes of memory and may continue to grow. Such a large model may not fit into the memory of a single device, whether CPU or accelerator, which requires partitioning the model across the memory of multiple devices. Partitioning incurs substantial communication costs when data resides in another device's memory; a good partitioning algorithm captures locality to keep those costs down.
With proper model partitioning, large deep learning models such as SparseNN can be run. If a single node's memory capacity is insufficient for a given model, the model can be split across two nodes, increasing the memory available to it. The two nodes are connected by a multi-host NIC supporting high-speed communication. Splitting increases overall communication cost, but because access frequencies differ widely across the many embedding tables, the tables can be ranked accordingly to keep communication latency down.
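One way to make such a split concrete is to rank tables by access frequency and greedily balance them across the nodes. This is a hypothetical illustration of the idea, not Facebook's actual partitioning algorithm; the table names and hit counts are invented.

```python
# Sketch: split embedding tables across nodes, hottest first, always
# assigning to the node with the least accumulated load so far.
# Purely illustrative; not Facebook's actual algorithm.

def split_across_nodes(table_hits, n_nodes=2):
    """Greedy load-balanced assignment of tables to nodes.

    table_hits: dict of table name -> access count
    Returns a list of per-node table-name lists.
    """
    nodes = [[] for _ in range(n_nodes)]
    load = [0] * n_nodes
    for name, hits in sorted(table_hits.items(),
                             key=lambda kv: kv[1], reverse=True):
        i = load.index(min(load))  # least-loaded node gets the table
        nodes[i].append(name)
        load[i] += hits
    return nodes

nodes = split_across_nodes({"user": 10, "page": 7, "ad": 5, "group": 3})
```

Assigning hottest-first keeps the expected lookup traffic to each node roughly even, which is what reduces the latency penalty of going off-node.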
Neural Network Hardware Accelerator Compiler
ASICs do not run general-purpose code; they need specialized compilers to convert computation graphs into instructions that execute on the accelerator. The goal of the Glow compiler is to abstract vendor-specific hardware away from the higher-level software stack so that the infrastructure is not tied to any one vendor. It accepts computation graphs from frameworks such as PyTorch 1.0 and generates highly optimized code for these machine learning accelerators.
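Conceptually, such a compiler lowers a framework-level graph into a flat sequence of instructions a backend can map to accelerator primitives. The toy below illustrates only that idea; the graph format and pseudo-instructions are invented and are not Glow's actual IR.

```python
# Toy illustration of graph lowering: each node of a small computation
# graph (op, input_a, input_b, output) becomes one pseudo-instruction.
# The IR here is invented for illustration, not Glow's real IR.

graph = [("matmul", "x", "w", "t0"),   # t0 = x @ w
         ("add", "t0", "b", "t1"),     # t1 = t0 + b
         ("relu", "t1", None, "y")]    # y = relu(t1)

def lower(graph):
    """Emit one pseudo-instruction string per graph node, in order."""
    instrs = []
    for op, a, b, out in graph:
        operands = ", ".join(x for x in (a, b) if x)
        instrs.append(f"{out} = {op.upper()}({operands})")
    return instrs

program = lower(graph)
```

A real backend would additionally schedule, fuse, and allocate memory for these instructions, which is where most of the vendor-specific optimization happens.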
Video transcoding using Mount Shasta
Since 2016, the average number of Facebook live broadcasts has doubled every year. Since its global launch in August 2018, Facebook Watch has exceeded 400 million monthly viewers, with 75 million people using it daily. To optimize all of this video for a variety of network conditions, Facebook generates output at several resolutions and bit rates, a process called video transcoding.
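The set of outputs is often called an encoding "ladder". The sketch below shows the idea; the resolution/bit-rate rungs are illustrative values, not Facebook's actual ladder.

```python
# Sketch of a transcoding ladder: each upload is encoded at several
# resolutions and bit rates so playback can adapt to the viewer's
# connection. Rung values are illustrative only.

LADDER = [(1080, 4500),   # (height in pixels, bit rate in kbps)
          (720, 2500),
          (480, 1200),
          (360, 700)]

def rungs_for(source_height, ladder=LADDER):
    """Only downscale: keep rungs at or below the source resolution."""
    return [(h, kbps) for h, kbps in ladder if h <= source_height]

ladder_720 = rungs_for(720)
```

A 720p upload would be transcoded to the 720p rung and below; upscaling past the source resolution would waste bits without adding detail.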
The computation required for transcoding is highly intensive, and general-purpose processors are no longer efficient enough to keep up with growing video demand. To get ahead of that demand, Facebook worked with Broadcom and VeriSilicon to design custom ASICs optimized for the transcoding workload.
The video transcoding process breaks down into many distinct steps. To improve efficiency, Facebook and its vendors created custom ASIC blocks for each stage of the transcoding process. Completing these workloads on dedicated hardware makes the process more efficient and enables new features such as real-time 4K 60fps streaming. Video codecs are standardized and change infrequently, so the inflexibility of a custom chip is not a significant drawback in this case.
The first stage of transcoding is decoding, in which the uploaded file is decompressed to obtain the raw video data, represented as a series of images. These uncompressed images are then scaled to change their resolution, re-encoded with optimized settings to compress them back into a video stream, and finally the output video is compared against the original to compute quality metrics.
Every video takes this path, ensuring that the chosen encoding settings produce high-quality output. The standards used to encode and decode video are called video codecs; H.264, VP9, and AV1 are the mainstream codecs in use today.
On the ASIC the steps are the same, except that each software algorithm is replaced by a dedicated block on the chip. Facebook expects the video accelerators to support multiple resolutions and encoding formats and to be many times more efficient than current servers, with a goal of processing at least two parallel 4K 60fps input streams within 10 W of power.
Video transcoding ASIC usually has the following main logic blocks:
Decoder: receives the uploaded video and outputs a decompressed raw video stream
Scaler: changes the video resolution
Encoder: outputs compressed (encoded) video
Quality detection: computes the quality of the encoded video
PHY: the chip's interface to the outside world, connecting to the server's PCIe and memory channels
Controller: a general-purpose block that runs firmware and coordinates the transcoding pipeline
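The data-path blocks above can be sketched as a pipeline of stages, each function standing in for a dedicated hardware block. This is a toy model: frames are plain lists of numbers here, whereas the real blocks operate on compressed bitstreams.

```python
# Toy transcoding pipeline mirroring the ASIC's logic blocks:
# decode -> scale -> encode -> quality check. Purely illustrative.

def decode(upload):
    """Decoder block: 'decompress' the upload into raw frames."""
    return list(upload)

def scale(frames, factor):
    """Scaler block: reduce resolution by keeping every Nth sample."""
    return [f[::factor] for f in frames]

def encode(frames):
    """Encoder block: re-'compress' (stub: freeze each frame)."""
    return [tuple(f) for f in frames]

def quality(original, output):
    """Quality block: crude score, fraction of samples retained."""
    kept = sum(len(o) for o in output)
    total = sum(len(f) for f in original)
    return kept / total

frames = decode([[1, 2, 3, 4], [5, 6, 7, 8]])  # two tiny "frames"
small = scale(frames, 2)
out = encode(small)
score = quality(frames, out)
```

The controller block's job corresponds to sequencing these calls and moving data between stages; the PHY corresponds to how `frames` would arrive over PCIe in the first place.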
As with inference, Facebook deploys these video transcoding ASICs in its data centers using existing OCP building blocks. The accelerators are mounted on M.2 modules with integrated heat sinks, a common form factor usable across different hardware platforms. The modules are installed on the Glacier Point V2 (GPv2) carrier card, which has the same physical form factor as the Twin Lakes server, holds multiple M.2 modules, fits the Yosemite V2 chassis, and is paired with a Twin Lakes server.
Because the video transcoding ASICs draw little power and are physically small, Facebook aims to save costs by attaching as many chips as possible to a single server. The high-density GPv2 achieves this while providing enough cooling headroom for data-center operating temperatures.
Once software integration is complete, Facebook can balance video transcoding workloads across heterogeneous hardware in different data center locations. To scale its collaborations with vendors across the machine learning and video spaces, it is also working to ensure that software is developed in open formats and that common interfaces and frameworks are promoted and adopted.
Facebook said in its post that it is heading into an exciting future and expects Zion, Kings Canyon, and Mount Shasta to address its growing AI training, AI inference, and video transcoding workloads, respectively. Facebook will make all designs and specifications publicly available through OCP, welcomes other companies to join in to accelerate infrastructure building, and will continue improving these systems through hardware-software co-design.