
The domestic AI framework evolves again: Baidu releases Paddle Lite

via: 博客园 · time: 2019/8/21 15:28:26 · reads: 89

Qianming, reporting from Aofeisi

QbitAI report | Public account QbitAI

The domestic AI framework PaddlePaddle (飞桨) has just taken another evolutionary step: Paddle Lite is officially released!

Highly extensible, high-performance, and lightweight, it is also the first deep-learning on-device inference framework to support online compilation for Huawei's NPU, signaling an intensified push into mobile scenarios.

Given the current environment, progress on a self-developed foundational framework like this carries ever more weight, and both its completeness and its capability are commendable. One of the core highlights of Paddle Lite is its support for a broader range of heterogeneous AI hardware.

With this upgrade, Paddle Lite's architecture has been significantly reworked, with more complete support for multiple hardware targets, multiple platforms, and hybrid hardware scheduling. It covers not only mobile chips such as ARM CPUs, Mali GPUs, Adreno GPUs, and Huawei's NPU, but also hardware such as FPGAs, and is designed to be compatible with mainstream cloud chips.

Among them, Paddle Lite has become the first deep learning inference framework to support online compilation for Huawei's NPU. Earlier, Baidu and Huawei had announced a close partnership at Baidu's AI Developer Conference.

It is worth noting that, measured against Google's TensorFlow Lite, the upgraded Paddle Lite directly targets the former's shortcomings.

According to the official announcement, it not only supports a wider range of AI hardware terminals, but also improves deployment versatility and shows clear performance advantages. Competition among AI frameworks has intensified and entered a new phase.

What is Paddle Lite?

Paddle Lite, an evolution of Paddle Mobile, is an inference engine for high-performance, lightweight deployment.

Its core purpose is to quickly deploy trained models on different hardware platforms, run inference on input data to produce results, and support real business applications. In putting AI technology into practice, the inference stage connects directly to the application and to the user experience, making it a particularly challenging part.

What makes it harder still is that the hardware running inference is increasingly heterogeneous: cloud, mobile, and edge each correspond to different hardware, and the underlying chip architectures vary greatly.

How can a framework fully support such a wide range of hardware architectures while optimizing AI application performance on each of them? Paddle Lite's answer is:

Through a new architecture that models the underlying computation with high extensibility and flexibility, it strengthens the ability to schedule mixed execution across multiple hardware targets, quantization methods, and data layouts, thereby ensuring broad hardware support and deep low-level optimization, and delivering leading performance for model applications.
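The idea of mixed scheduling across hardware, precision, and data layout can be illustrated with a small sketch. This is purely conceptual, not Paddle Lite's actual API: all names (`KERNELS`, `pick_kernel`, the target strings) are hypothetical, and the point is only that kernels are looked up by an (op, target, precision, layout) key with a fallback chain.

```python
# Conceptual sketch of kernel hybrid scheduling (illustrative names only,
# NOT Paddle Lite's real implementation): each kernel is registered under
# an (op, target, precision, layout) key, and the scheduler picks the best
# match per op, falling back to a default target when none is available.
KERNELS = {
    ("conv2d", "arm_cpu", "int8", "nchw"): "conv2d_arm_int8",
    ("conv2d", "opencl", "fp32", "nhwc"): "conv2d_opencl_fp32",
    ("softmax", "arm_cpu", "fp32", "nchw"): "softmax_arm_fp32",
}

def pick_kernel(op, preferred, fallback=("arm_cpu",)):
    """Try the preferred (target, precision, layout) first, then fall back."""
    candidates = (preferred,) + tuple((t, "fp32", "nchw") for t in fallback)
    for target, precision, layout in candidates:
        key = (op, target, precision, layout)
        if key in KERNELS:
            return KERNELS[key]
    raise LookupError(f"no kernel registered for {op}")

# Mixed execution within one model: conv2d runs as INT8 on the CPU, while
# softmax (no OpenCL kernel registered here) falls back to FP32 on the CPU.
print(pick_kernel("conv2d", ("arm_cpu", "int8", "nchw")))   # conv2d_arm_int8
print(pick_kernel("softmax", ("opencl", "fp32", "nhwc")))   # softmax_arm_fp32
```

The design choice the sketch highlights is that target, precision, and layout are part of the kernel's identity, so one model can legally mix all three.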

Five characteristics of Paddle Lite

According to the official introduction, Paddle Lite has five characteristics: high extensibility, seamless integration of training and inference, generality, high performance, and light weight.

1. High extensibility.

The new architecture is better at abstracting and describing hardware, making it easy to integrate new hardware under a single framework; extending support to FPGAs, for example, is straightforward.

In addition, drawing on LLVM's type system and an MIR (Machine IR), hardware and models can be analyzed and optimized at a finer granularity, making it easier and more efficient to extend optimization strategies.

At present, Paddle Lite supports 21 Pass optimization strategies, covering hybrid scheduling of hardware compute modes, INT8 quantization, operator fusion, redundant-computation pruning, and other optimizations.
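To make the idea of a graph-optimization pass concrete, here is a minimal sketch of one such transformation, operator fusion, on a toy graph represented as a flat list of op names. The function name and op names are illustrative assumptions, not Paddle Lite's real pass code.

```python
# Conceptual sketch of one optimization pass: fusing a conv2d immediately
# followed by batch_norm into a single fused op, as inference frameworks
# commonly do before deployment (illustrative only, not Paddle Lite code).
def fuse_conv_bn(ops):
    fused, i = [], 0
    while i < len(ops):
        # When a conv2d/batch_norm pair is found, emit one fused op instead.
        if i + 1 < len(ops) and ops[i] == "conv2d" and ops[i + 1] == "batch_norm":
            fused.append("conv2d_bn_fused")
            i += 2
        else:
            fused.append(ops[i])
            i += 1
    return fused

graph = ["conv2d", "batch_norm", "relu", "conv2d", "batch_norm"]
print(fuse_conv_bn(graph))  # ['conv2d_bn_fused', 'relu', 'conv2d_bn_fused']
```

A real pass pipeline would run many such rewrites in sequence over a proper graph IR; the list form here only shows the before/after shape of a single fusion pass.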

2. Seamless integration of training and inference.

Unlike standalone inference engines, Paddle Lite is backed by the PaddlePaddle training framework and its rich, complete operator library. The computation logic of the underlying operators is strictly consistent with training, so models are fully compatible without risk, and new models can be supported more quickly.

It also connects to PaddlePaddle's PaddleSlim model-compression tool, directly supporting INT8 quantization-aware-trained models and achieving better accuracy than offline quantization.

3. Generality.

The officially released benchmark covers 18 models spanning image classification, detection, segmentation, and image text recognition, corresponding to 80 operators (Ops) and 85 kernels, and these operators can support other models as well.

It is also compatible with models trained in other frameworks: models trained with Caffe or TensorFlow can be converted for inference via the companion X2Paddle tool.

On the hardware side, it currently supports ARM CPUs, Mali GPUs, Adreno GPUs, Huawei's NPU, FPGAs, and more; support for AI chips such as Cambricon and Bitmain is being adapted, with further hardware to follow.

In addition, a web front-end development interface is provided, supporting JavaScript calls to the GPU so that deep learning models can run quickly in a web page.

4. High performance.

Performance on ARM CPUs is excellent: kernels are deeply optimized for different micro-architectures, showing a speed advantage on mainstream mobile models.

Paddle Lite also supports INT8 quantized computation. Through framework-level optimization and efficient low-level quantized kernels, combined with the INT8 quantization-aware training in the PaddleSlim model-compression tool, it can deliver high-accuracy, high-performance inference.
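The arithmetic behind INT8 quantization can be shown in a few lines. This is a generic sketch of symmetric per-tensor quantization, not Paddle Lite's actual quantization scheme or code; the function names and the choice of a max-abs scale are illustrative assumptions.

```python
# Generic sketch of symmetric INT8 quantization (illustrative, not
# Paddle Lite's implementation): real values map to int8 codes via a
# per-tensor scale, compute runs on the small integers, and results
# are rescaled back to floating point.
def quantize(xs, scale):
    # Round to the nearest int8 code, clamped to [-127, 127].
    return [max(-127, min(127, round(x / scale))) for x in xs]

def dequantize(qs, scale):
    return [q * scale for q in qs]

xs = [0.5, -1.2, 0.03]
scale = max(abs(x) for x in xs) / 127.0   # per-tensor symmetric scale
qs = quantize(xs, scale)
recovered = dequantize(qs, scale)
err = max(abs(a - b) for a, b in zip(xs, recovered))
print(qs)   # [53, -127, 3]
# Rounding error is bounded by half the scale step.
print(err <= scale / 2)   # True
```

Quantization-aware training, as in PaddleSlim, simulates exactly this round-trip during training so the model learns weights that survive it with little accuracy loss, which is why it beats purely offline quantization.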

It also performs well on Huawei's NPU and on FPGAs.

5. Lightweight.

It is deeply customized and optimized for on-device characteristics, with no third-party dependencies. The inference process is split into model loading and parsing, computation-graph optimization and analysis, and efficient execution on the device. The mobile side can deploy the already-optimized graph directly and run prediction.

On the Android platform, the ARMv7 dynamic library is only about 800 KB and the ARMv8 dynamic library only 1.3 MB, and it can be trimmed further as needed.

At present, Paddle Lite and its predecessor technologies are already widely used in Baidu App, Baidu Maps, Baidu Netdisk, and autonomous driving. For example, Baidu App recently introduced real-time dynamic multi-object recognition: with Paddle Lite's support, the original multi-layer cloud vision model was optimized down to 10 layers, recognizing objects within 100 ms and updating object-position tracking within 8 ms.

By comparison, the human eye generally takes 170 ms to 400 ms to recognize an object and about 40 ms per refresh to track one, meaning the system's recognition speed now exceeds that of the human eye.

This is made possible by Paddle Lite's strong on-device inference capability, which supports efficient deployment of PaddlePaddle models across multiple hardware platforms and extreme performance optimization of model applications.

New architecture details

Backed by Baidu, Paddle Lite's architecture incorporates a series of independently developed technologies.

According to the introduction, Paddle Lite draws on the architectures of several of Baidu's internal prediction libraries, integrating their strengths, and focuses on a complete design for hybrid computation across multiple compute modes (hardware, quantization method, data layout). The new architecture is laid out as follows:

The top layer is the model layer, which directly accepts models trained with PaddlePaddle and converts them via a model-optimization tool into the NaiveBuffer format, better suited to mobile deployment scenarios.
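The motivation for a format like NaiveBuffer is to drop the protobuf dependency on the device side in favor of a simple flat binary layout. The sketch below shows that general idea with Python's standard `struct` module; it is NOT the real NaiveBuffer format, and the field layout (name length, name, float count, data) is an invented example.

```python
import struct

# Illustrative flat binary tensor format (NOT the real NaiveBuffer layout):
# each tensor is written as <name length><name bytes><float count><float data>,
# all little-endian, so the reader needs no schema library such as protobuf.
def pack_tensor(name, values):
    blob = name.encode()
    return (struct.pack("<I", len(blob)) + blob
            + struct.pack("<I", len(values))
            + struct.pack(f"<{len(values)}f", *values))

def unpack_tensor(buf):
    n = struct.unpack_from("<I", buf, 0)[0]
    name = buf[4:4 + n].decode()
    count = struct.unpack_from("<I", buf, 4 + n)[0]
    values = list(struct.unpack_from(f"<{count}f", buf, 8 + n))
    return name, values

buf = pack_tensor("conv1.weight", [0.25, -1.0, 3.5])
print(unpack_tensor(buf))  # ('conv1.weight', [0.25, -1.0, 3.5])
```

The trade-off is standard: a fixed flat layout gives a tiny, dependency-free reader, at the cost of the schema evolution and cross-language tooling that protobuf provides, which matters little for an on-device prediction library.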

The second layer is the program layer: the execution program, a sequence of operators.

The third layer is a complete analysis module, including the MIR (Machine IR) components, which can apply various optimizations, such as operator fusion and computation pruning, to the original model's computation graph for a specific hardware list.

Unlike the IR (Intermediate Representation) used during PaddlePaddle training, hardware and execution information is also incorporated into the analysis at this level.

The bottom layer is the execution layer: a runtime program composed of a sequence of kernels. The execution layer's scheduling overhead is extremely low, involving only kernel execution, and it can be deployed on its own to support ultra-lightweight deployment.
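The execution-layer idea, a runtime that is little more than a pre-selected kernel list run in order, can be sketched in a few lines. The class and the toy kernels below are hypothetical illustrations of that structure, not Paddle Lite's actual runtime classes.

```python
# Conceptual sketch of a "Runtime Program": by the time the model reaches
# the device, the analysis phase has already chosen one kernel per op, so
# on-device execution is just running the list in order, with no graph
# analysis or scheduling logic left (illustrative, not Paddle Lite code).
class RuntimeProgram:
    def __init__(self, kernels):
        self.kernels = kernels  # kernels pre-selected offline, in order

    def run(self, x):
        for kernel in self.kernels:
            x = kernel(x)
        return x

# Two toy "kernels" standing in for real compiled ops.
scale2 = lambda v: [2 * e for e in v]
relu = lambda v: [max(0.0, e) for e in v]

program = RuntimeProgram([scale2, relu])
print(program.run([1.5, -3.0]))  # [3.0, 0.0]
```

Because the run loop contains nothing but kernel calls, this layer can be shipped without the analysis module at all, which is what makes the ultra-lightweight deployment described above possible.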

Overall, the design emphasizes not only support for multiple hardware targets and platforms, but also mixed execution of multiple hardware targets within a single model, performance optimization at multiple levels, and a lightweight design for on-device applications.

The rise of domestic deep learning frameworks

The evolution of PaddlePaddle is more than a simple product upgrade; against the broader trends and environment, its significance is changing.

On one hand, there is the broader trend.

This year is an important year for AI deployment. In domestic AI hardware R&D, giants including Baidu, Alibaba, and Huawei are actively designing and manufacturing AI chips.

But rapid hardware development cannot make up for gaps in software, and foreign technology giants have stepped up efforts to occupy this blank space in the market.

At this year's TensorFlow developer conference, Google put a spotlight on TensorFlow Lite, which deploys AI applications at the edge. But that framework currently does not adapt well to the various hardware developed by domestic companies.

Foreign technology companies are unlikely to invest heavily in the many domestic chips from different manufacturers with different architectures. PaddlePaddle saw the opportunity, and is seeing initial results: according to Baidu's Q2 financial report, developer downloads of PaddlePaddle grew 45% in the second quarter of 2019.

As the most popular domestic machine learning framework, it has put great effort into solving the problems of limited applications and difficult development for domestic AI hardware.

On the other hand, there is the big topic that cannot be avoided.

More than in the past, self-reliant R&D and worries about supply cut-offs have been discussed repeatedly in AI development, spanning patents, hardware, and underlying algorithm frameworks, and the issue was put squarely on the table after Huawei's Android supply was cut off. Currently, the two leading deep learning frameworks, TensorFlow and PyTorch, are open-source projects, but both are controlled by US companies and may have to "observe US law."

So the risk of being "choked" at a critical point cannot be ruled out.

Previously, experts of all stripes discussed the question of how to develop such underlying core technology and appealed earnestly, but it rarely turned into real action. It requires not only investment of time, talent, and resources, but also the right timing; at the very least, not starting when the situation is already beyond remedy.

In that sense, the Paddle Lite upgrade comes at just the right time: there is accumulated groundwork, and it is not too late to change lanes and overtake.

That said, after all the talk, trying it is the most direct test. Without further ado, here are the goods for inspection:


For this release of Paddle Lite, the key feature upgrades are summarized as follows:

  1. A major architecture upgrade: by adding Machine IR, a type system, lightweight operators and kernels, and more, it gains important capabilities such as general multi-platform and multi-hardware support, mixed scheduling across multiple precisions and data layouts, dynamic optimization, and lightweight deployment.
  2. Completed the Java API, corresponding to the C++ API.
  3. Added the NaiveBuffer model storage format, decoupling mobile deployment from protobuf to shrink the prediction library.
  4. Support for Caffe and TensorFlow models via X2Paddle; 6 models are currently supported for conversion.
  5. Added deep support for Huawei HiSilicon's NPU, becoming the first framework to support online compilation for Huawei's NPU.
  6. Added FPGA support, verified with the ResNet50 model.
  7. For Mali GPUs and Adreno GPUs, supports hybrid scheduling of OpenCL and ARM CPU kernels, verified on models such as MobileNetV1, MobileNetV2, and ResNet-50.
  8. For ARM CPUs, added support for common models such as VGG-16, EfficientNet-B0, and ResNet-18.
  9. Added 70 hardware kernels.

Official website address:


Project address:

