In this paper, the authors address the poor scalability of machine learning models deployed on CPUs. They propose a novel approach based on the Divide-and-Conquer Principle: instead of allocating all available computing resources to the entire problem, they break it into smaller chunks and let the framework decide how to allocate computing resources among those chunks. The authors argue that in many use cases such a division is natural and requires only trivial changes in user code.
The proposed allocation mechanism is implemented in OnnxRuntime, a popular framework for training and inferencing ML models, and the inference API is extended so that user code can invoke parallel inference on multiple inputs. The approach is evaluated on several use cases, including highly popular models for image processing (PaddleOCR) and NLP tasks (BERT).
Section 2 elaborates on why inference commonly does not scale well on CPUs; one reason is that the amount of computation a model requires during inference may not be "enough" for efficient parallelization. Section 3 describes the concept and implementation of the Divide-and-Conquer Principle as it applies to inference. Section 4 presents several use cases together with performance evaluation results. For instance, the approach allows efficient batching of inference requests of various sizes, eliminating the need for padding and letting the framework allocate computing resources in proportion to the length of each sequence. Section 5 discusses related work, and Section 6 concludes. Overall, the paper offers a simple yet effective way to address the poor scalability of machine learning models on CPUs.
- The paper addresses the issue of poor scalability of machine learning models when deployed on CPUs.
- The authors propose a novel approach based on the Divide-and-Conquer Principle to tackle this problem.
- Instead of allocating all available computing resources to the entire problem, they suggest breaking it into smaller chunks and letting the framework decide how computing resources should be allocated among those chunks.
- The proposed allocation mechanism is implemented in OnnxRuntime, a popular framework for training and inferencing ML models.
- The effectiveness of this approach is demonstrated with several use cases, including highly popular models for image processing (PaddleOCR) and NLP tasks (BERT).
- Section 2 elaborates on various reasons why inference commonly does not scale well on CPUs.
- In Section 3, the authors describe in detail the concept and implementation details of their proposed Divide-and-Conquer Principle as it applies to inference.
- Section 4 presents several use cases where this principle can be applied along with performance evaluation results demonstrating its benefits.
- Their approach allows efficient batching of inference requests of various sizes, eliminating the need for padding and letting the framework allocate computing resources proportionally to the length of each sequence.
- Related work is discussed in Section 5 before concluding in Section 6.
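The idea of allocating computing resources in proportion to the length of each sequence can be sketched as follows. This is a minimal illustration, not the allocation mechanism implemented in OnnxRuntime: the function name `allocate_threads`, the rule that every chunk gets at least one thread, and the largest-remainder rounding are all assumptions made for the example.

```python
def allocate_threads(lengths, total_threads):
    """Split total_threads across chunks in proportion to their lengths.

    Hypothetical helper: each chunk gets at least one thread, and the
    remaining threads are shared proportionally, with rounding leftovers
    going to the chunks that have the largest fractional remainders.
    """
    if not lengths:
        return []
    if any(n <= 0 for n in lengths):
        raise ValueError("chunk lengths must be positive")
    if total_threads < len(lengths):
        raise ValueError("need at least one thread per chunk")

    total = sum(lengths)
    spare = total_threads - len(lengths)          # threads left after one per chunk
    exact = [spare * n / total for n in lengths]  # ideal fractional share of the spare threads
    alloc = [1 + int(x) for x in exact]           # one reserved thread plus the rounded-down share

    # Hand out whatever rounding down left over, largest remainder first.
    leftover = total_threads - sum(alloc)
    by_remainder = sorted(
        range(len(lengths)),
        key=lambda i: exact[i] - int(exact[i]),
        reverse=True,
    )
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc


# Four sequences of different lengths sharing 16 threads (illustrative numbers).
print(allocate_threads([128, 32, 64, 32], 16))
```

The allocation always sums to the thread budget, and longer sequences never receive fewer threads than shorter ones under this rule.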
The paper looks at a problem: machine learning models often do not run much faster on CPUs even when more processor cores are available. The authors suggest breaking the work into smaller parts and letting the software framework decide how many resources each part gets. They tested this idea on popular image and language models and report that it worked well. They explain the idea in detail in Section 3 and show examples of how it can be used in Section 4. The approach also lets requests of different sizes be processed together without padding them to the same length.
Definitions:
- Machine learning models: computer programs that can learn from data and make predictions or decisions based on that data
- Scalability: the ability to handle larger amounts of work or data without losing performance
- CPUs: central processing units, the main component of a computer that performs most of its processing tasks
- Divide-and-Conquer Principle: a strategy where a big problem is broken down into smaller, more manageable problems
- OnnxRuntime: a software framework used for training and running machine learning models
Scalability of Machine Learning Models on CPUs: A Divide-and-Conquer Approach
Machine learning models often scale poorly when deployed on CPUs. In this paper, the authors propose a novel approach based on the Divide-and-Conquer Principle to tackle this problem. The approach is implemented in OnnxRuntime, a popular framework for training and inferencing ML models, and its effectiveness is demonstrated with several use cases, including highly popular models for image processing (PaddleOCR) and NLP tasks (BERT).
Background
Several issues can lead to poor scalability when machine learning models are deployed on CPUs. One is that the amount of computation a model requires during inference may not be "enough" for efficient parallelization. In addition, existing approaches often require manual tuning or padding to achieve good performance, which can be difficult and time-consuming.
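To make the cost of padding concrete, the short sketch below computes what fraction of a padded batch consists of padding when every sequence is padded to the length of the longest one. The function name `padding_overhead` and the sequence lengths are hypothetical and chosen only for illustration; they are not figures from the paper.

```python
def padding_overhead(lengths):
    """Return the fraction of a padded batch that is padding.

    Assumes every sequence is padded to the length of the longest one.
    """
    padded = max(lengths) * len(lengths)  # total slots after padding
    useful = sum(lengths)                 # slots holding real tokens
    return (padded - useful) / padded


lengths = [16, 32, 64, 128]  # hypothetical token counts per request
print(f"{padding_overhead(lengths):.1%} of the padded batch is padding")
```

The more the lengths in a batch differ, the larger this fraction becomes, which is the overhead that processing each sequence at its own length avoids.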
Proposed Solution
The authors propose an alternative based on the Divide-and-Conquer Principle: users break their problem into smaller chunks and let the framework decide how computing resources should be allocated among those chunks. Because many use cases divide naturally, for example a batch of requests of various sizes, this requires only trivial changes in user code and avoids padding and manual tuning.
To implement this approach in OnnxRuntime, the authors extended its inference API so that user code can invoke parallel inference on multiple inputs using the proposed allocation mechanism. The performance evaluation shows benefits over existing methods: throughput improves for large batches because resources are better utilized, and padding overhead is eliminated for small batches because resources are allocated dynamically across sequences of different lengths.
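The calling pattern, in which user code submits several inputs of different lengths and gets one result per input, can be sketched with the Python standard library. This is only an illustration of the shape of such a call: it does not use OnnxRuntime or the authors' extended API, `fake_infer` is a stand-in for a real model call, and `run_parallel` is a hypothetical helper. Note also that pure-Python work in threads is limited by the interpreter lock, so the sketch shows the interface rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_infer(tokens):
    """Stand-in for a model call; it simply sums the token values."""
    return sum(tokens)


def run_parallel(infer, inputs, max_workers=4):
    """Run infer on every input concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(infer, inputs))


# Variable-length inputs are passed as they are, with no padding.
inputs = [[1, 2, 3], [4, 5], [6]]
results = run_parallel(fake_infer, inputs)
print(results)
```

In this sketch each input is handled as its own unit of work, which mirrors the division into chunks described above; how many resources each chunk receives is left to the executor here, whereas in the paper it is decided by the framework.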
Related Work
Section 5 discusses related work. Section 6 concludes by summarizing the findings and discussing future research directions, such as applying the proposed solution to tasks beyond image processing and NLP.
Conclusion
Overall, this paper offers a simple yet effective way to address the poor scalability of machine learning models on CPUs. The approach, based on the Divide-and-Conquer Principle and implemented in the OnnxRuntime framework, lets users break their problems into smaller chunks while the framework decides how computing resources are allocated among them. Compared with existing methods, this improves throughput for large batches through better resource utilization and eliminates padding overhead for small batches through dynamic resource allocation across sequences of different lengths.