Deep learning has revolutionized the way we solve complex problems, from image recognition to natural language processing. However, while training these models often relies on high-performance GPUs, deploying them effectively in resource-constrained environments such as edge devices or systems with limited hardware presents unique challenges. CPUs, being widely available and cost-efficient, often serve as the backbone for inference in such scenarios. But how do we ensure that models deployed on CPUs deliver optimal performance without compromising accuracy?
This article dives into benchmarking deep learning model inference on CPUs, focusing on three critical metrics: latency, CPU utilization, and memory utilization. Using a spam classification example, we explore how popular frameworks such as PyTorch, TensorFlow, JAX, ONNX Runtime, and OpenVINO handle inference workloads. By the end, you’ll have a clear understanding of how to measure performance, optimize deployments, and select the right tools and frameworks for CPU-based inference in resource-constrained environments.
Impact: Efficient inference execution can save a significant amount of money and free up resources for other workloads.
Learning Objectives
- Understand the role of deep learning GPU benchmarks in assessing hardware performance for AI model training and inference.
- Learn how to use deep learning GPU benchmarks to compare GPUs and optimize computational efficiency for AI tasks.
- Evaluate PyTorch, TensorFlow, JAX, ONNX Runtime, and OpenVINO Runtime to choose the best fit for your needs.
- Master tools like psutil and time to collect accurate performance data and optimize inference.
- Prepare models, run inference, and measure performance, applying the techniques to diverse tasks such as image classification and NLP.
- Identify bottlenecks, optimize models, and improve performance while managing resources efficiently.
This article was published as a part of the Data Science Blogathon.
Optimizing Inference with Runtime Acceleration
Inference speed is critical for user experience and operational efficiency in machine learning applications. Runtime optimization plays a key role in improving it by streamlining execution. Hardware-accelerated libraries such as ONNX Runtime take advantage of optimizations tailored to specific architectures, reducing latency (time per inference).
Additionally, lightweight model formats such as ONNX minimize overhead, enabling faster loading and execution. Optimized runtimes use parallel processing to distribute computation across the available CPU cores and improve memory management, which helps performance especially on systems with limited resources. This approach makes models faster and more efficient while maintaining accuracy.
Model Inference Performance Metrics
To evaluate the performance of our models, we focus on three key metrics:
Latency
- Definition: Latency is the time the model takes to produce a prediction after receiving input, usually measured from sending the input data to receiving the output (prediction).
- Importance: In real-time or near-real-time applications, high latency causes delays and slower responses.
- Measurement: Latency is typically measured in milliseconds (ms) or seconds (s). Lower latency means the system is more responsive and efficient, which is crucial for applications requiring immediate decision-making or actions.
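As a minimal sketch of the measurement described above, each call can be timed with `time.perf_counter` and summarized with the mean, median, and 95th percentile. The helper name `measure_latency_ms` and the toy workload are illustrative assumptions, not part of the benchmark used in this article.

```python
import statistics
import time


def measure_latency_ms(fn, num_runs=100):
    """Time `fn` num_runs times and return summary statistics in milliseconds."""
    samples = []
    for _ in range(num_runs):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank 95th percentile: index ceil(0.95 * n) - 1, in integer arithmetic
    p95_index = (95 * len(samples) + 99) // 100 - 1
    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
    }


# Hypothetical stand-in for a model call
stats = measure_latency_ms(lambda: sum(i * i for i in range(10_000)))
print(stats)
```

Reporting a percentile next to the mean helps reveal occasional slow calls that an average alone would hide.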
CPU Utilization
- Definition: CPU utilization is the percentage of the CPU’s processing power consumed while performing inference tasks. It tells you how much of the system’s computational resources are being used during model inference.
- Importance: High CPU utilization means the machine may struggle to handle other tasks at the same time, leading to bottlenecks. Efficient use of CPU resources ensures that model inference does not monopolize the system.
- Measurement: It is typically measured as a percentage (%) of the total available CPU resources. Lower utilization for the same workload generally indicates a more optimized model that uses CPU resources more effectively.
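One way to make this percentage concrete is to compare the CPU time a process consumed with the wall-clock time that elapsed. The sketch below uses only the standard library (`time.process_time` and `time.perf_counter`) rather than psutil; the helper name `cpu_utilization_percent` is an assumption made for illustration.

```python
import os
import time


def cpu_utilization_percent(fn):
    """Return (per_core_percent, whole_machine_percent) for one call to `fn`.

    per_core_percent can exceed 100 when the work runs on several cores;
    whole_machine_percent divides that figure by the logical core count.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # user + system CPU time of this process
    fn()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    per_core = 100.0 * cpu_used / wall_used if wall_used > 0 else 0.0
    cores = os.cpu_count() or 1
    return per_core, per_core / cores


# Hypothetical CPU-bound workload standing in for model inference
per_core, whole_machine = cpu_utilization_percent(
    lambda: sum(i * i for i in range(200_000))
)
print(f"{per_core:.1f}% of one core, {whole_machine:.1f}% of the machine")
```

Keeping both figures apart matters when reading results: a value above 100% relative to a single core is not an error, it simply means more than one core was busy.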
Memory Utilization
- Definition: Memory utilization is the amount of RAM used by the model during inference. It covers the memory consumed by the model’s parameters, intermediate computations, and the input data.
- Importance: Optimizing memory utilization is especially important when deploying models to edge devices or systems with limited memory. High memory consumption may lead to memory overflow, slower processing, or system crashes.
- Measurement: Memory utilization is measured in megabytes (MB) or gigabytes (GB). Monitoring memory consumption at different stages of inference can help identify memory inefficiencies or memory leaks.
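For a quick look at memory consumed by a single call, the standard library’s `tracemalloc` can report the peak traced allocation. This is only a rough sketch: `tracemalloc` sees allocations made through Python’s allocators, so native buffers allocated inside a framework may not be counted, which is one reason this article measures resident memory (RSS) with psutil instead. The helper name `peak_python_memory_mb` is hypothetical.

```python
import tracemalloc


def peak_python_memory_mb(fn):
    """Run `fn` and return the peak memory traced by tracemalloc, in MB."""
    tracemalloc.start()
    try:
        fn()
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / (1024 * 1024)


# Hypothetical workload: allocate an 8 MB buffer
print(f"{peak_python_memory_mb(lambda: bytearray(8 * 1024 * 1024)):.1f} MB")
```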
Assumptions and Limitations
To keep this benchmarking study focused and practical, we made the following assumptions and set a few boundaries:
- Hardware Constraints: The tests are designed to run on a single machine with limited CPU cores. While modern hardware is capable of handling parallel workloads, this setup mirrors the constraints often seen in edge devices or smaller-scale deployments.
- No Multi-System Parallelization: We did not incorporate distributed computing setups or cluster-based solutions. The benchmarks reflect performance under standalone conditions, suitable for single-node environments with limited CPU cores and memory.
- Scope: The focus is solely on CPU inference performance. While GPU-based inference is an excellent option for resource-intensive tasks, this benchmark aims to provide insights into CPU-only setups, which are more common in cost-sensitive or portable applications.
These assumptions keep the benchmarks relevant for developers and teams working with resource-constrained hardware or who need predictable performance without the added complexity of distributed systems.
We’ll explore the essential tools and frameworks used to benchmark and optimize deep learning model inference on CPUs, and what each offers for efficient execution in resource-constrained environments.
Profiling Tools
- Python Time (time library): The time library in Python is a lightweight tool for measuring the execution time of code blocks. By recording start and end timestamps, it helps calculate the time taken by operations such as model inference or data processing.
- psutil (CPU, Memory Profiling): psutil is a Python library for system monitoring and profiling. It provides real-time data on CPU utilization, memory consumption, disk I/O, and more, making it well suited to analyzing resource usage during model training or inference.
Frameworks for Inference
- TensorFlow: A robust framework for deep learning that is widely used for both training and inference tasks. It offers strong support for various models and deployment strategies.
- PyTorch: Known for its ease of use and dynamic computation graphs, PyTorch is a popular choice for research and production deployment.
- ONNX Runtime: An open-source, cross-platform engine for running ONNX (Open Neural Network Exchange) models, providing efficient inference across various hardware and frameworks.
- JAX: A functional framework focused on high-performance numerical computing and machine learning, offering automatic differentiation and GPU/TPU acceleration.
- OpenVINO: Optimized for Intel hardware, OpenVINO provides tools for model optimization and deployment on Intel CPUs, GPUs, and VPUs.
Hardware Specification and Environment
We are using a GitHub Codespace (virtual machine) with the configuration below:
- Specification of Virtual Machine: 2 cores, 8 GB RAM, and 32 GB storage
- Python Version: 3.12.1
Install Dependencies
The package versions used are listed below. The list mainly comprises five deep learning inference libraries: TensorFlow, PyTorch, ONNX Runtime, JAX, and OpenVINO:
!pip install numpy==1.26.4
!pip install torch==2.2.2
!pip install tensorflow==2.16.2
!pip install onnx==1.17.0
!pip install onnxruntime==1.17.0
!pip install jax==0.4.30
!pip install jaxlib==0.4.30
!pip install openvino==2024.6.0
!pip install matplotlib==3.9.3
!pip install pillow==8.3.2
!pip install psutil==5.8.0
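After installation, it can be useful to confirm which versions actually ended up in the environment. The following sketch uses the standard library’s `importlib.metadata`; the helper name `installed_versions` is an assumption introduced here for illustration.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


packages = ["numpy", "torch", "tensorflow", "onnx", "onnxruntime",
            "jax", "jaxlib", "openvino", "matplotlib", "pillow", "psutil"]
for name, version in installed_versions(packages).items():
    print(f"{name}: {version or 'not installed'}")
```

Recording these versions next to the benchmark results makes later runs easier to compare.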
Problem Statement and Input Specification
Since model inference consists of a series of matrix operations between network weights and input data, benchmarking it does not require model training or real datasets. For our benchmarking example, we simulated a standard classification use case, representative of common binary classification tasks such as spam detection and loan application decisions (approval or denial). The binary nature of these problems makes them well suited to comparing model performance across frameworks. This setup reflects real-world systems while letting us focus on inference performance without needing large datasets or pre-trained models.
Problem Statement
The sample task involves predicting whether a given sample is spam or not (or, analogously, loan approval or denial), based on a set of input features. This binary classification problem is computationally efficient, allowing a focused analysis of inference performance without the complexity of multi-class classification tasks.
Input Specification
To simulate real-world email data, we generated random input vectors. These vectors stand in for the embeddings a spam filter would process, and they avoid the need for external datasets. The simulated input makes it possible to test model inference time, memory utilization, and CPU performance without relying on any specific dataset. Alternatively, you can use image classification, an NLP task, or any other deep learning task to carry out this benchmarking process.
Model Architecture and Formats
Model selection is a critical step in benchmarking because it directly influences the inference performance and the insights gained from profiling. As mentioned in the previous section, for this benchmarking study we chose a standard classification use case: identifying whether a given email is spam or not. This is a simple two-class classification problem that is computationally efficient yet gives meaningful results for comparison across frameworks.
Model Architecture for Benchmarking
The model for the classification task is a Feedforward Neural Network (FNN) designed for binary classification (Spam vs. Not Spam). It consists of the following layers:
- Input Layer: Accepts a vector of size 200 (embedding features). The example shown is for PyTorch; the other frameworks follow the exact same network configuration.
self.fc1 = torch.nn.Linear(200,128)
- Hidden Layers: The network has 5 hidden layers (128, 64, 32, 16, and 8 units), each containing fewer units than the previous one. The last linear layer shown here (fc6) maps the final hidden layer to the single output.
self.fc2 = torch.nn.Linear(128, 64)
self.fc3 = torch.nn.Linear(64, 32)
self.fc4 = torch.nn.Linear(32, 16)
self.fc5 = torch.nn.Linear(16, 8)
self.fc6 = torch.nn.Linear(8, 1)
- Output Layer: A single neuron with a sigmoid activation function that outputs a probability between 0 and 1 (values near 0 indicate Not Spam, values near 1 indicate Spam).
self.sigmoid = torch.nn.Sigmoid()
The model is simple yet effective for the classification task.
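To get a feel for how small this network is, the parameter count can be worked out directly from the layer sizes listed above. This is plain arithmetic rather than part of the benchmark; the helper name `count_parameters` is introduced here for illustration.

```python
# Layer widths of the feedforward network: 200 inputs, five hidden layers, 1 output
layer_sizes = [200, 128, 64, 32, 16, 8, 1]


def count_parameters(sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each fully connected layer."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


total = count_parameters(layer_sizes)
size_kib = total * 4 / 1024  # float32 uses 4 bytes per parameter
print(total)               # 36737
print(round(size_kib, 1))  # 143.5
```

At roughly 144 KiB of float32 weights, the model itself is tiny, so process-level memory measurements for this benchmark are presumably dominated by the framework runtimes rather than by the network.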
The model architecture diagram used for benchmarking in our use case is shown below:
Examples of Additional Networks for Benchmarking
- Image Classification: Models such as ResNet-50 (medium complexity) and MobileNet (lightweight) can be added to the benchmark suite for image recognition tasks. ResNet-50 offers a balance between computational complexity and accuracy, while MobileNet is optimized for low-resource environments.
- NLP Tasks: DistilBERT, a smaller, faster variant of the BERT model, is suited to natural language understanding tasks.
Model Formats
- Native Formats: Each framework supports its native model formats, such as .pt for PyTorch and .h5 for TensorFlow. These formats are optimized for their respective frameworks, making them easy to save, load, and deploy within those ecosystems.
- Unified Format (ONNX): To ensure compatibility across frameworks, we exported the PyTorch model to the ONNX format (model.onnx). ONNX (Open Neural Network Exchange) acts as a bridge, enabling models to be used with other frameworks and runtimes such as PyTorch, TensorFlow, JAX, or OpenVINO without significant modifications. This is especially useful for multi-framework testing and real-world deployment scenarios, where interoperability is critical.
Benchmarking Workflow
This workflow compares the inference performance of several deep learning frameworks (TensorFlow, PyTorch, ONNX, JAX, and OpenVINO) on the classification task. It uses randomly generated input data and benchmarks each framework to measure the average time taken for a prediction.
- Import Python packages
- Disable GPU usage and suppress TensorFlow logging
- Input data preparation
- Model implementations for each framework
- Benchmarking function definition
- Model inference and benchmarking execution for each framework
- Visualization and export of benchmarking results
Import Necessary Python Packages
To get started with benchmarking deep learning models, we first import the Python packages needed for running the frameworks and measuring performance.
import time
import os
import numpy as np
import torch
import tensorflow as tf
from tensorflow.keras import Input
import onnxruntime as ort
import matplotlib.pyplot as plt
from PIL import Image
import psutil
import jax
import jax.numpy as jnp
from openvino.runtime import Core
import csv
Disable GPU Utilization and Suppress TensorFlow Logging
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # Disable GPU
# Suppress TensorFlow logs (to silence import-time messages, set this before importing tensorflow)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
Input Data Preparation
In this step, we randomly generate input data for spam classification:
- Dimensionality of a sample (200-dimensional features)
- The number of classes (2: Spam or Not Spam)
We generate random data using NumPy to serve as input features for the models.
# Generate dummy data: 1,000 samples with 200 features each
input_data = np.random.rand(1000, 200).astype(np.float32)
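If repeatable inputs are needed across runs, a seeded generator can be used instead. The sketch below relies on NumPy’s `default_rng`; the helper name `make_inputs` and its seed parameter are assumptions added for illustration, not part of the original workflow.

```python
import numpy as np


def make_inputs(num_samples=1000, num_features=200, seed=0):
    """Uniform random features in [0, 1), reproducible for a given seed."""
    rng = np.random.default_rng(seed)
    return rng.random((num_samples, num_features), dtype=np.float32)


seeded_inputs = make_inputs()
print(seeded_inputs.shape, seeded_inputs.dtype)
```

With a fixed seed, every framework receives exactly the same input across runs, which removes one source of variation when comparing results.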
Model Definition
In this step, we define the network architecture or set up the model for each deep learning framework (TensorFlow, PyTorch, ONNX, JAX, and OpenVINO). Each framework requires specific methods for loading models and setting them up for inference.
- PyTorch Model: In PyTorch, we define a simple neural network architecture with six fully connected (linear) layers: five hidden layers and one output layer.
- TensorFlow Model: The TensorFlow model is defined using the Keras API and consists of a simple feedforward neural network for the classification task.
- JAX Model: The model is written as a pure function over fixed weight matrices. The prediction function can be compiled with JAX’s Just-in-Time (JIT) compilation (jax.jit) for efficient execution; the listing in this article calls it without an explicit jax.jit wrapper.
- ONNX Model: For ONNX, we export the model from PyTorch. After exporting to the ONNX format, we load the model using the onnxruntime.InferenceSession API. This allows us to run inference on the model across different hardware specifications.
- OpenVINO Model: OpenVINO is used for optimizing and deploying models, particularly those trained using other frameworks (such as PyTorch or TensorFlow). We load the ONNX model and compile it with OpenVINO’s runtime.
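Since every framework is meant to implement the same network, a framework-neutral reference can help sanity-check output shapes and value ranges. The following is a NumPy sketch under our own assumptions: the names `init_params` and `forward`, the random He-style initialization, and the zero biases are illustrative and are not taken from the benchmark itself.

```python
import numpy as np

LAYER_SIZES = [200, 128, 64, 32, 16, 8, 1]


def init_params(seed=0):
    """Random weights and zero biases for each fully connected layer."""
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:]):
        scale = np.float32(np.sqrt(2.0 / n_in))
        w = rng.standard_normal((n_in, n_out)).astype(np.float32) * scale
        b = np.zeros(n_out, dtype=np.float32)
        params.append((w, b))
    return params


def forward(params, x):
    """ReLU on the five hidden layers, sigmoid on the single output unit."""
    for w, b in params[:-1]:
        x = np.maximum(x @ w + b, 0.0)
    w, b = params[-1]
    logits = x @ w + b
    return 1.0 / (1.0 + np.exp(-logits))


batch = np.random.default_rng(1).random((4, 200), dtype=np.float32)
probabilities = forward(init_params(), batch)
print(probabilities.shape)  # (4, 1)
```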
PyTorch
class PyTorchModel(torch.nn.Module):
    def __init__(self):
        super(PyTorchModel, self).__init__()
        # Five hidden layers followed by a single-unit output layer
        self.fc1 = torch.nn.Linear(200, 128)
        self.fc2 = torch.nn.Linear(128, 64)
        self.fc3 = torch.nn.Linear(64, 32)
        self.fc4 = torch.nn.Linear(32, 16)
        self.fc5 = torch.nn.Linear(16, 8)
        self.fc6 = torch.nn.Linear(8, 1)
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        x = torch.relu(self.fc4(x))
        x = torch.relu(self.fc5(x))
        x = self.sigmoid(self.fc6(x))
        return x

# Create PyTorch model
pytorch_model = PyTorchModel()
TensorFlow
tensorflow_model = tf.keras.Sequential([
Input(shape=(200,)),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(32, activation='relu'),
tf.keras.layers.Dense(16, activation='relu'),
tf.keras.layers.Dense(8, activation='relu'),
tf.keras.layers.Dense(1, activation='sigmoid')
])
tensorflow_model.compile()
JAX
def jax_model(x):
    # Weights are constant all-ones matrices with the same shapes as the other models
    x = jax.nn.relu(jnp.dot(x, jnp.ones((200, 128))))
    x = jax.nn.relu(jnp.dot(x, jnp.ones((128, 64))))
    x = jax.nn.relu(jnp.dot(x, jnp.ones((64, 32))))
    x = jax.nn.relu(jnp.dot(x, jnp.ones((32, 16))))
    x = jax.nn.relu(jnp.dot(x, jnp.ones((16, 8))))
    x = jax.nn.sigmoid(jnp.dot(x, jnp.ones((8, 1))))
    return x
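The JAX function above is called without `jax.jit`, and JAX dispatches work asynchronously, so a timer can stop before the result is actually computed. The sketch below shows one possible way to compile the same layer shapes with `jax.jit` and to wait for the result with `block_until_ready()` before reading the clock. The names `mlp` and `mlp_jit` are our own, and this variant was not used to produce the results reported in this article.

```python
import time

import jax
import jax.numpy as jnp


def mlp(x):
    """Same layer shapes as the listing above, with constant all-ones weights."""
    for n_in, n_out in [(200, 128), (128, 64), (64, 32), (32, 16), (16, 8)]:
        x = jax.nn.relu(jnp.dot(x, jnp.ones((n_in, n_out))))
    return jax.nn.sigmoid(jnp.dot(x, jnp.ones((8, 1))))


mlp_jit = jax.jit(mlp)  # compiled on the first call for each input shape and dtype

x = jnp.ones((1000, 200), dtype=jnp.float32)
mlp_jit(x).block_until_ready()  # warm-up call: triggers compilation

start = time.perf_counter()
out = mlp_jit(x).block_until_ready()  # wait for the result before stopping the clock
elapsed_ms = (time.perf_counter() - start) * 1000.0
print(out.shape, f"{elapsed_ms:.3f} ms")
```

Keeping the warm-up call outside the timed region separates one-time compilation cost from steady-state inference latency.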
ONNX
# Convert PyTorch model to ONNX
dummy_input = torch.randn(1, 200)
onnx_model_path = "model.onnx"
torch.onnx.export(
    pytorch_model,
    dummy_input,
    onnx_model_path,
    export_params=True,
    opset_version=11,
    input_names=['input'],
    output_names=['output'],
    # Mark the batch dimension as dynamic so any batch size can be fed at inference time
    dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}
)
onnx_session = ort.InferenceSession(onnx_model_path)
OpenVINO
# OpenVINO Model Definition
core = Core()
openvino_model = core.read_model(model="model.onnx")
compiled_model = core.compile_model(openvino_model, device_name="CPU")
Benchmarking Function Definition
This function runs the benchmark for a given framework. It takes three arguments: predict_function, input_data, and num_runs. By default it executes 1,000 runs, but this can be increased as required. It returns the average latency (in seconds per run), the average CPU utilization (%), and the average resident memory (MB).
def benchmark_model(predict_function, input_data, num_runs=1000):
    start_time = time.time()
    process = psutil.Process(os.getpid())
    cpu_usage = []
    memory_usage = []
    for _ in range(num_runs):
        predict_function(input_data)
        # The first cpu_percent() call returns 0.0; later calls report usage since the previous call
        cpu_usage.append(process.cpu_percent())
        memory_usage.append(process.memory_info().rss)
    end_time = time.time()
    avg_latency = (end_time - start_time) / num_runs  # seconds per run
    avg_cpu = np.mean(cpu_usage)
    avg_memory = np.mean(memory_usage) / (1024 * 1024)  # Convert to MB
    return avg_latency, avg_cpu, avg_memory
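The function above starts timing with the very first call, so any one-time cost (lazy initialization, cache warm-up, JIT compilation) is folded into the average. A common refinement is to run a few untimed warm-up calls first. The sketch below illustrates this with only the standard library; the name `benchmark_with_warmup` and the `warmup_runs` parameter are assumptions, and the results in this article were not produced this way.

```python
import time


def benchmark_with_warmup(predict_function, input_data, num_runs=100, warmup_runs=5):
    """Average latency in ms over num_runs calls, after discarding warm-up calls."""
    for _ in range(warmup_runs):
        predict_function(input_data)  # not timed: absorbs lazy init, caches, JIT
    start = time.perf_counter()
    for _ in range(num_runs):
        predict_function(input_data)
    return (time.perf_counter() - start) * 1000.0 / num_runs


# Hypothetical predict function standing in for a framework call
call_log = []
avg_ms = benchmark_with_warmup(lambda data: call_log.append(len(data)), [0.0] * 200)
print(f"{avg_ms:.6f} ms per call over 100 timed calls ({len(call_log)} calls in total)")
```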
Model Inference and Benchmarking for Each Framework
Now that we’ve loaded the models, it’s time to benchmark the performance of each framework. The benchmarking process performs inference on the generated input data.
PyTorch
# Benchmark PyTorch model
def pytorch_predict(input_data):
    # Processes all samples in a single batched call
    pytorch_model(torch.tensor(input_data))

pytorch_latency, pytorch_cpu, pytorch_memory = benchmark_model(pytorch_predict, input_data)
TensorFlow
# Benchmark TensorFlow model
def tensorflow_predict(input_data):
    # Processes all samples in a single batched call
    tensorflow_model(input_data)

tensorflow_latency, tensorflow_cpu, tensorflow_memory = benchmark_model(tensorflow_predict, input_data)
JAX
# Benchmark JAX model
def jax_predict(input_data):
    # Processes all samples in a single batched call
    jax_model(jnp.array(input_data))

jax_latency, jax_cpu, jax_memory = benchmark_model(jax_predict, input_data)
ONNX
# Benchmark ONNX model
def onnx_predict(input_data):
    # Run inference one sample at a time
    input_name = onnx_session.get_inputs()[0].name
    for i in range(input_data.shape[0]):
        single_input = input_data[i:i+1]  # Extract a single input, keeping the batch dimension
        onnx_session.run(None, {input_name: single_input})

onnx_latency, onnx_cpu, onnx_memory = benchmark_model(onnx_predict, input_data)
OpenVINO
# Benchmark OpenVINO model
def openvino_predict(input_data):
    # Run inference one sample at a time
    for i in range(input_data.shape[0]):
        single_input = input_data[i:i+1]  # Extract a single input, keeping the batch dimension
        compiled_model.infer_new_request({0: single_input})

openvino_latency, openvino_cpu, openvino_memory = benchmark_model(openvino_predict, input_data)
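The ONNX and OpenVINO prediction functions above loop over the samples one at a time, while the PyTorch, TensorFlow, and JAX functions receive the whole 1,000-sample batch in one call. The NumPy-only sketch below illustrates how the two calling patterns can be compared and how a per-run latency can be normalized to a per-sample figure. It uses a single matrix product as a stand-in for a model, so the names and the workload are illustrative assumptions rather than any framework’s behavior.

```python
import time

import numpy as np

rng = np.random.default_rng(0)
inputs = rng.random((1000, 200), dtype=np.float32)
weights = rng.random((200, 1), dtype=np.float32)


def predict_batched(x):
    """One call over the whole batch."""
    return x @ weights


def predict_per_sample(x):
    """One call per sample, as in the ONNX and OpenVINO loops above."""
    return np.concatenate([x[i:i + 1] @ weights for i in range(x.shape[0])])


def time_ms(fn, x, repeats=20):
    start = time.perf_counter()
    for _ in range(repeats):
        fn(x)
    return (time.perf_counter() - start) * 1000.0 / repeats


batched_ms = time_ms(predict_batched, inputs)
per_sample_ms = time_ms(predict_per_sample, inputs)
print(f"batched: {batched_ms:.3f} ms per 1,000 samples")
print(f"per-sample loop: {per_sample_ms:.3f} ms per 1,000 samples")
print(f"per-sample loop, per sample: {per_sample_ms / inputs.shape[0]:.5f} ms")
```

Both functions compute the same outputs, so any timing difference comes from call overhead. When comparing frameworks, it helps to use the same calling pattern for all of them or to report latency per sample.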
Results and Discussion
Here we discuss the results of benchmarking the deep learning frameworks described above. We compare them on latency, CPU utilization, and memory utilization, with tabular data and a plot for quick comparison.
Latency Comparability
Framework | Latency (ms) | Relative Latency (vs. PyTorch) |
PyTorch | 1.26 | 1.0 (baseline) |
TensorFlow | 6.61 | ~5.25× |
JAX | 3.15 | ~2.50× |
ONNX | 14.75 | ~11.72× |
OpenVINO | 144.84 | ~115× |
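Insights:
- PyTorch is the fastest framework here, with ~1.26 ms latency.
- TensorFlow has ~6.61 ms latency, about 5.25× PyTorch’s time.
- JAX sits between PyTorch and TensorFlow in absolute latency, at ~3.15 ms.
- ONNX is relatively slow as well, at ~14.75 ms.
- OpenVINO is the slowest in this experiment, at ~145 ms (about 115× PyTorch’s latency).
- Note that the ONNX and OpenVINO prediction functions run the 1,000 samples one at a time, while the PyTorch, TensorFlow, and JAX functions process them as a single batch, so part of this gap likely reflects per-call overhead rather than the runtimes themselves.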
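The relative latency column is simply each framework’s latency divided by the PyTorch baseline. A short sketch of that arithmetic, using the rounded values from the table (so the last digit may differ slightly from ratios computed on unrounded measurements):

```python
# Latencies in milliseconds, copied from the table above
latency_ms = {
    "PyTorch": 1.26,
    "TensorFlow": 6.61,
    "JAX": 3.15,
    "ONNX": 14.75,
    "OpenVINO": 144.84,
}

baseline = latency_ms["PyTorch"]
relative = {name: value / baseline for name, value in latency_ms.items()}
for name, ratio in relative.items():
    print(f"{name}: {ratio:.2f}x")
```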
CPU Utilization
Framework | CPU Utilization (%) | Relative CPU Utilization (vs. OpenVINO) |
PyTorch | 99.79 | ~1.00 |
TensorFlow | 112.26 | ~1.13 |
JAX | 130.03 | ~1.31 |
ONNX | 99.58 | ~1.00 |
OpenVINO | 99.32 | 1.00 (baseline) |
Insights:
- JAX uses the most CPU (~130%), ~31% higher than OpenVINO. Values above 100% mean the process was using more than one core, since psutil reports per-process CPU relative to a single core.
- TensorFlow is at ~112%, more than PyTorch, ONNX, and OpenVINO but still lower than JAX.
- PyTorch, ONNX, and OpenVINO all have similar CPU utilization of ~99-100%.
Memory Utilization
Framework | Memory (MB) | Relative Memory Utilization (vs. PyTorch) |
PyTorch | ~959.69 | 1.0 (baseline) |
TensorFlow | ~969.72 | ~1.01× |
JAX | ~1033.63 | ~1.08× |
ONNX | ~1033.82 | ~1.08× |
OpenVINO | ~1040.80 | ~1.08–1.09× |
Insights:
- PyTorch and TensorFlow have similar memory utilization, at around 960-970 MB.
- JAX, ONNX, and OpenVINO use around 1,030–1,040 MB of memory, roughly 8–9% more than PyTorch.
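The workflow lists an export step for the results. A minimal sketch of writing the three tables above to a single CSV file with the standard library follows. The column names, the helper `export_results`, and the file name `benchmark_results.csv` are assumptions chosen for illustration.

```python
import csv
import os
import tempfile

# Values copied from the latency, CPU, and memory tables above
results = [
    {"framework": "PyTorch", "latency_ms": 1.26, "cpu_percent": 99.79, "memory_mb": 959.69},
    {"framework": "TensorFlow", "latency_ms": 6.61, "cpu_percent": 112.26, "memory_mb": 969.72},
    {"framework": "JAX", "latency_ms": 3.15, "cpu_percent": 130.03, "memory_mb": 1033.63},
    {"framework": "ONNX", "latency_ms": 14.75, "cpu_percent": 99.58, "memory_mb": 1033.82},
    {"framework": "OpenVINO", "latency_ms": 144.84, "cpu_percent": 99.32, "memory_mb": 1040.80},
]


def export_results(rows, path):
    """Write one CSV row per framework, with a header taken from the dict keys."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


# Written to the system temp directory here; any writable path works
output_path = os.path.join(tempfile.gettempdir(), "benchmark_results.csv")
export_results(results, output_path)
print(f"Wrote {len(results)} rows to {output_path}")
```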
Here is the plot comparing the performance of the deep learning frameworks:
Conclusion
In this article, we presented a benchmarking workflow to evaluate the inference performance of prominent deep learning frameworks (TensorFlow, PyTorch, ONNX, JAX, and OpenVINO), using a spam classification task as a reference. By analyzing key metrics such as latency, CPU utilization, and memory consumption, the results highlighted the trade-offs between frameworks and their suitability for different deployment scenarios.
PyTorch demonstrated the most balanced performance, combining low latency with efficient memory utilization, which makes it well suited to latency-sensitive applications such as real-time predictions and recommendation systems. TensorFlow offered a middle-ground solution with moderately higher resource consumption. JAX showed high computational throughput but at the cost of increased CPU utilization, which may be a limiting factor in resource-constrained environments. Meanwhile, ONNX and OpenVINO lagged in latency, with OpenVINO’s performance notably hindered by the absence of hardware acceleration.
These findings underline the importance of aligning framework selection with deployment needs. Whether optimizing for speed, resource efficiency, or specific hardware, understanding the trade-offs is essential for effective model deployment in real-world environments.
Key Takeaways
- Deep learning GPU benchmarks provide important insights into GPU performance, aiding in selecting optimal hardware for AI tasks.
- Using deep learning GPU benchmarks supports efficient model training and inference by identifying high-performing GPUs.
- PyTorch: Achieved the best latency (1.26 ms) and maintained efficient memory utilization, making it well suited to real-time and resource-limited applications.
- TensorFlow: Balanced latency (6.61 ms) with slightly higher CPU utilization, suitable for tasks that can accept moderate performance compromises.
- JAX: Delivered competitive latency (3.15 ms) but at the cost of high CPU utilization (130%), limiting its usefulness in constrained setups.
- ONNX: Showed higher latency (14.75 ms), but its cross-platform support makes it versatile for multi-framework deployments.
Frequently Asked Questions
A. PyTorch’s dynamic computation graph and efficient execution pipeline allow for low-latency inference (1.26 ms), making it well suited to applications such as recommendation systems and real-time predictions.
A. OpenVINO’s optimizations are designed for Intel hardware. Without this acceleration, its latency (144.84 ms) and memory utilization (1040.8 MB) were less competitive than those of the other frameworks.
A. For CPU-only setups, PyTorch was the most efficient in this benchmark. TensorFlow is a strong alternative for moderate workloads. Avoid frameworks like JAX unless higher CPU utilization is acceptable.
A. Framework performance depends heavily on hardware compatibility. For instance, OpenVINO excels on Intel CPUs with hardware-specific optimizations, while PyTorch and TensorFlow perform consistently across varied setups.
A. Yes, these results reflect a simple binary classification task. Performance may vary with complex architectures like ResNet or with tasks such as NLP, where these frameworks might leverage specialized optimizations.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.