I recently came across a post by Sebastian that caught my attention, and I wanted to dive deeper into its content. As models grow larger and more complex, efficiently managing memory during model loading becomes increasingly important, especially when working with limited GPU or CPU resources. In his post, Sebastian covers practical tips for loading larger pretrained or fine-tuned models in constrained memory environments, which is particularly relevant when working with PyTorch.
This guide emphasizes how to handle situations where models are saved using torch.save(model.state_dict(), "model.pth")
and later need to be loaded for continued pre-training or further fine-tuning. While the examples focus on a large language model (LLM), Sebastian's methods are broadly applicable to any PyTorch model. They also provide valuable insights into memory-efficient model weight loading in PyTorch, helping optimize memory usage during the loading process.
Overview
- Efficient memory management is crucial for loading large neural networks in PyTorch, especially on systems with limited GPU or CPU resources.
- Instead of loading the entire model at once, you can load weights incrementally. Normally, calling model.to(device) moves all of the model's parameters to the device (such as a GPU), which can consume significant memory.
- PyTorch introduced the "meta" device, which allows tensors to be created without allocating memory for their data.
- By using the meta device, you can load weights directly into GPU memory, bypassing the CPU and optimizing memory usage.
Initial Setup: Environment Check
Before diving into the specifics, let's make sure the required packages and versions are available. Here's a snippet that checks the PyTorch version and other useful tools.
from importlib.metadata import version

pkgs = [
    "torch",
]
for p in pkgs:
    print(f"{p} version: {version(p)}")
Benchmark Utilities for Memory Tracking
The first step is to set up a utility to track GPU memory (VRAM). Tracking memory usage helps in understanding how different methods impact the memory load during model loading and inference. Later, we will also track the system's RAM (CPU memory).
Here's the utility code for GPU memory tracking:
import gc
import time
import torch


def start_memory_tracking():
    """Initialize GPU memory tracking."""
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
    else:
        print("This notebook is intended for CUDA GPUs but CUDA is not available.")

def print_memory_usage():
    max_gpu_memory = torch.cuda.max_memory_allocated() / (1024 ** 3)  # Convert bytes to GB
    print(f"Maximum GPU memory allocated: {max_gpu_memory:.1f} GB")

def cleanup():
    gc.collect()
    torch.cuda.empty_cache()
    time.sleep(3)  # Allow time for memory to clear
    torch.cuda.reset_peak_memory_stats()
    max_memory_allocated = torch.cuda.max_memory_allocated() / (1024 ** 3)
    print(f"Maximum GPU memory allocated: {max_memory_allocated:.1f} GB")
These functions help track GPU memory usage before, during, and after model operations. The cleanup() function is especially useful for clearing unused memory to avoid running out of VRAM.
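Before moving on, here is a minimal usage sketch (not from the original post; the torch.rand tensor is just a hypothetical stand-in for loading a model) showing how these three utilities are meant to be combined:
start_memory_tracking()

big_tensor = torch.rand(10_000, 10_000, device="cuda")  # placeholder GPU workload

print_memory_usage()  # reports the peak VRAM observed since tracking started

del big_tensor
cleanup()             # frees cached memory and resets the peak counter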
Model Setup
Next, we set up the model. For demonstration, we'll use a large GPT-2 variant; the configuration below selects "gpt2-xl (1558M)", but you can adjust the model size to fit your memory constraints, from "gpt2-small" (124M parameters) up to "gpt2-xl" (1558M parameters).
Here's the configuration:
from previous_chapters import GPTModel

BASE_CONFIG = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "drop_rate": 0.0,        # Dropout rate
    "qkv_bias": True         # Query-key-value bias
}

model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

CHOOSE_MODEL = "gpt2-xl (1558M)"
BASE_CONFIG.update(model_configs[CHOOSE_MODEL])
This configuration allows flexibility in choosing models based on available memory resources. For lower memory consumption, opting for a smaller variant (like gpt2-small) is advisable.
Once the model configuration is set up, the next steps will dive into loading, managing, and optimizing the model weights for efficient memory usage.
Tracking GPU Memory During Model Loading
Let's now put the GPU memory tracking utilities into action. First, we initialize memory tracking and load the model to observe memory consumption. The code below tracks GPU memory usage as we load and run a GPT model.
start_memory_tracking()

model = GPTModel(BASE_CONFIG)
device = torch.device("cuda")
model.to(device)

print_memory_usage()
# Output: Maximum GPU memory allocated: 6.4 GB
This shows that loading and placing the model onto the GPU consumes around 6.4 GB of VRAM, which is typical for larger GPT-2 variants. However, this is just the initial setup.
Running the Model
To verify that everything works correctly, we pass a simple input tensor to the model. Although we aren't tracking memory during this step, it's important to check that the model operates as expected.
# Test if the model works (no need to track memory here)
test_input = torch.tensor([[1, 2, 3]]).to(device)
model.eval()

with torch.no_grad():
    model(test_input)
Saving the Model
Now, imagine we are pretraining the model (or fine-tuning it). For this example, we skip the actual pretraining step and directly save the initialized model. The following code saves the model's weights using torch.save().
# Training code would go here...

model.train()
torch.save(model.state_dict(), "model.pth")
Memory Cleanup
After saving the model, it's important to free up GPU memory to ensure efficient resource management in subsequent operations. By deleting the model and the test input tensor, and then running our cleanup() function, we clear up VRAM.
del model, test_input
cleanup()
# Output: Maximum GPU memory allocated: 0.0 GB
At this point, GPU memory usage is reset to zero, as expected.
Loading Pretrained Model Weights
The next step involves reloading the saved model weights to continue training or fine-tuning. However, loading pretrained weights requires more GPU memory than initializing a fresh model, because for a short time the weights exist on the device twice: once in the freshly instantiated model, and again in the loaded state_dict.
# Start tracking memory
start_memory_tracking()

# Recreate the model architecture
model = GPTModel(BASE_CONFIG)
model.to(device)

# Load the saved state_dict
model.load_state_dict(
    torch.load("model.pth", map_location=device, weights_only=True)
)
model.to(device)
model.eval()

print_memory_usage()
# Output: Maximum GPU memory allocated: 12.8 GB
GPU memory usage has now doubled compared to the initial load, peaking at 12.8 GB. This happens because, for a short period, both the original model and the newly loaded weights are held in GPU memory. Eventually, the loaded weights are copied into the model, and the temporary state_dict is discarded. However, this memory spike can cause problems when working with limited resources.
Resetting GPU Memory
After loading the model weights and testing it, it's essential to reset GPU memory once again. Testing the model ensures it works as expected, and clearing memory is crucial for efficient resource utilization.
# Test if the model works (no need to track memory here)
test_input = torch.tensor([[1, 2, 3]]).to(device)
model.eval()

with torch.no_grad():
    model(test_input)

del model, test_input
cleanup()
# Output: Maximum GPU memory allocated: 0.0 GB
This reset brings GPU memory usage back to zero, ensuring a clean state for future operations.
Loading Weights Sequentially
One effective workaround for the double-memory problem when loading model weights is sequential loading. Instead of loading both the model and the weights into GPU memory at once, we can load the model first, keep the weights in CPU memory, and then copy each parameter one by one to the GPU. This method significantly reduces peak GPU memory usage.
Here's how to implement sequential weight loading:
Step-by-Step Breakdown:
- Load the model onto the GPU: First, we load the model architecture into GPU memory, as usual.
- Load the weights onto the CPU: The model weights are loaded into CPU memory, avoiding the initial memory spike caused by moving both the model and the weights to the GPU.
- Copy weights parameter by parameter: Each weight is then copied sequentially from CPU to GPU, meaning that at no point do we have both the model and the full state_dict in GPU memory.
The code below demonstrates this approach:
start_memory_tracking()

# Load the model into GPU memory
model = GPTModel(BASE_CONFIG).to(device)

# Load the model's saved state_dict onto the CPU
state_dict = torch.load("model.pth", map_location="cpu", weights_only=True)

print_memory_usage()
# Output: Maximum GPU memory allocated: 6.4 GB

# Copy each parameter to GPU memory one by one
with torch.no_grad():
    for name, param in model.named_parameters():
        if name in state_dict:
            param.copy_(state_dict[name].to(device))
        else:
            print(f"Warning: {name} not found in state_dict.")

print_memory_usage()
# Output: Maximum GPU memory allocated: 6.7 GB
Memory Comparison:
- Initially, the model alone occupies 6.4 GB of GPU memory.
- As we copy each parameter sequentially, memory increases only slightly, peaking at 6.7 GB.
This is a much smaller peak than the 12.8 GB required when loading everything at once. By loading the weights sequentially, we avoid having both the full model and the full set of weights in GPU memory simultaneously.
Model Testing & Memory Reset:
After copying the weights, we test the model to ensure everything works as expected. Finally, we reset GPU memory to clear any lingering objects, just as we did in previous steps.
# Test if the model works (no need to track memory here)
test_input = torch.tensor([[1, 2, 3]]).to(device)
model.eval()

with torch.no_grad():
    model(test_input)

# Clean up GPU memory
del model, test_input, state_dict, param
cleanup()
# Output: Maximum GPU memory allocated: 0.0 GB
Loading the Model with Low CPU Memory
In the previous section, we reduced GPU memory usage by loading the model weights into CPU memory first and then sequentially copying them into the GPU. But what if the machine has limited CPU memory and ample GPU memory? To address this, we can use PyTorch's "meta" device approach, which is ideal for machines with constrained CPU resources.
Meta Device: A Smart Tradeoff
The "meta" device is a special device type in PyTorch that creates "meta" tensors. These tensors represent the shape and dtype of the data without allocating memory for the data itself. This allows us to define models without consuming CPU or GPU memory until necessary.
Using the meta device, we can first initialize the model without any memory allocation, and then load the model weights directly into GPU memory, bypassing the CPU.
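For intuition, here is a small standalone sketch (an illustration, not part of the original walkthrough) showing that meta tensors carry only shape and dtype, and that to_empty() is what later materializes real storage on a concrete device:
import torch

# Modules created under the "meta" device hold metadata only, no weight data.
with torch.device("meta"):
    layer = torch.nn.Linear(1024, 1024)

print(layer.weight.is_meta)   # True
print(layer.weight.shape)     # torch.Size([1024, 1024]), but no backing storage

# to_empty() allocates uninitialized storage on a real device; the values are
# garbage until real weights are copied in, which is what the sections below do.
layer = layer.to_empty(device="cpu")
print(layer.weight.is_meta)   # False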
Tracking CPU Memory Usage
Before we dive into the meta device approach, we'll define a utility function to track CPU memory usage:
import os
import psutil
from threading import Thread


def memory_usage_in_gb(func, *args, **kwargs):
    process = psutil.Process(os.getpid())
    baseline_mem = process.memory_info().rss / 1024 ** 3  # in GB

    mem_usage = []
    done = False

    def monitor_memory():
        while not done:
            mem_usage.append(process.memory_info().rss / 1024 ** 3)  # Convert to GB
            time.sleep(0.1)

    t = Thread(target=monitor_memory)
    t.start()

    func(*args, **kwargs)
    done = True
    t.join()

    peak_mem_usage_gb = max(mem_usage) - baseline_mem
    return peak_mem_usage_gb
Now that we can measure CPU memory usage, let's track the memory used during the sequential weight loading approach from the previous section:
def load_sequentially():
    start_memory_tracking()

    model = GPTModel(BASE_CONFIG).to(device)
    state_dict = torch.load("model.pth", map_location="cpu", weights_only=True)

    print_memory_usage()

    # Sequentially copy weights to the model's parameters
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in state_dict:
                param.copy_(state_dict[name].to(device))

    print_memory_usage()

peak_memory_used = memory_usage_in_gb(load_sequentially)
print(f"-> Maximum CPU memory allocated: {peak_memory_used:.1f} GB")
This approach outputs:
- Maximum GPU memory allocated: 6.7 GB
- Maximum CPU memory allocated: 6.3 GB
Meta Device Approach
To further reduce CPU memory usage, we can use the meta device to instantiate the model without allocating memory until we actually need it. Here's the implementation:
def load_sequentially_with_meta():
    start_memory_tracking()

    with torch.device("meta"):
        model = GPTModel(BASE_CONFIG)

    model = model.to_empty(device=device)
    state_dict = torch.load("model.pth", map_location=device, weights_only=True)

    print_memory_usage()

    # Sequentially copy weights to the model's parameters
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in state_dict:
                param.copy_(state_dict[name])

    print_memory_usage()

peak_memory_used = memory_usage_in_gb(load_sequentially_with_meta)
print(f"-> Maximum CPU memory allocated: {peak_memory_used:.1f} GB")
Memory usage with the meta device:
- Maximum GPU memory allocated: 12.8 GB
- Maximum CPU memory allocated: 1.3 GB
By using the meta device and loading the model weights directly into GPU memory, we drastically reduce CPU memory consumption from 6.3 GB to just 1.3 GB.
Comparison with Baseline
Finally, let's compare this method with the plain PyTorch weight loading approach, where no meta device or sequential loading is used:
def baseline():
    start_memory_tracking()

    model = GPTModel(BASE_CONFIG)
    model.to(device)

    model.load_state_dict(torch.load("model.pth", map_location=device, weights_only=True))
    model.to(device)
    model.eval()

    print_memory_usage()

peak_memory_used = memory_usage_in_gb(baseline)
print(f"-> Maximum CPU memory allocated: {peak_memory_used:.1f} GB")
For this approach:
- Maximum GPU memory allocated: 12.8 GB
- Maximum CPU memory allocated: 4.4 GB
Using mmap=True for Efficient Model Loading
For more advanced PyTorch users, there's an alternative way to handle memory constraints when loading large models: the mmap=True setting in torch.load(). This setting leverages memory-mapped file I/O, which allows tensor data to be read directly from disk without loading the whole file into RAM up front. This is particularly helpful on systems with limited CPU memory, as it minimizes the memory footprint during model loading.
What is mmap=True?
Memory-mapped I/O (mmap) is a mechanism that lets a file be read directly from disk by mapping it into the process's virtual address space. Instead of loading the entire model into RAM, PyTorch can page in parts of the file on demand, effectively reducing memory usage. This can be especially advantageous when dealing with large pretrained or fine-tuned models, such as GPT-2 or GPT-3, on machines with limited resources.
The mmap=True option can be passed to torch.load() to get this behavior.
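As a minimal standalone sketch (separate from the full example below, and assuming a recent PyTorch release where torch.load() accepts the mmap argument), the loaded state_dict behaves like an ordinary dict of CPU tensors, but the tensor data is paged in from disk lazily:
# Sketch only: memory-mapped checkpoint loading.
state_dict = torch.load("model.pth", map_location="cpu", weights_only=True, mmap=True)

# Individual entries are read from disk only as they are accessed or copied.
first_name = next(iter(state_dict))
print(first_name, state_dict[first_name].shape)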
Example Implementation of mmap=True
Let's see how the mmap=True option works in practice. Below is a sample implementation that loads the model using this setting:
def best_practices():
    with torch.device("meta"):
        model = GPTModel(BASE_CONFIG)

    model.load_state_dict(
        torch.load("model.pth", map_location=device, weights_only=True, mmap=True),
        assign=True
    )

    print_memory_usage()

peak_memory_used = memory_usage_in_gb(best_practices)
print(f"-> Maximum CPU memory allocated: {peak_memory_used:.1f} GB")
Results with mmap=True
- Maximum GPU memory allocated: 6.4 GB
- Maximum CPU memory allocated: 5.9 GB
Here, GPU memory usage stays efficient (6.4 GB), while CPU memory usage is fairly high because this machine has enough CPU RAM to support it. On a system with limited CPU RAM, however, the mmap=True approach would use less memory, since it avoids reading the full model into RAM.
When to Use mmap=True
The mmap=True option is especially useful in the following scenarios:
- Limited CPU RAM: the full state_dict cannot comfortably fit into system memory.
- Fast disk I/O: the checkpoint sits on fast storage, so reading data from disk on demand is cheap.
Performance Considerations
At first glance, the mmap=True approach may seem less efficient than the sequential weight loading approach. However, for machines with limited CPU memory, mmap=True can be a game-changer, providing an effective way to load large models without overwhelming the CPU's available memory.
By using mmap=True, you're trading disk access for memory availability, which helps in environments where memory is scarce but disk I/O is fast.
Other Methods for Model Weight Loading
So far, we've focused on simple, built-in methods for efficiently loading model weights in PyTorch, particularly when memory (either GPU or CPU) is constrained. The recommended method for managing limited CPU memory is the mmap=True approach described above.
However, if you're dealing with extreme memory limitations or need more control over the process, there's another brute-force approach: saving and loading each weight tensor separately.
Saving Model Weights Individually
Instead of saving the entire state_dict as a single file, this method stores each model parameter (tensor) separately. This lets you load parameters one at a time, avoiding the need to hold the entire state_dict in memory at once.
Here's how you can save the model weights individually:
model = GPTModel(BASE_CONFIG)
# Assume `model` is your trained model
state_dict = model.state_dict()

# Create a directory to store individual parameter files
os.makedirs("model_parameters", exist_ok=True)

# Save each parameter tensor separately
for name, param in state_dict.items():
    torch.save(param.cpu(), f"model_parameters/{name}.pt")

del model  # Free up memory
This breaks the model into individual components, saving each tensor to its own file in the model_parameters directory.
Loading Weights Individually
Now, let's see how to load these weights one by one to avoid overwhelming memory usage.
def load_individual_weights():
    start_memory_tracking()

    with torch.device("meta"):
        model = GPTModel(BASE_CONFIG)

    model = model.to_empty(device=device)

    print_memory_usage()

    param_dir = "model_parameters"

    with torch.no_grad():
        for name, param in model.named_parameters():
            weight_path = os.path.join(param_dir, f"{name}.pt")
            if os.path.exists(weight_path):
                param_data = torch.load(weight_path, map_location="cpu", weights_only=True)
                param.copy_(param_data.to(device))  # Move tensor to GPU
                del param_data  # Free memory after copying
            else:
                print(f"Warning: {name} not found in {param_dir}.")

    print_memory_usage()

# Measure the CPU-side peak as before
peak_memory_used = memory_usage_in_gb(load_individual_weights)
print(f"-> Maximum CPU memory allocated: {peak_memory_used:.1f} GB")
Results from Individual Weight Loading
- Maximum GPU memory allocated: 6.4 GB
- Maximum CPU memory allocated: 0.3 GB
The memory footprint here is significantly reduced, on both the GPU and the CPU. By loading weights individually, you ensure that no unnecessary memory is consumed at any stage, making this approach ideal for very memory-limited environments.
When to Use This Method
- Extreme memory limitations
When CPU and GPU memory are both severely constrained, this method offers precise control, ensuring that only one parameter tensor is loaded into memory at any given time.
On machines where you can't afford to use more than minimal resources, this brute-force method provides a way to load even the largest models.
Performance Considerations
The trade-off here is performance. Since each tensor is loaded separately, this method incurs extra disk I/O, which can slow down loading compared to approaches that read the entire model or larger chunks of data at once.
Conclusion
When working with large models, such as GPT variants or other deep learning models, memory efficiency is crucial. Techniques like sequential weight loading, using the meta device, and enabling mmap=True help reduce memory usage on both the CPU and GPU. These methods for memory-efficient model weight loading in PyTorch are highly versatile and can be adapted to the specific constraints of your hardware environment, whether you have limited CPU RAM, limited GPU VRAM, or both.
By applying these techniques, you can work with large models even on constrained hardware, ensuring smooth model training and fine-tuning workflows.
Hope you found the article useful! Memory-efficient model weight loading in PyTorch helps you make the most of limited resources; as a quick starting point, try torch.load() with memory mapping (mmap=True) to lower RAM usage.
Frequently Asked Questions
Why is memory-efficient model loading important?
As deep learning models grow larger (especially models like GPT-2 and GPT-3), loading them efficiently becomes essential to avoid running out of GPU or CPU memory. Memory-efficient loading lets you work with large models even in constrained environments.
How can I monitor GPU memory usage in PyTorch?
You can use torch.cuda.reset_peak_memory_stats() and torch.cuda.max_memory_allocated() to track GPU memory usage before, during, and after loading or training models. The utility functions provided above help monitor memory usage conveniently.
What is sequential weight loading?
Sequential weight loading involves loading the model architecture onto the GPU and then transferring weights one at a time from CPU to GPU. This reduces peak memory usage compared to loading both the model and its weights at once, helping manage limited GPU memory.
What other techniques help reduce memory usage with large models?
- Use lower precision: float16 or mixed precision (a short sketch of this tip follows the list).
- Optimize tensor operations: avoid unnecessary copies, use efficient shapes and views.
- Use gradient accumulation to update weights less frequently.
- Reduce model size: prune connections, quantize weights, or use smaller models.
- Optimize data loading: efficient data loaders, prefetching, memory-mapped files.
- Manage GPU memory: monitor usage, free unused memory, use multiple GPUs.
- Advanced techniques: knowledge distillation, low-rank approximation.
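As a quick, hypothetical sketch of the first tip (lower precision) applied to weight loading, reusing the GPTModel and BASE_CONFIG names from earlier and assuming your model tolerates bfloat16, casting the model before moving it to the GPU roughly halves its VRAM footprint:
# Sketch only: cast to bfloat16 on the CPU first, then move to the GPU,
# so the GPU never holds the full float32 copy of the weights.
model = GPTModel(BASE_CONFIG)
model.load_state_dict(torch.load("model.pth", map_location="cpu", weights_only=True))

model = model.to(torch.bfloat16)  # 2 bytes per parameter instead of 4
model = model.to("cuda")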
What is the "meta" device, and when should I use it?
The "meta" device lets you initialize models without allocating memory for their parameters. This is useful when you have limited CPU memory, since you can later load weights directly onto the GPU, bypassing the need for large memory allocations on the CPU.