Today, we’re announcing the general availability of Amazon SageMaker HyperPod recipes to help data scientists and developers of all skill sets get started training and fine-tuning foundation models (FMs) in minutes with state-of-the-art performance. They can now access optimized recipes for training and fine-tuning popular publicly available FMs such as Llama 3.1 405B, Llama 3.2 90B, or Mixtral 8x22B.
At AWS re:Invent 2023, we launched SageMaker HyperPod to reduce time to train FMs by up to 40 percent and scale across more than a thousand compute resources in parallel with preconfigured distributed training libraries. With SageMaker HyperPod, you can find the required accelerated compute resources for training, create the most optimal training plans, and run training workloads across different blocks of capacity based on the availability of compute resources.
SageMaker HyperPod recipes include a training stack tested by AWS, removing the tedious work of experimenting with different model configurations and eliminating weeks of iterative evaluation and testing. The recipes automate several critical steps, such as loading training datasets, applying distributed training techniques, automating checkpoints for faster recovery from faults, and managing the end-to-end training loop.
With a simple recipe change, you can seamlessly switch between GPU- or Trainium-based instances to further optimize training performance and reduce costs. You can easily run workloads in production on SageMaker HyperPod or SageMaker training jobs.
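For example, moving the same workload from GPU to Trainium instances could be as small an edit as swapping the instance type and recipe name in config.yaml. The Trainium recipe name below is an assumption for illustration; browse the recipes repository for the exact identifiers.

```yaml
# Hypothetical config.yaml edit switching from GPU to Trainium.
# The trn1 recipe name is illustrative; check the repository for real names.
defaults:
  - cluster: slurm
  - recipes: fine-tuning/llama/hf_llama3_70b_seq8k_trn1_fine_tuning  # was a gpu recipe
instance_type: ml.trn1.32xlarge  # was ml.p5.48xlarge
```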
SageMaker HyperPod recipes in motion
To get started, visit the SageMaker HyperPod recipes GitHub repository to browse training recipes for popular publicly available FMs.
You only need to edit simple recipe parameters to specify an instance type and the location of your dataset in the cluster configuration, then run the recipe with a single-line command to achieve state-of-the-art performance.
After cloning the repository, edit the recipe config.yaml file to specify the model and cluster type.
$ git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
$ cd sagemaker-hyperpod-recipes
$ pip3 install -r requirements.txt
$ cd ./recipes_collection
$ vim config.yaml
The recipes support SageMaker HyperPod with Slurm, SageMaker HyperPod with Amazon Elastic Kubernetes Service (Amazon EKS), and SageMaker training jobs. For example, you can set up a cluster type (Slurm orchestrator), a model name (Meta Llama 3.1 405B language model), an instance type (ml.p5.48xlarge), and your data locations, such as where to store the training data, results, logs, and so on.
defaults:
  - cluster: slurm # support: slurm / k8s / sm_jobs
  - recipes: fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora # name of model to be trained
debug: False # set to True to debug the launcher configuration
instance_type: ml.p5.48xlarge # or other supported cluster instances
base_results_dir: # Location(s) to store the results, checkpoints, logs, etc.
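Because the recipe configuration is plain YAML, you can sanity-check an edited config locally before launching anything. The snippet below is a hedged convenience sketch, not part of the SageMaker HyperPod recipes tooling; the keys mirror the fragment above and the results path is a placeholder.

```python
# Minimal local sanity check for an edited config.yaml
# (illustrative only; not part of the recipes tooling).
import yaml

config_text = """
defaults:
  - cluster: slurm
  - recipes: fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora
debug: False
instance_type: ml.p5.48xlarge
base_results_dir: /fsx/results  # placeholder path
"""

config = yaml.safe_load(config_text)

# Confirm the fields the launcher relies on are present and well formed.
assert config["instance_type"].startswith("ml."), "expected a SageMaker instance type"
assert any("cluster" in entry for entry in config["defaults"]), "missing cluster entry"
print(config["instance_type"])
```

A check like this catches indentation or typo mistakes before they surface as a failed cluster launch.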
You can optionally adjust model-specific training parameters in this YAML file, which outlines the optimal configuration, including the number of accelerator devices, instance type, training precision, parallelization and sharding techniques, the optimizer, and logging to monitor experiments through TensorBoard.
run:
  name: llama-405b
  results_dir: ${base_results_dir}/${.name}
  time_limit: "6-00:00:00"
  restore_from_path: null
trainer:
  devices: 8
  num_nodes: 2
  accelerator: gpu
  precision: bf16
  max_steps: 50
  log_every_n_steps: 10
  ...
exp_manager:
  exp_dir: # location for TensorBoard logging
  name: helloworld
  create_tensorboard_logger: True
  create_checkpoint_callback: True
  checkpoint_callback_params:
    ...
  auto_checkpoint: True # for automated checkpointing
use_smp: True
distributed_backend: smddp # optimized collectives
# Start training from pretrained model
model:
  model_type: llama_v3
  train_batch_size: 4
  tensor_model_parallel_degree: 1
  expert_model_parallel_degree: 1
  # other model-specific params
To run this recipe in SageMaker HyperPod with Slurm, you must prepare the SageMaker HyperPod cluster following the cluster setup instructions.
Then, connect to the SageMaker HyperPod head node, access the Slurm controller, and copy the edited recipe. Next, run a helper file to generate a Slurm submission script for the job, which you can use for a dry run to examine the content before starting the training job.
$ python3 main.py --config-path recipes_collection --config-name=config
After training is complete, the trained model is automatically saved to your assigned data location.
To run this recipe on SageMaker HyperPod with Amazon EKS, clone the recipe from the GitHub repository, install the requirements, and edit the recipe (cluster: k8s) on your laptop. Then, create a link between your laptop and the running EKS cluster and use the HyperPod Command Line Interface (CLI) to run the recipe.
$ hyperpod start-job --recipe fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora \
  --persistent-volume-claims fsx-claim:data \
  --override-parameters \
  '{
    "recipes.run.name": "hf-llama3-405b-seq8k-gpu-qlora",
    "recipes.exp_manager.exp_dir": "/data/<your_exp_dir>",
    "cluster": "k8s",
    "cluster_type": "k8s",
    "container": "658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121",
    "recipes.model.data.train_dir": "<your_train_data_dir>",
    "recipes.model.data.val_dir": "<your_val_data_dir>"
  }'
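The --override-parameters value is a single JSON string, which is easy to get wrong by hand (for example, a trailing comma makes it invalid JSON). One way to avoid that is to build the string programmatically; the sketch below is an assumption-labeled convenience, with placeholder paths instead of the elided values above.

```python
# Build the --override-parameters JSON string programmatically
# (illustrative; keys mirror the CLI example above, paths are placeholders).
import json
import shlex

overrides = {
    "recipes.run.name": "hf-llama3-405b-seq8k-gpu-qlora",
    "recipes.exp_manager.exp_dir": "/data/my-exp",   # placeholder
    "cluster": "k8s",
    "cluster_type": "k8s",
    "recipes.model.data.train_dir": "/data/train",   # placeholder
    "recipes.model.data.val_dir": "/data/val",       # placeholder
}

# json.dumps guarantees valid JSON: proper quoting, no trailing commas.
payload = json.dumps(overrides)

# shlex.quote makes the string safe to splice into a shell command line.
arg = shlex.quote(payload)
print(json.loads(payload)["cluster"])
```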
You can also run a recipe on SageMaker training jobs using the SageMaker Python SDK. The following example runs PyTorch training scripts on SageMaker training jobs, overriding the training recipe.
...
recipe_overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "exp_dir": "",
        "explicit_log_dir": "/opt/ml/output/tensorboard",
        "checkpoint_dir": "/opt/ml/checkpoints",
    },
    "model": {
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/val",
        },
    },
}

pytorch_estimator = PyTorch(
    output_path=<output_path>,
    base_job_name="llama-recipe",
    role=<role>,
    instance_type="ml.p5.48xlarge",
    training_recipe="fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora",
    recipe_overrides=recipe_overrides,
    sagemaker_session=sagemaker_session,
    tensorboard_output_config=tensorboard_output_config,
)
...
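The recipe_overrides argument is a nested dictionary whose values replace the matching keys in the recipe's defaults while leaving everything else untouched. The merge itself happens inside the SageMaker Python SDK; the recursive update below is a simplified illustration of that semantics, not the SDK's actual code.

```python
# Simplified illustration of how nested recipe_overrides values replace
# matching keys in a recipe's defaults. The real merge is performed by the
# SageMaker Python SDK; this helper is an assumption made for clarity.
def deep_update(base: dict, overrides: dict) -> dict:
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_update(merged[key], value)  # recurse into nested dicts
        else:
            merged[key] = value  # leaf values replace defaults outright
    return merged

# Toy recipe defaults and overrides mirroring the example above.
recipe_defaults = {
    "run": {"results_dir": "/results", "time_limit": "6-00:00:00"},
    "model": {"data": {"train_dir": "/train"}, "train_batch_size": 4},
}
overrides = {
    "run": {"results_dir": "/opt/ml/model"},
    "model": {"data": {"train_dir": "/opt/ml/input/data/train"}},
}

merged = deep_update(recipe_defaults, overrides)
print(merged["run"]["results_dir"])         # overridden
print(merged["run"]["time_limit"])          # default preserved
print(merged["model"]["train_batch_size"])  # default preserved
```

With the estimator constructed, a call such as pytorch_estimator.fit(inputs=..., wait=True) would then start the training job.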
As training progresses, the model checkpoints are saved to Amazon Simple Storage Service (Amazon S3) with the fully automated checkpointing capability, enabling faster recovery from training faults and instance restarts.
Now available
Amazon SageMaker HyperPod recipes are now available in the SageMaker HyperPod recipes GitHub repository. To learn more, visit the SageMaker HyperPod product page and the Amazon SageMaker AI Developer Guide.
Give SageMaker HyperPod recipes a try and send feedback to AWS re:Post for SageMaker or through your usual AWS Support contacts.
— Channy