Unlocking the power of domain-specific Large Language Models like Microsoft Phi-4 requires the ability to fine-tune these models for specialized tasks. Fine-tuning Phi-4 on custom datasets helps tailor the model to perform optimally in specific domains, such as customer support, medical advice, or technical documentation. By leveraging LoRA (Low-Rank Adaptation) adapters, this process becomes more efficient, allowing for faster training and reduced resource consumption. This guide walks you through the essential steps to fine-tune Phi-4 using LoRA adapters, integrate the model with Hugging Face for easy sharing, and apply the latest techniques to get the most out of your custom LLM.
Learning Objectives
- Learn how to fine-tune Microsoft Phi-4 using LoRA adapters for domain-specific tasks.
- Understand the setup process and configuration for loading Phi-4 efficiently with 4-bit quantization.
- Gain proficiency in preparing datasets and transforming them for fine-tuning with Hugging Face and unsloth.
- Master training techniques using Hugging Face's SFTTrainer to optimize model performance.
- Explore how to monitor GPU usage and save/upload fine-tuned models to Hugging Face for deployment.
This article was published as a part of the Data Science Blogathon.
Prerequisites
Before diving into fine-tuning Phi-4, make sure you have the necessary tools and environment configured. This includes installing Python 3.8+, PyTorch with CUDA support for GPU acceleration, and the unsloth library, along with Hugging Face Transformers and Datasets for seamless dataset handling and model integration. Having these prerequisites in place will ensure a smooth and efficient fine-tuning process.
Ensure you have the following installed:
- Python 3.8+
- PyTorch (with CUDA support for GPU acceleration)
- unsloth
- Hugging Face Transformers and Datasets
Install the required libraries with:
pip install unsloth
pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
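After installation, a quick sanity check (a minimal sketch, not part of the original walkthrough, assuming a CUDA-capable GPU is available) confirms that the libraries import correctly and a GPU is visible:
# Quick environment check: verify imports and GPU availability before fine-tuning.
import torch
from unsloth import FastLanguageModel

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")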
Fine-Tuning Phi-4: A Step-by-Step Guide
This section covers all the essential steps involved in fine-tuning Microsoft Phi-4, from setting up the environment to pushing the fine-tuned model to Hugging Face. It includes configuring the model, preparing the dataset, training, monitoring GPU usage, generating responses, and saving/uploading the model.
Step 1: Setting Up the Model
Below, we set up the model by importing the dependencies and loading the model:
Load the Model with LoRA Adapters
LoRA adapters enable parameter-efficient fine-tuning by training only a small subset of model parameters.
Importing Dependencies
from unsloth import FastLanguageModel
import torch
- FastLanguageModel: A utility class from the unsloth library for working with language models, including loading and fine-tuning.
- torch: The PyTorch library for deep learning operations, providing GPU acceleration.
Configuration Settings
max_seq_length = 2048
load_in_4bit = True
- max_seq_length: Specifies the maximum length of input sequences. Models like Phi-4 are designed to handle long sequences efficiently, making this setting important.
- load_in_4bit: Loads the model with 4-bit quantization, reducing memory usage and improving inference speed.
Loading the Phi-4 Model
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Phi-4",
max_seq_length=max_seq_length,
load_in_4bit=load_in_4bit,
)
- model_name: Refers to the pre-trained Phi-4 model hosted by unsloth.
- from_pretrained: Downloads and initializes the model and tokenizer with the specified configurations.
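As an optional check (a small sketch, not from the original article, assuming the returned model exposes the standard Transformers get_memory_footprint() method), you can confirm how much GPU memory the 4-bit model occupies:
# Optional: report the quantized model's approximate memory footprint in GB.
print(f"Model memory footprint: {model.get_memory_footprint() / 1024**3:.2f} GB")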
Applying LoRA Adapters
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
lora_alpha=16,
lora_dropout=0,
bias="none",
use_gradient_checkpointing="unsloth",
random_state=3407,
)
- get_peft_model: A method to integrate LoRA adapters into the model for parameter-efficient fine-tuning.
- r=16: Sets the rank of the LoRA layers, controlling the dimensionality of the additional trainable parameters.
- target_modules: Specifies the model layers where LoRA adapters will be applied. These layers correspond to key components of the model's transformer architecture.
- lora_alpha: A scaling factor for the LoRA layers to stabilize training.
- lora_dropout: Dropout probability for regularization; set to 0 for no dropout.
- bias="none": Indicates that no additional bias terms are introduced.
- use_gradient_checkpointing: Activates gradient checkpointing to reduce memory usage during backpropagation.
- random_state=3407: Ensures reproducibility by fixing the random seed.
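To see how small that trainable subset actually is, you can print the trainable parameter count (a minimal sketch, assuming the returned PEFT model exposes the standard print_trainable_parameters() method):
# Show how many parameters LoRA makes trainable compared to the full model.
model.print_trainable_parameters()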
Step 2: Preparing the Dataset
We use the FineTome-100k dataset in ShareGPT format. The unsloth library provides utilities to convert this format into Hugging Face's generic format for multi-turn conversations.
Load the Dataset
from datasets import load_dataset
from unsloth.chat_templates import standardize_sharegpt, get_chat_template
dataset = load_dataset("mlabonne/FineTome-100k", split="train")
Hugging Face's datasets library loads the mlabonne/FineTome-100k dataset, and the split="train" argument ensures that only the training split is loaded.
Standardize the Dataset
dataset = standardize_sharegpt(dataset)
The standardize_sharegpt function from the unsloth.chat_templates module converts the ShareGPT-style data into Hugging Face's generic conversation format. This ensures that the dataset adheres to the expected structure for multi-turn conversations.
Apply Phi-4 Chat Template
tokenizer = get_chat_template(tokenizer, chat_template="phi-4")
The get_chat_template function customizes the tokenizer to use the "phi-4" chat template. This ensures the prompts and conversations align with Phi-4's format.
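If you want to see what this template looks like, the tokenizer's chat_template attribute (standard on Hugging Face tokenizers) holds the Jinja template string after this call; a small optional check:
# Inspect the Jinja chat template now attached to the tokenizer.
print(tokenizer.chat_template)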
Format Prompts for Training
def formatting_prompts_func(examples):
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in examples["conversations"]
    ]
    return {"text": texts}
The formatting_prompts_func processes each example in the dataset:
- The examples["conversations"] field contains the conversation data.
- Each conversation (convo) is passed through tokenizer.apply_chat_template.
- tokenize=False ensures the output is not tokenized yet.
- add_generation_prompt=False avoids appending generation-specific tokens to the prompts at this stage.
- The formatted text is stored under the "text" field.
Map the Function over the Dataset
dataset = dataset.map(formatting_prompts_func, batched=True)
The map function applies formatting_prompts_func to the entire dataset in batches, efficiently preprocessing the data for fine-tuning.
Let's look at how the conversations are structured for item 5:
dataset[5]["conversations"]
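To confirm the chat template was applied, we can also print the formatted training text for the same item (a small sketch; the exact output depends on the dataset contents):
# Compare the raw conversation above with the template-formatted training text.
print(dataset[5]["text"])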
Step 3: Fine-Tuning the Model
Fine-tuning the model involves training Phi-4 with Hugging Face's SFTTrainer, optimizing the process with custom settings and efficient data handling.
Training with SFTTrainer
We use Hugging Face's SFTTrainer to train the model. Below is a minimal setup for efficient training:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported
- SFTTrainer: A specialized trainer for supervised fine-tuning of language models.
- TrainingArguments: Defines training hyperparameters such as batch size, learning rate, and number of steps.
- DataCollatorForSeq2Seq: Prepares input data for sequence-to-sequence models.
- is_bfloat16_supported: Checks whether the hardware supports bfloat16, a mixed-precision format.
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=max_seq_length,
data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
dataset_num_proc=2,
args=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
warmup_steps=5,
max_steps=30,
learning_rate=2e-4,
fp16=not is_bfloat16_supported(),
bf16=is_bfloat16_supported(),
logging_steps=1,
optim="adamw_8bit",
weight_decay=0.01,
output_dir="outputs",
report_to="none",
),
)
Trainer Initialization:
- model and tokenizer: The language model and its tokenizer, configured in the previous steps.
- train_dataset: The dataset used for training, preprocessed and formatted earlier.
- dataset_text_field: Specifies the field in the dataset containing the text.
- max_seq_length: The maximum sequence length for tokenized inputs.
- data_collator: Ensures input data is properly batched and padded.
- dataset_num_proc: Parallelizes dataset processing for efficiency.
Training Arguments:
- per_device_train_batch_size: Batch size per device during training (set to 2 here).
- gradient_accumulation_steps: Simulates a larger batch size by accumulating gradients over multiple steps.
- warmup_steps: Steps for learning-rate warmup, helping stabilize training.
- max_steps: Total number of training steps (30 here, indicating a short training run).
- learning_rate: Learning rate for the optimizer.
- fp16 and bf16: Enable mixed precision (FP16 or BF16) based on hardware support for faster, memory-efficient training.
- logging_steps: Frequency of logging during training.
- optim: Optimizer choice; adamw_8bit reduces memory usage.
- weight_decay: Regularization parameter to prevent overfitting.
- output_dir: The directory where model checkpoints and logs are saved.
- report_to: Disables reporting to external monitoring tools (e.g., WandB).
Purpose:
This setup efficiently fine-tunes a large model on a custom dataset, focusing on:
- Memory optimization (e.g., mixed precision, 8-bit optimizers).
- Efficient training configurations with a small batch size and gradient accumulation.
- Short, lightweight training for quick experimentation or domain adaptation.
Masking User Inputs
To train only on the assistant responses and ignore the loss on the user's inputs, we wrap the trainer with Unsloth's train_on_responses_only utility:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
trainer,
instruction_part="<|im_start|>person<|im_sep|>",
response_part="<|im_start|>assistant<|im_sep|>",
)
Let's verify that the masking was actually applied:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])
space = tokenizer(" ", add_special_tokens=False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])
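With the trainer configured and the masking verified, training is started with a single call (the walkthrough above does not show this step explicitly; a minimal sketch):
# Start fine-tuning; the returned object holds training metrics such as loss and runtime.
trainer_stats = trainer.train()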
Step 4: Monitoring GPU Utilization
Check GPU memory usage before and after training:
import torch
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
Step 5: Inference
Generate responses using the fine-tuned model.
Defining the Input Messages:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
tokenizer,
chat_template = "phi-4",
)
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference
messages = [
{"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]
The input is structured as a list of message dictionaries. Each dictionary specifies the role (e.g., "user") and the content (e.g., the user's query).
This approach supports multi-turn conversations, aligning with the model's chat-based functionality.
Preprocessing Inputs with the Tokenizer
inputs = tokenizer.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_tensors="pt",
).to("cuda")
- apply_chat_template: Prepares the input for the Phi-4 model using the tokenizer and ensures compatibility with the chat format.
Parameters:
- tokenize=True: Converts the text into token IDs.
- add_generation_prompt=True: Adds a special prompt token to guide the model's response generation.
- return_tensors="pt": Converts the processed data into PyTorch tensors for GPU processing.
- .to("cuda"): Moves the data to the GPU for accelerated computation.
Generating Text:
outputs = model.generate(
input_ids=inputs, max_new_tokens=64, use_cache=True, temperature=1.5, min_p=0.1
)
Parameters:
- input_ids=inputs: The tokenized input.
- max_new_tokens=64: Limits the length of the generated output to 64 tokens.
- use_cache=True: Speeds up generation by using cached activations.
- temperature=1.5: Controls randomness in the output (higher values = more creative, less deterministic).
- min_p=0.1: Ensures diversity by setting a minimum probability threshold for token sampling.
Decoding and Displaying the Output:
print(tokenizer.batch_decode(outputs))
- Decodes the generated token IDs back into human-readable text using the tokenizer.
- We use batch_decode because the output may contain multiple sequences.
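Note that the decoded string includes the prompt as well as the completion. A small optional sketch (not from the original article) for printing only the newly generated tokens is to slice off the prompt length:
# Decode only the tokens generated after the prompt.
generated_only = outputs[:, inputs.shape[1]:]
print(tokenizer.batch_decode(generated_only, skip_special_tokens=True))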
To stream the response token by token instead of waiting for the full output, we can pass a TextStreamer to generate:
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference
messages = [
{"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
messages,
tokenize = True,
add_generation_prompt = True,  # Must add for generation
return_tensors = "pt",
).to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(
input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
use_cache = True, temperature = 1.5, min_p = 0.1
)
Step 6: Saving and Uploading the Model
Save locally or push to Hugging Face:
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
To upload to Hugging Face:
model.push_to_hub_merged("hf/model", tokenizer, save_method="lora", token="<your_hf_token>")
This code pushes the model to the Hugging Face Hub using the LoRA save method for efficient storage, and it also uploads the associated tokenizer. You need a valid Hugging Face authentication token (<your_hf_token>) for the upload to succeed.
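To use the saved adapters later, they can be loaded back much like the base model was loaded earlier (a minimal sketch, assuming the lora_model directory saved above and that FastLanguageModel.from_pretrained accepts a local adapter path, as in Unsloth's examples):
# Reload the fine-tuned LoRA model from the local directory for later inference.
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="lora_model",
    max_seq_length=max_seq_length,
    load_in_4bit=load_in_4bit,
)
FastLanguageModel.for_inference(model)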
Conclusion
Fine-tuning Microsoft Phi-4 locally and pushing it to Hugging Face allows developers to create highly specialized models efficiently. With tools like Unsloth, LoRA adapters, and Hugging Face, the process becomes accessible and scalable. Try it out with your own dataset today!
Key Takeaways
- Fine-tuning Microsoft Phi-4 with LoRA adapters optimizes domain-specific performance while saving computational resources.
- The Unsloth library simplifies integrating LoRA adapters and working with Hugging Face datasets.
- Efficient dataset transformation and tokenization are critical for preparing data to fine-tune Phi-4 on custom tasks.
- Training with Hugging Face's SFTTrainer and advanced settings allows for fast, memory-efficient fine-tuning.
- Uploading fine-tuned models to Hugging Face enables easy sharing and deployment for specialized applications.
Frequently Asked Questions
Q1. What is Microsoft Phi-4, and why fine-tune it on a custom dataset?
A. Microsoft Phi-4 is a large language model (LLM) optimized for language understanding and generation tasks. Fine-tuning it on a custom dataset enables domain-specific performance, tailoring the model to specialized applications such as customer service, technical documentation, or niche industries.
Q2. What are LoRA adapters, and why use them?
A. LoRA (Low-Rank Adaptation) adapters allow efficient fine-tuning by training only a subset of model parameters instead of the entire model. This reduces computational requirements and memory usage, making it ideal for large models like Phi-4.
Q3. What are the prerequisites for fine-tuning Phi-4?
A. Key requirements include Python 3.8+, PyTorch with CUDA support, the unsloth library for streamlined workflows, and Hugging Face Transformers and Datasets for dataset handling and training.
Q4. How do I prepare a dataset for fine-tuning?
A. Use a dataset like FineTome-100k in ShareGPT format. Convert and standardize the dataset using unsloth utilities to ensure compatibility with Hugging Face's multi-turn conversation template.
Q5. How do I upload the fine-tuned model to Hugging Face?
A. Save your fine-tuned model and tokenizer locally, then use the .push_to_hub_merged() method from unsloth to upload the model and tokenizer to Hugging Face with your authentication token.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.