Runtime Instructions
The following are the instructions needed to set up the node, the model infrastructure, and the full runtime for the model.
Accessing the Intel Gaudi Node
To access an Intel Gaudi node in the Intel Tiber AI Cloud, go to the Intel Tiber AI Cloud console, open the hardware instances page, select the Intel Gaudi 2 platform for deep learning, and follow the steps to start and connect to the node.
The console will provide an ssh command to log in to the node. It is advisable to add local port forwarding to that command so you can reach a Jupyter Notebook running on the node. For example: ssh -L 8888:localhost:8888 ..
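As a concrete sketch of the forwarding syntax, the login command can be composed as below. The host value is a placeholder, not a value from the console; substitute the address the console gives you:

```shell
# Placeholder -- substitute the user/host string from the Tiber AI Cloud console.
GAUDI_HOST="<node-address>"
LOCAL_PORT=8888    # port you will open in your local browser
REMOTE_PORT=8888   # port Jupyter will listen on, on the node
# -L maps localhost:$LOCAL_PORT on your machine to the node's $REMOTE_PORT.
SSH_CMD="ssh -L ${LOCAL_PORT}:localhost:${REMOTE_PORT} ${GAUDI_HOST}"
echo "$SSH_CMD"
```

Once logged in, starting Jupyter on the node with `jupyter notebook --no-browser --port 8888` lets you open the notebook at localhost:8888 in your local browser.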
Details about setting up Jupyter Notebooks on an Intel Gaudi Platform are available here.
Docker Setup
With access to the node, use the latest Intel Gaudi Docker image. The docker run command below automatically downloads the image, if it is not already present, and starts the container:
docker run -itd --name Gaudi_Docker \
  --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  --cap-add=sys_nice \
  --net=host --ipc=host \
  vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
The container is already running in the background (started with the -itd flags); enter it by issuing the following command:
docker exec -it Gaudi_Docker bash
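Inside the container, a quick way to confirm the Gaudi devices are visible is the hl-smi utility (the Gaudi analogue of nvidia-smi, which ships with the image). The guard below is a sketch so the check degrades gracefully if run outside the container:

```shell
# Print Gaudi device status if hl-smi is available; warn otherwise.
if command -v hl-smi >/dev/null 2>&1; then
  hl-smi && HPU_CHECK="ok" || HPU_CHECK="hl-smi reported an error"
else
  HPU_CHECK="hl-smi not found (run this inside the Gaudi container)"
fi
echo "$HPU_CHECK"
```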
More information on Gaudi Docker setup and validation can be found here.
Model Setup
Once the Docker environment is running, install the remaining libraries and model repositories.
Start in the home directory and install the DeepSpeed library. DeepSpeed reduces memory consumption on Intel Gaudi when running large language models.
cd ~
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0
Now install the Hugging Face Optimum for Intel Gaudi library and clone its GitHub examples, selecting the latest validated release of optimum-habana:
pip install optimum-habana==1.15.0
git clone -b v1.15.0 https://github.com/huggingface/optimum-habana
Finally, change to the language-modeling example directory and install the final set of requirements for running the model:
cd ~/optimum-habana/examples/language-modeling
pip install -r requirements.txt
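Before moving on, it can be worth confirming that the stack installed above is importable. The loop below is a sketch: it only reports versions and prints "missing" for anything it cannot import, rather than failing:

```shell
# Report the version of each key package; fall back to "missing" on import failure.
for pkg in optimum.habana transformers deepspeed; do
  v=$(python3 -c "import importlib; m=importlib.import_module('$pkg'); print(getattr(m, '__version__', '?'))" 2>/dev/null || echo missing)
  echo "$pkg: $v"
done
STACK_CHECK=done
```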
How to Access and Use the Llama 3 Model
Use of the pre-trained model is subject to compliance with third-party licenses, including the “META LLAMA 3 COMMUNITY LICENSE AGREEMENT”. For guidance on the intended use of the LLAMA 3 model, what constitutes misuse and out-of-scope use, who the intended users are, and additional terms, please review the license instructions. Users bear sole liability and responsibility for following and complying with any third-party licenses, and Habana Labs disclaims and will bear no liability with respect to users’ use of or compliance with third-party licenses. To run a gated model such as Llama-3-70b, perform the following steps:
- Have a Hugging Face account and agree to the terms of use of the model in its model card on the Hugging Face Hub
- Create a read token and request access to the Llama 3 model from meta-llama
- Log in to your account using the Hugging Face CLI:
huggingface-cli login --token <your_hugging_face_token_here>
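After logging in, you can confirm the token is recognized with the CLI's whoami subcommand. The guard below is a sketch so the check degrades gracefully if the CLI is not on the path:

```shell
# Confirm the CLI sees your token; warn instead of failing if anything is off.
if command -v huggingface-cli >/dev/null 2>&1; then
  huggingface-cli whoami && HF_AUTH="ok" || HF_AUTH="not logged in"
else
  HF_AUTH="huggingface-cli not found (it is installed with the huggingface_hub package)"
fi
echo "$HF_AUTH"
```

If whoami prints your account name, gated downloads of the Llama 3 weights should succeed, provided your access request on the model card has been approved.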
To run the associated Jupyter Notebook for fine-tuning, see the running and fine-tuning addendum section for Jupyter Notebook setup. You can then run these steps directly in the Jupyter interface.