Pre-Training the BERT 1.5B model with DeepSpeed

In this post, we show you how to run Habana’s DeepSpeed-enabled BERT 1.5B model from our Model-References repository.

This BERT model is based on Habana’s existing BERT model, scaled up to a larger size. The focus of this example is pre-training on the Wikipedia dataset. Once trained, the model can be fine-tuned on different datasets for multiple types of tasks such as question answering, translation, or text generation.

This model follows the standard BERT architecture, with 48 transformer layers, a hidden size of 1,600, and 25 attention heads. Together, these settings yield a model with roughly 1.5 billion parameters.
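As a rough sanity check of the 1.5 billion figure, the sketch below counts the weights and biases of a BERT-style encoder. The 48 layers and 1,600-wide hidden dimension come from the text above; the vocabulary size (30,522), maximum position count (512), two segment types, and a feed-forward width of 4x the hidden size are assumptions borrowed from common BERT configurations, not values confirmed by this post.

```python
def bert_param_count(num_layers, hidden, vocab=30522, max_pos=512,
                     type_vocab=2, ffn_mult=4):
    """Approximate parameter count of a BERT-style encoder (weights + biases).

    vocab, max_pos, type_vocab and ffn_mult are assumed defaults, not values
    taken from the BERT 1.5B configuration file.
    """
    inter = ffn_mult * hidden
    # Embeddings: token, position and segment tables plus one LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + 2 * hidden
    # Self-attention: Q, K, V and output projections, each hidden x hidden + bias.
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward block: hidden -> inter -> hidden, with biases.
    feed_forward = hidden * inter + inter + inter * hidden + hidden
    # Two LayerNorms per layer, each with a gain and a bias vector.
    layer_norms = 2 * 2 * hidden
    per_layer = attention + feed_forward + layer_norms
    return embeddings + num_layers * per_layer


total = bert_param_count(num_layers=48, hidden=1600)
print(f"{total / 1e9:.2f}B parameters")
```

Under these assumptions the encoder layers dominate the total, and the embedding tables contribute only a few percent. The number of attention heads does not change the count; it only determines how the hidden dimension is split.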

Models of this size may no longer fit in a single device’s memory, which causes out-of-memory errors. In this example, we use Habana’s fork of the DeepSpeed library to manage memory-heavy training state, including optimizer states and gradients, so that the model fits on Gaudi and runs with the best performance.

DeepSpeed is an open-source deep learning optimization library for PyTorch that is designed to reduce compute and memory use and to train large, distributed models with better parallelism. DeepSpeed includes the Zero Redundancy Optimizer (ZeRO) for training models; the details of ZeRO are discussed later in this document. The DeepSpeed library sits between the user’s model and the PyTorch framework, so only minimal changes are needed to an existing PyTorch model. For more information on how Habana uses the ZeRO optimizer, you can refer to our previous blog on Memory-Efficient training here.

Pretraining diagram

 

You can follow these simple steps to get up and running on first-gen Gaudi® or Gaudi®2. The first step is to set up the environment, which includes setting up an instance of Gaudi devices, the SynapseAI Software stack, the additional software requirements including Habana’s version of the DeepSpeed library, and then downloading the appropriate dataset for training.   The second step is to run the model itself.

Set Up the Environment

Set up your cloud computing environment to get access to the Gaudi accelerator.  There are two  options available in the cloud today:

  • Amazon EC2 DL1 Instances: based on first-gen Gaudi
    • Users can refer to Habana’s quick start guide here for instructions on how to start a DL1 instance; an AWS user account is required.
  • Intel Developer Cloud using Gaudi2
    • Instructions are provided on the Developer page; a user account will need to be created.

Both instance types provide eight Gaudi devices. Make sure the instance has about 1 TB of storage so that there is room to download the dataset and to run training; the dataset download alone may take several hours. We recommend selecting Ubuntu 20.04 as the base OS and including the full Habana SynapseAI software stack and drivers in the image. Once the environment is set up, pull the PyTorch Docker image from Habana’s Vault using the command below. See the Habana Installation Guide, Driver and Software Installation, for more details.

docker pull vault.habana.ai/gaudi-docker/1.7.1/ubuntu20.04/habanalabs/pytorch-installer-1.13.0:latest

 

~$ docker pull vault.habana.ai/gaudi-docker/1.7.1/ubuntu20.04/habanalabs/pytorch-installer-1.13.0:latest
latest: Pulling from gaudi-docker/1.7.1/ubuntu20.04/habanalabs/pytorch-installer-1.13.0
846c0b181fff: Pull complete
6ae2e14c2539: Downloading [=====>                                             ]  24.85MB/234.5MB
3fe89580045c: Download complete
1ac28d0b180f: Download complete
c993c6ef6fe8: Downloading [==============================================>    ]  27.71MB/30.02MB
6d0e3696a459: Downloading [==============================>                    ]  2.105MB/3.479MB
05d721556cd2: Waiting
a39276db8326: Pulling fs layer
930126633dfb: Waiting
---
ba5ed4a169e5: Download complete

Then start a container from the pulled image:

docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host \
vault.habana.ai/gaudi-docker/1.7.1/ubuntu20.04/habanalabs/pytorch-installer-1.13.0:latest

 

Next,  clone Habana’s Model-References repository to get access to the BERT 1.5B model and install the associated software to run the model.

git clone -b 1.7.1 https://github.com/HabanaAI/Model-References
export PYTHONPATH=/root/Model-References:$PYTHONPATH
export PYTHON=/usr/bin/python3.8
cd /root/Model-References/PyTorch/nlp/pretraining/deepspeed-bert/
pip install -r ./requirements.txt

 

root@ubuntu2004:~# cd ~
root@ubuntu2004:~# git clone -b 1.7.1 https://github.com/HabanaAI/Model-References
Cloning into 'Model-References'...
remote: Enumerating objects: 15256, done.
remote: Counting objects: 100% (15255/15255), done.
remote: Compressing objects: 100% (6660/6660), done.
remote: Total 15256 (delta 8238), reused 15132 (delta 8146), pack-reused 1
Receiving objects: 100% (15256/15256), 101.59 MiB | 8.08 MiB/s, done.
Resolving deltas: 100% (8238/8238), done.

Then install Habana’s DeepSpeed fork that includes additional optimizations for performance and functionality on the Gaudi HPU.

pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.7.1

 

root@ubuntu2004:~/Model-References/PyTorch/nlp/pretraining/deepspeed-bert# pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.7.1
Collecting git+https://github.com/HabanaAI/DeepSpeed.git@1.7.1
  Building wheels for collected packages: deepspeed
  Building wheel for deepspeed (setup.py) ... done
  Created wheel for deepspeed:
Successfully installed deepspeed-0.7.0+309ca18 hjson-3.1.0 psutil-5.9.4 py-cpuinfo-9.0.0 pydantic-1.10.4

Finally, download and prepare the dataset for training. The script below downloads the Wikipedia dataset as a starting point for running the model. The download requires over 600 GB of storage, so make sure your cloud instance was created with about 1 TB of storage to leave room for both the dataset and runtime execution. The script downloads the dataset, establishes baseline weights for the model (instead of starting from a set of random values), formats and shards the text files, and downloads the .txt files needed for training.
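Because the download is long and storage-hungry, it can help to confirm the free space before starting. The helper below is a minimal sketch using only the Python standard library; the function name and the 600 GB threshold passed to it are illustrative choices, not part of the Model-References scripts.

```python
import shutil


def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1e9


# Check the current directory against the roughly 600 GB the dataset needs.
if has_free_space(".", 600):
    print("Enough free space for the Wikipedia dataset download.")
else:
    print("Less than 600 GB free; attach more storage before downloading.")
```

Run it from the directory where the dataset will be stored, so that the check applies to the right filesystem.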

cd ./data
bash create_datasets_from_start.sh

 

root@ubuntu2004:~/Model-References/PyTorch/nlp/pretraining/deepspeed-bert/data# ./create_datasets_from_start.sh
Checkout WikiExtractor repository
Cloning into 'wikiextractor'...
remote: Enumerating objects: 771, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 771 (delta 17), reused 24 (delta 14), pack-reused 741
Receiving objects: 100% (771/771), 1.31 MiB | 3.11 MiB/s, done.
Resolving deltas: 100% (450/450), done.
---
Downloading: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
Running: ['wget', 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2', '--output-document=/root/Model-References/PyTorch/nlp/pretraining/deepspeed-bert/data/download/wikicorpus_en/wikicorpus_en.xml.bz2']
  https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
Resolving proxy-us.intel.com (proxy-us.intel.com)... 10.1.192.48
Connecting to proxy-us.intel.com (proxy-us.intel.com)|10.1.192.48|:912... connected.
Proxy request sent, awaiting response... 200 OK
Length: 20580559456 (19G) [application/octet-stream]
Saving to: ‘/root/Model-References/PyTorch/nlp/pretraining/deepspeed-bert/data/download/wikicorpus_en/wikicorpus_en.xml.bz2’
 
pretraining/deepspeed-bert/data/download/   17%[---------------->                                              ] 3.35G   4.23MB/s    eta 55m 10s

Running the Model

Now that the environment has been set up, we can run the full pre-training. For simplicity, Habana provides a full script that runs an 8-card example with the BERT 1.5B model. This script contains the --deepspeed run command argument and the pointer to the DeepSpeed configuration .json file. Before running this script, let’s look at some of the key changes in the model and review what the DeepSpeed engine is doing with the model.

What is DeepSpeed and ZeRO

The DeepSpeed library is an optimization library that manages model gradients, optimizer states, system memory, and the distribution of the workload across the Gaudi accelerators. It abstracts the difficult aspects of large-scale training, such as parallelization, mixed precision, and gradient accumulation, and takes advantage of the ZeRO optimizer.
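To give a feel for why managing optimizer state matters, here is a back-of-the-envelope estimate of per-device memory for model state. It is a sketch under stated assumptions, not a measurement: it uses the common mixed-precision accounting of 2 bytes per weight, 2 bytes per gradient, and 12 bytes per parameter of Adam-style optimizer state (FP32 master weights plus two moments), it assumes the optimizer used here behaves similarly, and it ignores activations and temporary buffers entirely.

```python
def model_state_gb(num_params, num_devices, zero_stage=0,
                   bytes_weights=2, bytes_grads=2, bytes_optimizer=12):
    """Estimate per-device memory (GB) for weights, gradients and optimizer state.

    The byte counts are assumptions for an Adam-style mixed-precision setup.
    Only ZeRO stage 0 (disabled) and stage 1 are modelled.
    """
    if zero_stage not in (0, 1):
        raise ValueError("this sketch only models ZeRO stage 0 and stage 1")
    optimizer_bytes = num_params * bytes_optimizer
    if zero_stage == 1:
        # Stage 1 partitions optimizer state across the data-parallel devices.
        optimizer_bytes /= num_devices
    total_bytes = num_params * (bytes_weights + bytes_grads) + optimizer_bytes
    return total_bytes / 1e9


params = 1.5e9  # roughly the size of the BERT 1.5B model
for devices in (8, 16, 32, 64, 128):
    without_zero = model_state_gb(params, devices, zero_stage=0)
    with_stage1 = model_state_gb(params, devices, zero_stage=1)
    print(f"{devices:3d} devices: {without_zero:.2f} GB without ZeRO, "
          f"{with_stage1:.2f} GB with ZeRO Stage 1")
```

In this simplified model, the footprint without ZeRO stays the same regardless of device count, while the Stage 1 footprint shrinks toward the fixed cost of weights and gradients as devices are added.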

Using ZeRO in a DeepSpeed model is quick and easy; all that is needed is to change a few settings in the DeepSpeed configuration JSON. No model code changes are needed to enable the benefits of ZeRO on a model.

Setting up a model for DeepSpeed

Enabling Habana’s BERT 1.5B model for DeepSpeed takes several steps. For more details on how to convert a model to use the DeepSpeed library, you can refer to the DeepSpeed Getting Started guide. First, initialize the DeepSpeed model execution and distribution by calling the deepspeed.initialize() and deepspeed.init_distributed() functions. Examples from the BERT 1.5B model are shown below:

model, optimizer, _, lr_scheduler = deepspeed.initialize(
    args=args,
    # Pass parameter groups only when no optimizer object is supplied.
    model_parameters=None if optimizer else optimizer_grouped_parameters,
    model=model,
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
)

An important detail of the init_distributed() call is that the dist_backend variable is set to “hccl”, which selects Habana’s Collective Communications Library (HCCL). This function replaces the original use of the PyTorch DistributedDataParallel() function call in a non-DeepSpeed model.

if args.use_hpu:
    import habana_frameworks.torch.hpu
    import habana_frameworks.torch.distributed.hccl

    device = torch.device("hpu")
    dist_backend = "hccl"

    deepspeed.init_distributed(dist_backend=dist_backend, init_method=init_method)

Additionally, the backward pass changes from loss.backward() to model.backward(loss), and the optimizer step changes from optimizer.step() to model.step(), taking advantage of the optimizer built into the DeepSpeed engine. The BERT 1.5B model uses the LANS optimizer.
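The change in call pattern can be sketched as follows. To keep the example runnable without Gaudi hardware or a DeepSpeed install, it uses a hypothetical RecordingEngine class that only records the order of calls; in the real model, the object returned by deepspeed.initialize() plays this role.

```python
def train_step(engine, batch):
    """Run one training step through a DeepSpeed-style engine."""
    loss = engine(batch)   # forward pass through the wrapped model
    engine.backward(loss)  # replaces loss.backward()
    engine.step()          # replaces optimizer.step()
    return loss


class RecordingEngine:
    """Hypothetical stand-in for a DeepSpeed engine; it only records call order."""

    def __init__(self):
        self.calls = []

    def __call__(self, batch):
        self.calls.append("forward")
        return sum(batch) / len(batch)  # dummy "loss": mean of the batch values

    def backward(self, loss):
        self.calls.append("backward")

    def step(self):
        self.calls.append("step")


engine = RecordingEngine()
loss = train_step(engine, [1.0, 2.0, 3.0])
print(engine.calls, loss)
```

Routing both calls through the engine is what lets DeepSpeed apply its memory and precision handling without further changes to the training loop.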

To manage the runtime of the DeepSpeed model, two additional files set the execution parameters:

ds_config.json – the file that sets the configuration parameters for the DeepSpeed execution: hyperparameters, the ZeRO stage, and other DeepSpeed-specific runtime variables. For Habana’s BERT 1.5B model, this file is deepspeed_config_bert_1.5b.json; it selects ZeRO Stage 1, sets mixed precision to BF16, and sets other variables.

run_deepspeed.sh – the run script that calls the model, points to deepspeed_config_bert_1.5b.json, and initiates the run on Habana by setting the --use_hpu option in the run command. For Habana’s BERT 1.5B model, this file is called run_bert_1.5b_8x.sh. This script calls the run_pretraining.py script.
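For orientation, a configuration of this kind could look roughly like the sketch below, written as a Python dictionary and serialized to JSON. The keys are standard DeepSpeed configuration options, but every numeric value is a placeholder chosen for illustration; the actual contents of deepspeed_config_bert_1.5b.json may differ.

```python
import json

WORLD_SIZE = 8  # eight Gaudi devices, as in the 8-card example

# Placeholder values for illustration only; not copied from the real config file.
ds_config = {
    "train_batch_size": 1024,             # assumed global batch size
    "train_micro_batch_size_per_gpu": 8,  # assumed per-device micro-batch
    "gradient_accumulation_steps": 16,    # assumed accumulation steps
    "steps_per_print": 10,
    "zero_optimization": {"stage": 1},    # ZeRO Stage 1
    "bf16": {"enabled": True},            # BF16 mixed precision
}

# DeepSpeed expects: global batch = micro-batch * accumulation steps * devices.
expected_global_batch = (
    ds_config["train_micro_batch_size_per_gpu"]
    * ds_config["gradient_accumulation_steps"]
    * WORLD_SIZE
)
assert ds_config["train_batch_size"] == expected_global_batch

config_text = json.dumps(ds_config, indent=2)
print(config_text)
```

The consistency check matters in practice: if the three batch-related values do not multiply out for the number of devices in use, DeepSpeed rejects the configuration at startup.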

Running Habana’s BERT1.5B Model:

Once the initial steps have been completed, you can call the run script:

bash ./scripts/run_bert_1.5b_8x.sh

 

This will execute the model on eight Gaudi HPUs and create a set of checkpoints for use in the Fine-Tuning steps later.  

Beginning of run script:

root@ubuntu2004:~/Model-References/PyTorch/nlp/pretraining/deepspeed-bert# bash ./scripts/run_bert_1.5b_8x.sh
[runner.py:517:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --no_python --no_local_rank python -u ./run_pretraining.py --use_hpu --disable_progress_bar --optimizer=lans --use_lr_scheduler --resume_from_checkpoint --do_train --bert_model=bert-base-uncased --config_file=./scripts/bert_1.5b_config.json --json-summary=./results/bert_1.5b/dllogger.json --output_dir=./results/bert_1.5b/checkpoints --seed=12439 --input_dir=/data/pytorch/bert_pretraining/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/books_wiki_en_corpus --max_seq_length 128 --max_predictions_per_seq=20 --max_steps=155000 --steps_this_run=-1 --num_steps_per_checkpoint=200 --learning_rate=0.0015 --warmup_proportion=0.05 --constant_proportion=0.25 --scheduler_degree=1.0 --log_freq=10 --deepspeed --deepspeed_config=./scripts/deepspeed_config_bert_1.5b.json
[INFO] [launch.py:139:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[INFO] [launch.py:145:main] nnodes=1, num_local_procs=8, node_rank=0
[INFO] [launch.py:158:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[INFO] [launch.py:159:main] dist_world_size=8
Distributed training with backend=hccl, device=hpu, local_rank=3
Distributed training with backend=hccl, device=hpu, local_rank=5
Distributed training with backend=hccl, device=hpu, local_rank=7
Distributed training with backend=hccl, device=hpu, local_rank=1
Distributed training with backend=hccl, device=hpu, local_rank=0
Distributed training with backend=hccl, device=hpu, local_rank=2
Distributed training with backend=hccl, device=hpu, local_rank=4
Distributed training with backend=hccl, device=hpu, local_rank=6
[INFO] [comm.py:628:init_distributed] Initializing TorchBackend in DeepSpeed with backend hccl
Using LANS
Using PolyWarmUpScheduler with args={'warmup': 0.05, 'total_steps': 155000.0, 'degree': 1.0, 'constant': 0.25}

End of run script:

[INFO] [engine.py:3268:_save_zero_checkpoint] zero checkpoint saved deepspeed-run/1.7.1-85/bert_1.5b/8/2023-01-04_17-52/checkpoints/20/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint 20 is ready now!
[INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint 20 is ready now!
[INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint 20 is ready now!
[INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint 20 is ready now!
[INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint 20 is ready now!
[INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint 20 is ready now!
[INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint 20 is ready now!
DLL 2023-01-04 18:43:28.908788 -  e2e_train_time : 2313.8388447761536  training_sequences_per_second : 63.06757577335175  final_loss : 11.55532455444336  raw_train_time : 1948.3862903118134
[INFO] [launch.py:322:main] Process 12605 exits successfully.
[INFO] [launch.py:322:main] Process 12609 exits successfully.
[INFO] [launch.py:322:main] Process 12608 exits successfully.
[INFO] [launch.py:322:main] Process 12610 exits successfully.
[INFO] [launch.py:322:main] Process 12607 exits successfully.
[INFO] [launch.py:322:main] Process 12606 exits successfully.
[INFO] [launch.py:322:main] Process 12604 exits successfully.
[INFO] [launch.py:322:main] Process 12603 exits successfully.
/usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/core/__init__.py:94: UserWarning: habana_frameworks.torch.core.get_device_count is deprecated. Please use habana_frameworks.torch.hpu.device_count
  warnings.warn("habana_frameworks.torch.core.get_device_count is deprecated. "
LOG_FILE   = /deepspeed-bert_1.5b-8-p1-2023-01-04_17-52.log
TOPOLOGY   = pretraining
BATCH_SIZE =
WORLD_SIZE = 8
Pretraining phase1 results
8 card(s) average sentences per second:302.5176
average dps: 37.8147

Results

Below are the throughput and the overall memory consumption for the BERT 1.5B parameter model running on 8, 16, 32, 64, and 128 devices. If the ZeRO optimizer were not used, the model would not fit in Gaudi’s memory and would cause an out-of-memory (OOM) error.

Max system memory

 


Table 1: BERT 1.5B LANS Pre-Training Phase 1 Throughput on First-Gen Gaudi

Next Steps

You are invited to experiment with several options for running DeepSpeed based models on Gaudi. You can use one of the following examples: