Overview
Create an end-to-end AI/ML (Artificial Intelligence/Machine Learning) pipeline demonstrating Quantization-Aware Training and Inference using a wide array of Intel® software including the OpenVINO™ toolkit. The workflow is deployed through Helm* by using microservices and Docker* images.
Select Configure & Download to download the workflow.
- Time to Complete: 10 minutes
- Programming Language: Python*
- Available Software: OpenVINO™ toolkit, Hugging Face Optimum* Intel® interface, Docker*, Kubernetes*, Helm*
How It Works
Figure 1: Flow Diagram
The workflow executes as follows:
- The pipeline triggers Quantization-Aware Training of a Natural Language Processing (NLP) model from Hugging Face. The output of this container is the INT8-optimized model, stored on local or cloud storage.
- Once the model is generated, inference applications can be deployed with one of the following APIs:
  - Inference using Hugging Face API with Optimum Intel
  - Inference using Hugging Face API with Optimum ONNX Runtime
  - Deploy the model using OpenVINO™ Model Server and send in gRPC requests
Get Started
Prerequisites
You need a Kubernetes* cluster that meets the Edge Node and Software requirements described below.
Edge Node Requirements
- One of the following processors:
  - Intel® Xeon® Platinum 8370C Processor @ 2.80GHz (16 vCPUs), with at least 64 GB RAM.
  - Intel® Xeon® Processor with NVIDIA* GPU, with at least 112 GB RAM.
- At least 256 GB hard drive.
- An Internet connection.
- Ubuntu* 20.04 LTS.
Software Requirements
- Any Kubernetes distribution. This project uses a Rancher* K3S* installation.
- Helm installed on the master node. Simple commands are listed below. For details, see Helm installation instructions.
- This project uses the bert-large-uncased-whole-word-masking-finetuned-squad model for the Question Answering use case through quantization-aware training and inference. Training and inference scripts are included in the respective folders.
Step 1: Install and Run the Workflow
Download Source Code
Choose one of the following options:
- Select Configure & Download to download the workflow.
- Or, run the command:
Modify Helm Chart Values
Edit the helmchart/qat/values.yaml file as follows:
- Replace <current_working_gitfolder> under mountpath: with the current working repo directory. NOTE: Relative paths do not work with Helm.
- Edit the helmchart/qat/values.yaml file for the <train_node> and <inference_node> values under the nodeselector key. Pick any of the available nodes for training and inference, using the node name reported by this command.
values.yaml file
- Edit the helmchart/qat/values.yaml file with higher values of the MAX_TRAIN_SAMPLES and MAX_EVAL_SAMPLES parameters for better fine-tuning. The default value is 50 samples.
- Find details on all the parameters in the Parameters Table.
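Because relative paths do not work for the mountpath value, it can help to compute the absolute path of the repo directory before pasting it into values.yaml. The following is a minimal sketch using only the Python standard library; the function name resolve_mount_path is hypothetical and is not part of the workflow's scripts.

```python
from pathlib import Path


def resolve_mount_path(repo_dir: str) -> str:
    """Return an absolute path suitable for the mountpath value.

    A relative input such as "." is resolved against the current
    working directory; "~" is expanded to the user's home directory.
    """
    return str(Path(repo_dir).expanduser().resolve())


if __name__ == "__main__":
    # "." stands for the cloned repo directory (illustrative only).
    print(resolve_mount_path("."))
```

Run it from inside the cloned repo and copy the printed path into the mountpath: field.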
Step 2: Run Helm Charts
This section contains step-by-step details to install specific Helm charts with both training and inference. Learn more about Helm commands.
Use Case 1: QAT with Inference using OpenVINO™ through Optimum* Intel® Interface
Inference can run in two ways:
- Using an input CSV file (default).
- Using arguments (optional): a question argument and a context argument. To run inference this way, edit deployment_optimum.yaml and pass the question and context as below:
The Training pod is deployed through pre_install_job.yaml. By default, training runs on the CPU. To enable GPU for training, refer to Enable NVIDIA GPU for training.
The Inference pod is deployed through deployment_optimum.yaml.
The <time> value has the format nnns, where s indicates seconds. For the above hardware configuration and with MAX_TRAIN_SAMPLES=50, we recommend you set the <time> value to 480s. Increase the value on less powerful hardware. Refer to Troubleshooting in case of timeout errors.
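To make the nnns format concrete, the sketch below validates a timeout string and builds the argument list for a helm install call. It is an illustration under stated assumptions, not the guide's own command: the release name qatchart is taken from the Troubleshooting section, the chart directory qat is assumed from the helmchart/qat path, and the helper name helm_install_args is hypothetical.

```python
import re
from typing import List

# A positive integer followed by "s", for example "480s".
_TIMEOUT_RE = re.compile(r"^[1-9]\d*s$")


def helm_install_args(timeout: str,
                      release: str = "qatchart",  # assumed release name
                      chart: str = "qat") -> List[str]:  # assumed chart directory
    """Build the argument list for `helm install` with a --timeout value."""
    if not _TIMEOUT_RE.match(timeout):
        raise ValueError(f"expected a value such as '480s', got {timeout!r}")
    return ["helm", "install", release, chart, "--timeout", timeout]


if __name__ == "__main__":
    # Print the command instead of running it, so nothing is deployed.
    print(" ".join(helm_install_args("480s")))
```

The resulting list can be passed to subprocess.run from the directory that contains the chart once the values look right.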
Confirm that the training has been deployed.
If the training pod is in "Running" state then the pod has been deployed. Otherwise, check for any errors by running the command:
The training pod will be in "Completed" state after it finishes training. Refer to the Training Output section for details.
Once the training is completed, the inference pod gets deployed automatically. The inference pod uses OpenVINO™ Runtime as a backend to Hugging Face APIs and takes the model generated by the training pod as input.
Optimum Inference Output
- Input to the inference pod is taken from the openvino_optimum_inference/data folder.
- Output of the OpenVINO™ Integration with Optimum* inference pod is stored in the openvino_optimum_inference/logs.txt file.
- View the logs using:
Use Case 2: QAT with Inference using OpenVINO™ Model Server
The Training pod is deployed through pre_install_job.yaml. The OpenVINO™ Model Server pod is deployed through deployment_ovms.yaml.
Copy deployment_ovms.yaml from the helmchart/deployment_yaml folder into helmchart/qat/templates. Make sure there is only one deployment_*.yaml file in the templates folder for a single deployment.
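The "only one deployment_*.yaml file" rule is easy to break when switching between use cases. As a rough pre-flight check, the sketch below lists the deployment templates in a folder and raises an error unless exactly one is present. The function names are hypothetical; the folder you pass in would be helmchart/qat/templates.

```python
from pathlib import Path
from typing import List


def deployment_templates(templates_dir: str) -> List[str]:
    """Return the sorted names of deployment_*.yaml files in templates_dir."""
    return sorted(p.name for p in Path(templates_dir).glob("deployment_*.yaml"))


def check_single_deployment(templates_dir: str) -> str:
    """Return the single deployment file name, or raise if there is not exactly one."""
    found = deployment_templates(templates_dir)
    if len(found) != 1:
        raise RuntimeError(
            f"expected exactly one deployment_*.yaml in {templates_dir}, found {found}"
        )
    return found[0]
```

Other templates in the folder, such as pre_install_job.yaml, are ignored by the glob pattern and do not affect the check.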
Follow the same instructions as Use Case 1.
OpenVINO™ Model Server Inference Output
- OpenVINO™ Model Server deploys the optimized model from the training container. View the logs using the command:
- The client can send gRPC requests to the server using OpenVINO™ APIs. Find more details on the OpenVINO™ Model Server Adapter API.
- Run a sample OpenVINO™ client application as below. Open a new terminal to run the client application. Change the <hostname> in the command below before running. <hostname> is the hostname of the node where the OpenVINO™ Model Server has been deployed. In this case, hostname should be srdev.
Run Client Application to Send Request to OpenVINO™ Model Server
This downloads the inference script from open_model_zoo and serves inference using the OpenVINO™ Model Server.
The client application opens an interactive terminal where you can ask questions based on the context from https://en.wikipedia.org/wiki/Bert_(Sesame_Street), which is given as input. Enter a question at the prompt.
Use Case 3: QAT with Inference using OpenVINO™ Execution Provider through Optimum* ONNX Runtime
The Training pod is deployed through pre_install_job.yaml.
The Optimum ONNX Runtime with OpenVINO™ Execution Provider pod is deployed through deployment_onnx.yaml.
Copy deployment_onnx.yaml from the helmchart/deployment_yaml folder into helmchart/qat/templates. Make sure there is only one deployment_*.yaml file in the templates folder.
Follow the same instructions as Use Case 1.
ONNX Runtime Inference Output
- Input to the inference pod is taken from the onnxovep_optimum_inference/data folder.
- Output of the ONNX Runtime inference pod is stored in the onnxovep_optimum_inference/logs.txt file.
- View the logs using:
Use Case 4: Inference Only
Before triggering the inference, make sure you have access to the model file, and edit the model path in the qat/values.yaml file.
Keep only one deployment_*.yaml file in the qat/templates folder to deploy just one inference application.
- For the Hugging Face API with OpenVINO™ through Optimum Intel, use deployment_optimum.yaml. Accepted model formats: PyTorch or IR (.xml).
- For OpenVINO™ Model Server, use deployment_ovms.yaml. Accepted model format: IR (.xml).
- For Optimum ONNX Runtime with OpenVINO™ Execution Provider, use the deployment_onnx.yaml file. Accepted model format: .onnx.
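The pairing of deployment templates and model formats listed above can be written down as a small lookup table. The sketch below only restates that list. The format labels "pytorch", "ir", and "onnx" and the function name templates_for_format are naming choices made for this illustration, not identifiers from the workflow.

```python
from typing import Dict, List, Tuple

# Accepted model formats per deployment template, as listed in this section.
ACCEPTED_FORMATS: Dict[str, Tuple[str, ...]] = {
    "deployment_optimum.yaml": ("pytorch", "ir"),
    "deployment_ovms.yaml": ("ir",),
    "deployment_onnx.yaml": ("onnx",),
}


def templates_for_format(model_format: str) -> List[str]:
    """Return the deployment templates that accept the given model format."""
    fmt = model_format.strip().lower()
    return [name for name, formats in ACCEPTED_FORMATS.items() if fmt in formats]


if __name__ == "__main__":
    print(templates_for_format("ir"))
```

An empty result means none of the three inference applications accepts that format.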
To run inference, use the following commands:
Clean Up
After you run a use case, clean up resources using the command:
Useful Commands
Uninstalling Helm: (If required)
Uninstalling K3S: (If required)
Refer to Steps to uninstall Rancher K3S.
Step 3: Evaluate Use Case Output
View the pods that are deployed through Helm Chart with the command below:
Take the pod_name from the list of pods and run:
If the pods are in the "Completed" state, they have finished their task.
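If you prefer to check pod state from a script, one option is to parse the JSON that kubectl get pods -o json prints. The sketch below is a minimal example under that assumption; the helper names and the pod names in the test data are hypothetical. Note that a pod shown as "Completed" in the kubectl STATUS column has the phase "Succeeded" in the JSON output.

```python
import json
from typing import Dict, List


def pod_phases(kubectl_json: str) -> Dict[str, str]:
    """Map each pod name to its phase, given `kubectl get pods -o json` output."""
    doc = json.loads(kubectl_json)
    return {
        item["metadata"]["name"]: item["status"]["phase"]
        for item in doc.get("items", [])
    }


def finished_pods(phases: Dict[str, str]) -> List[str]:
    """Return the names of pods whose phase is Succeeded (shown as Completed)."""
    return [name for name, phase in phases.items() if phase == "Succeeded"]
```

The JSON text can be captured with subprocess.run(["kubectl", "get", "pods", "-o", "json"], capture_output=True, text=True).stdout on a machine with cluster access.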
Training Output
- Output of the training container is an optimized INT8 model generated in the quantization_aware_training/model folder.
- Verify that all the model files are generated in the <output> folder.
- A logs.txt file is generated to store the logs of the training container, including accuracy details.
Inference Output
- Output of the inference is the inference time and the answer to the question pertaining to the context file given as input.
- A log file named logs.txt is generated in the inference folder.
Set Up Azure Storage (Optional)
Use Azure† Storage for a multi-node Kubernetes setup if you want to use the same storage across all the nodes.
Azure References
Setup Steps
- Open the Azure CLI terminal on the Azure Portal.
- Create a resource group:
- Create a storage account:
- Create a storage key:
- Create a file share:
- Create a mount point:
- Mount the share:
Use Azure Storage in Helm Chart
- Clone the git_repo in /mnt/MyAzureFileShare and make it your working directory.
- Edit <current_working_directory> in the ./helmchart/qat/values.yaml file to reflect this location.
- All other instructions are the same as in the steps above to install the Helm chart and trigger the pipeline.
- Once the training is completed, you can open the Azure Portal and check in your file share that the model has been generated.
Learn More
To continue your learning, see the following guides and software resources:
- Hugging Face Optimum Intel Interface
- Neural Network Compression Framework (NNCF)
- Hugging Face Transformers training pipelines
- OpenVINO™ Execution Provider
Troubleshooting
Connection Refused
If you encounter a connection refused message as shown below:
Set the environment variable:
Helm Installation Failed
If you see this error message:
Run the command:
Then install it again:
Helm Timeout
If the training takes a long time, you may see a timeout error during the helm install command, similar to the text below:
Workaround 1
Based on the system performance, add --timeout <time> to the helm command:
The <time> value has the format nnns, where s indicates seconds. For the above hardware configuration and with MAX_TRAIN_SAMPLES=50, we recommend you set the <time> value to 480s. Increase the timeout if you need to fine-tune on the whole dataset.
Workaround 2
- Even if Helm issues an error, the training pod will get scheduled, keep running, and finish its job. Verify with kubectl logs <training_pod> when the pod is completed.
- Run the command:
- Install the qatchart with just inference, as training has completed:
Clean Up Resources
Remove the Helm chart:
Delete any pods or jobs after the execution:
Support Forum
If you're unable to resolve your issues, contact the Support Forum.
† You are responsible for payment of all third-party charges, including payment for use of Microsoft Azure services.
For the most up-to-date information on Microsoft® Azure products, see the Microsoft Azure website.