Build Secure, Scalable, and Accelerated Machine Learning Pipelines

Use Intel® SGX Virtual Machines in Azure Kubernetes* Service (AKS) Deployments

When deploying machine learning models into production, it is critical to ensure the security, scalability, and availability of the application. Using the Kubernetes* service on the Microsoft Azure* confidential computing platform is one of the most effective ways to ensure that each of these requirements is met.

In this tutorial, build and deploy a highly available and scalable XGBoost pipeline on Azure. This reference solution is part of the Intel® Cloud Optimization Modules for Azure, a set of cloud-native, open source reference architectures designed to facilitate building and deploying Intel-optimized AI solutions on leading cloud providers, including Amazon Web Services (AWS)*, Azure, and Google Cloud Platform*. Each module, or reference architecture, includes a complete instruction set, with all source code published on GitHub*. You can download the code for this module on GitHub.

This solution architecture uses Docker* for application containerization and stores the image in an Azure container registry (ACR). The application is then deployed on a cluster managed by Azure Kubernetes* Service (AKS). The cluster runs on confidential computing virtual machines that leverage Intel® Software Guard Extensions (Intel® SGX). A mounted Azure file share provides persistent data and model storage. The Kubernetes service provisions an Azure load balancer, which the client uses to interact with the application. These Azure resources are managed in an Azure resource group.

The deployed machine learning pipeline builds a loan default risk prediction system using Intel® Optimization for XGBoost*, Intel® oneAPI Data Analytics Library (oneDAL), and Intel® Extension for Scikit-learn* to accelerate model training and inference. The pipeline also demonstrates incremental training of the XGBoost model as new data becomes available, which helps address challenges such as data shift and very large datasets. When the pipeline is run, each pod is assigned to an Intel SGX node. Figure 1 shows the cloud solution architecture.

Figure 1. Diagram of the XGBoost pipeline on Kubernetes architecture.

Configure Azure* Services

Before beginning this tutorial, ensure that you have downloaded and installed the prerequisites for the module. Then, in a new terminal window, use the following command to log into your Azure account interactively with the Azure command-line interface.

az login

Next, create a resource group to hold the Azure resources for the solution. Call the resource group intel-sgx-loan-default-app and set the location to eastus.

# Set the names of the Resource Group and Location
export RG=intel-sgx-loan-default-app 
export LOC=eastus 
 
# Create the Azure Resource Group
az group create -n $RG -l $LOC

Use an Azure file share for persistent volume storage of the application's data and model objects. To create the file share, first create an Azure storage account named loanappstorage using the following command:

# Set the name of the Azure storage account
export STORAGE_NAME=loanappstorage
 
# Create an Azure storage account
az storage account create \
--resource-group $RG \
--name $STORAGE_NAME \
--kind StorageV2 \
--sku Standard_LRS \
--enable-large-file-share \
--allow-blob-public-access false

Next, create a new file share in your storage account named loan-app-file-share with a quota of 1,024 GiB.

# Create an Azure file share
az storage share-rm create \
--resource-group $RG \
--storage-account $STORAGE_NAME \
--name loan-app-file-share \
--quota 1024

Create an Azure container registry to build, store, and manage the container image for the application. The following command creates a new container registry named loandefaultapp.

# Set the name of the Azure container registry
export ACR=loandefaultapp
 
# Create an Azure container registry
az acr create --resource-group $RG \
--name $ACR \
--sku Standard

Log in to the registry using the following command:

az acr login -n $ACR

Next, build an application image from the Dockerfile provided in this repository and push it to the container registry. Name the image loan-default-app with the tag latest. Ensure the Dockerfile is in your working directory before running the following command.

az acr build --image loan-default-app:latest --registry $ACR -g $RG --file Dockerfile .

Now you are ready to deploy the AKS cluster with confidential computing nodes that use Intel SGX virtual machines (VM).

Intel SGX VMs allow you to run sensitive workloads and containers within a hardware-based Trusted Execution Environment (TEE). A TEE lets user-level code from containers allocate private regions of memory, called enclaves, that execute directly with the CPU. Enclaves help protect data confidentiality, data integrity, and code integrity from other processes running on the same nodes, as well as from the Azure operator. These machines are powered by 3rd generation Intel® Xeon® Scalable processors and use Intel® Turbo Boost Max Technology 3.0 to reach up to 3.5 GHz.

To set up the confidential computing node pool, first create an AKS cluster with the confidential computing add-on, confcom, enabled. This creates a system node pool that hosts the AKS system pods, such as CoreDNS and the metrics server. The following command creates a node pool with a Standard_D4_v5 virtual machine; the Kubernetes version used for this tutorial is 1.25.5. The command also provisions a standard Azure load balancer for the cluster and attaches the container registry you created in the previous step, which allows the cluster to pull images from the registry.

# Set the name of the AKS cluster
export AKS=aks-intel-sgx-loan-app
 
# Create the AKS cluster
az aks create --resource-group $RG \
--name $AKS \
--node-count 1 \
--node-vm-size Standard_D4_v5 \
--kubernetes-version 1.25.5 \
--enable-managed-identity \
--generate-ssh-keys -l $LOC \
--load-balancer-sku standard \
--enable-addons confcom \
--attach-acr $ACR

Once the system node pool has been deployed, add the Intel SGX node pool to the cluster using an instance from the DCsv3 series. The name of the confidential node pool is intelsgx, which is referenced in the Kubernetes deployment manifest for scheduling application pods. Enable cluster autoscaling for this node pool with a minimum of one node and a maximum of five nodes.

# Add the Intel SGX node pool to the AKS cluster
az aks nodepool add --resource-group $RG \
--name intelsgx \
--cluster-name $AKS \
--node-count 1 \
--node-vm-size Standard_DC4s_v3 \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 5

Once the Intel SGX node pool has been added, obtain the cluster access credentials and merge them into your local .kube/config file using the following command:

az aks get-credentials -n $AKS -g $RG

To ensure that the Intel SGX VM nodes were created successfully, run:

kubectl get nodes

You should see two agent nodes running: one from the system node pool and one that begins with the name aks-intelsgx.

To ensure that the SGX device plugin DaemonSet was created successfully, run:

kubectl get pods -A

In the kube-system namespace, you should see a pod that begins with the name sgx-plugin running for each Intel SGX node. If these pods and nodes are running, the confidential node pool has been created successfully.

Set Up the Kubernetes Resources

Now that the AKS cluster has been deployed, use the following command to set up a Kubernetes namespace called intel-sgx-loan-app for the cluster resources.

export NS=intel-sgx-loan-app
kubectl create namespace $NS

Next, create a Kubernetes secret with the Azure storage account name and account key so that the loan-app-file-share can be mounted to the application pods.

export STORAGE_KEY=$(az storage account keys list -g $RG -n $STORAGE_NAME --query [0].value -o tsv)
 
kubectl create secret generic azure-secret \
--from-literal azurestorageaccountname=$STORAGE_NAME \
--from-literal azurestorageaccountkey=$STORAGE_KEY \
--type=Opaque

Now that you created a secret for your Azure storage account, you can set up persistent volume storage for the loan default risk prediction system. A persistent volume (PV) enables the application data to persist beyond the lifecycle of the pods and allows each of the pods in your deployment to access and store data in the loan-app-file-share.

First create the persistent volume using the pv-azure.yaml file in the kubernetes directory of the repository. This creates a storage resource in the cluster backed by your Azure file share. Then create the persistent volume claim (PVC) using the pvc-azure.yaml file, which requests 20 Gi of storage from the azurefile-csi storage class with ReadWriteMany access. Once the claim matches the volume, they are bound together in a one-to-one mapping.
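For reference, a minimal sketch of what these two manifests might contain is shown below. The resource names loan-app-pv and loan-app-pvc and the secret namespace are assumptions; the manifests in the repository's kubernetes directory are the authoritative versions.

# pv-azure.yaml (sketch): a statically provisioned volume backed by the Azure file share
apiVersion: v1
kind: PersistentVolume
metadata:
  name: loan-app-pv                    # placeholder name
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: azurefile-csi
  csi:
    driver: file.csi.azure.com
    volumeHandle: loan-app-file-share  # unique ID for this volume in the cluster
    volumeAttributes:
      shareName: loan-app-file-share   # the Azure file share created earlier
    nodeStageSecretRef:
      name: azure-secret               # the secret holding the storage account credentials
      namespace: default               # assumed: the namespace where azure-secret was created

# pvc-azure.yaml (sketch): requests 20 Gi of ReadWriteMany storage from the volume above
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: loan-app-pvc                   # placeholder name
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: azurefile-csi
  resources:
    requests:
      storage: 20Gi
  volumeName: loan-app-pv              # bind explicitly to the persistent volume above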

# Change working directory to the kubernetes directory 
cd kubernetes
 
# Create the Kubernetes persistent volume and persistent volume claim
kubectl create -f pv-azure.yaml -n $NS
kubectl create -f pvc-azure.yaml -n $NS

Next, create a Kubernetes external load balancer to distribute incoming API calls evenly across the available pods in the cluster and ensure that the application remains highly available and scalable. Running the following command configures an Azure load balancer for the cluster with a new public IP. This IP address is used to send requests to the application’s API endpoints for data processing, model training, and inference.
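The loadbalancer.yaml manifest defines a Kubernetes Service of type LoadBalancer. A minimal sketch is shown below; the pod selector label and target port are assumptions and must match the pod labels and container port in deployment.yaml (see the sketch later in this section), while the service name and port 8080 correspond to the service output and curl examples in this tutorial.

# loadbalancer.yaml (sketch): exposes the application pods behind an Azure load balancer
apiVersion: v1
kind: Service
metadata:
  name: loan-app-load-balancer
spec:
  type: LoadBalancer       # tells AKS to provision an Azure load balancer with a public IP
  selector:
    app: sgx-loan-app      # assumed pod label; must match the labels in deployment.yaml
  ports:
    - port: 8080           # external port used in the curl examples below
      targetPort: 8080     # assumed container port served by the application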

kubectl create -f loadbalancer.yaml -n $NS

Now you are ready to deploy the first application pod. To create and manage the application pods, use a Kubernetes deployment, a resource that declaratively manages a set of replica pods for a given application and ensures that the desired number of pods is running and available at all times.

In the pod template of the deployment spec, specify that the pods are scheduled only on nodes in the Intel SGX node pool. If an SGX node is not available, the cluster automatically scales up to add one. Also specify the directory in the pod where the volume is mounted, which is the /loan_app/azure-fileshare directory used by the API endpoints below. When the pod is scheduled, the cluster inspects the persistent volume claim to find the bound volume and mounts the Azure file share to the pod.

Note: Before executing the following command, update the container image field in the deployment.yaml file with the name of the Azure container registry, repository, and tag. For example, loandefaultapp.azurecr.io/loan-default-app:latest.
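A condensed sketch of the key parts of deployment.yaml is shown below. The labels, container port, and claim name are placeholders matching the earlier sketches; the nodeSelector, mount path, and image follow the description in this section. The file in the repository is the authoritative version.

# deployment.yaml (sketch): schedules the application pod on the Intel SGX node pool
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sgx-loan-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sgx-loan-app
  template:
    metadata:
      labels:
        app: sgx-loan-app                # must match the load balancer's selector
    spec:
      nodeSelector:
        agentpool: intelsgx              # restrict scheduling to the Intel SGX node pool
      containers:
        - name: loan-default-app
          image: loandefaultapp.azurecr.io/loan-default-app:latest   # your ACR image
          ports:
            - containerPort: 8080        # assumed application port
          volumeMounts:
            - name: azure-fileshare
              mountPath: /loan_app/azure-fileshare   # where the Azure file share is mounted
      volumes:
        - name: azure-fileshare
          persistentVolumeClaim:
            claimName: loan-app-pvc      # the claim from pvc-azure.yaml (placeholder name)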

kubectl create -f deployment.yaml -n $NS

To check that the deployment was created successfully, you can use the kubectl get command:

kubectl get all -n $NS

Your output should be similar to:

NAME                               READY   STATUS    RESTARTS   AGE
pod/sgx-loan-app-5948f49746-zl6cf   1/1    Running   0          81s

NAME                              TYPE           CLUSTER-IP   EXTERNAL-IP      PORT(S)          AGE
service/loan-app-load-balancer    LoadBalancer   10.0.83.32   20.xxx.xxx.xxx   8080:30242/TCP   3m40s
 
NAME                          READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/sgx-loan-app  1/1     1            1           81s
 
NAME                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/sgx-loan-app-5948f49746   1         1         1       81s

Finally, create a Kubernetes horizontal pod autoscaler (HPA) for your application that maintains between one and five replicas of the pods. The HPA controller increases and decreases the number of replicas (by updating the deployment) to maintain an average CPU use of 50%. Once CPU use by the pods goes above 50%, a new pod is automatically scheduled.
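A minimal sketch of what hpa.yaml might contain is shown below; the HPA name is a placeholder, while the replica bounds, CPU target, and deployment name follow the values described above.

# hpa.yaml (sketch): scales the deployment between 1 and 5 replicas at 50% average CPU
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sgx-loan-app-hpa               # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sgx-loan-app                 # the application deployment
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50       # scale out when average CPU use exceeds 50%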

kubectl create -f hpa.yaml -n $NS

Deploy the Application

Now that the Azure resources and Kubernetes infrastructure have been set up, you can send requests to the application’s API endpoints. The three main endpoints of the application are data processing, model training, and inference. The credit risk dataset used in the application is obtained from Kaggle* and synthetically augmented for testing and benchmarking purposes.

Data Processing

The data processing endpoint receives the original credit risk CSV file, creates a data preprocessing pipeline, and generates training and test sets of the specified size. The training and test data, as well as the preprocessor, are stored in the Azure file share.

The data processing endpoint accepts the following four parameters:

  • az_file_path: The volume mount path to the Azure file share for object storage.
  • data_directory: The directory in the file share where processed data should be stored.
  • file: The original credit risk CSV file located in the working directory.
  • size: The desired size of the final dataset; the default is 4 million rows.

You can make a call to this endpoint using the following command, replacing <external-IP> with the external IP address created by the Azure load balancer (shown in the EXTERNAL-IP column of the service output above).

curl <external-IP>:8080/data_processing \
-H "Content-Type: multipart/form-data" \
-F az_file_path=/loan_app/azure-fileshare \
-F data_directory=data \
-F file=@credit_risk_dataset.csv \
-F size=4000000 | jq

In the loan-app-file-share, you should see a new directory named data with the processed training and test sets as well as the data preprocessor.

Model Training

Now you are ready to train the XGBoost model. The model training endpoint either begins training the model or continues training it, depending on the value of the continue_training parameter. The XGBoost classifier and the model validation results are then stored in the Azure file share.

This endpoint accepts the following six parameters:

  • az_file_path: The volume mount path to the Azure file share.
  • data_directory: The directory in the file share where processed data is stored.
  • model_directory: The directory in the file share for model storage.
  • model_name: The name to store the model object.
  • continue_training: The XGBoost model parameter for training continuation.
  • size: The size of the processed dataset.

You can make a call to the model training endpoint using the following command:

curl <external-IP>:8080/train \
-H "Content-Type: multipart/form-data" \
-F az_file_path=/loan_app/azure-fileshare \
-F data_directory=data \
-F model_directory=models \
-F model_name=XGBoost \
-F continue_training=False \
-F size=4000000 | jq

In the loan-app-file-share, you should see a new directory named models with the XGBoost model saved as a .joblib object and the model performance results.

Continue Training the XGBoost Model

To demonstrate training continuation of the XGBoost model, first process a new batch of 1,000,000 rows of data using the following command:

curl <external-IP>:8080/data_processing \
-H "Content-Type: multipart/form-data" \
-F az_file_path=/loan_app/azure-fileshare \
-F data_directory=data \
-F file=@credit_risk_dataset.csv \
-F size=1000000 | jq

Then call the model training endpoint with the continue_training parameter set to True.

curl <external-IP>:8080/train \
-H "Content-Type: multipart/form-data" \
-F az_file_path=/loan_app/azure-fileshare \
-F data_directory=data \
-F model_directory=models \
-F model_name=XGBoost \
-F continue_training=True \
-F size=1000000 | jq

Inference

In the final step of the pipeline, the inference endpoint uses the trained XGBoost classifier to predict the likelihood of a loan default. This endpoint retrieves the trained XGBoost model from the Azure file share and converts it into a daal4py format to perform model inference. The inference results, containing the predicted probabilities and a corresponding label of True when the probability is greater than 0.5 and False otherwise, are then stored in the loan-app-file-share.

The inference endpoint accepts a CSV file with sample data in the same format as the credit risk data, in addition to the following parameters:

  • file: The sample CSV file for model inference.
  • az_file_path: The volume mount path to the Azure file share.
  • data_directory: The directory in the file share where the data preprocessor is stored.
  • model_directory: The directory in the file share where the trained XGBoost model is stored.
  • model_name: The name of the stored model object.
  • sample_directory: The directory in the file share to save the prediction results.

You can make a call to the inference endpoint using the following command:

curl <external-IP>:8080/predict \
-H "Content-Type: multipart/form-data" \
-F file=@sample.csv \
-F az_file_path=/loan_app/azure-fileshare \
-F data_directory=data \
-F model_directory=models \
-F model_name=XGBoost \
-F sample_directory=samples | jq

In the loan-app-file-share, you should see a new directory named samples with the daal4py prediction results saved in a CSV file.

Summary

This tutorial demonstrated how to build a highly available and scalable Kubernetes application on the Azure cloud using Intel SGX confidential computing nodes. The solution architecture implemented the Loan Default Risk Prediction AI Reference Kit, which builds an XGBoost classifier capable of predicting the probability that a loan results in default, accelerated by Intel optimizations for XGBoost and oneDAL. The tutorial also demonstrated how an XGBoost classifier can be updated with new data and trained incrementally, which helps address challenges such as data shift and very large datasets.

Next Steps

View the full performance results of the Loan Default Risk Prediction AI Reference Kit.

Notices & Disclaimers

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.

Intel technologies may require enabled hardware, software or service activation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.