Install Kubeflow with Confidential Computing VMs on Microsoft Azure*

Leverage secure and confidential virtual machines (VMs) with Intel® Software Guard Extensions in Kubeflow Deployments

Many machine learning applications must ensure the confidentiality and integrity of the underlying code and data. Until recently, security has primarily focused on encrypting data that is at rest in storage or being transmitted across a network, but not on data that is in use. Intel® Software Guard Extensions (Intel® SGX) provide a set of instructions that allow you to securely process and preserve the application code and data. Intel SGX does this by creating a trusted execution environment (TEE) within the CPU. TEEs allow user-level code from containers to allocate private regions of memory, called enclaves, to execute the application code directly with the CPU.

With the Microsoft Azure* confidential computing platform, you can deploy both Windows* and Linux* virtual machines leveraging the security and confidentiality provided by Intel SGX. These machines are powered by 3rd Generation Intel® Xeon® Scalable processors and use Intel® Turbo Boost Max Technology 3.0 to reach 3.5 GHz. This tutorial will walk through how to set up Intel SGX nodes on an Azure Kubernetes* Service (AKS) cluster. We will then install Kubeflow*, the machine learning toolkit for Kubernetes that you can use to build and deploy scalable machine learning pipelines.

This module is a part of the Intel® Cloud Optimization Modules, a set of cloud-native open source reference architectures that are designed to facilitate building and deploying Intel-optimized AI solutions on leading cloud providers, including Amazon Web Services (AWS)*, Microsoft Azure, and Google Cloud Platform*.

Each module, or reference architecture, includes a complete set of instructions, all source code published on GitHub*, and a video walk-through. Before starting this tutorial, ensure that you have downloaded and installed the prerequisites. Then, from a new terminal window, use the command below to log in to your account interactively with the Microsoft Azure command-line interface.

az login
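
The prerequisites are not listed on this page, but the commands used later in this tutorial call az, kubectl, kustomize, git, and python3. As a minimal sketch, the snippet below reports any of those tools that are missing from your PATH. The tool list is inferred from this tutorial's commands rather than taken from an official prerequisite list, and the function name check_tools is made up for illustration.

```shell
# Print the name of every given tool that is not found on PATH.
check_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Tool list inferred from the commands in this tutorial (an assumption,
# not the official prerequisite list).
missing=$(check_tools az kubectl kustomize git python3)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "Missing tools:" $missing
else
  echo "All required tools found"
fi
```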

Next, create a resource group that will hold the Azure resources for our solution. We will call our resource group intel-aks-kubeflow and set the location to eastus.

# Set the names of the Resource Group and Location
export RG=intel-aks-kubeflow 
export LOC=eastus 

# Create the Azure Resource Group
az group create -n $RG -l $LOC

To set up the AKS cluster with confidential computing nodes, we will first create a system node pool and enable the confidential computing add-on. The confidential computing add-on will configure a DaemonSet for the cluster that will ensure each eligible VM node runs a copy of the Azure device plugin pod for Intel SGX.

The command below will provision a node pool using a standard virtual machine from the Dv5 series, which runs on 3rd Gen Intel Xeon Scalable processors. This is the node that will host the AKS system pods, like CoreDNS and metrics-server. The command will also enable managed identity for the cluster and provision a standard Azure Load Balancer. If you have already set up an Azure Container Registry, you can attach it to the cluster by adding the parameter --attach-acr <registry-name>.

# Set the name of the AKS cluster
export AKS=aks-intel-sgx-kubeflow 

# Create the AKS system node pool
az aks create --name $AKS \
--resource-group $RG \
--node-count 1 \
--node-vm-size Standard_D4_v5 \
--enable-addons confcom \
--enable-managed-identity \
--generate-ssh-keys -l $LOC \
--load-balancer-sku standard 

Once the system node pool has been deployed, we will add the Intel SGX node pool to the cluster. The following command will provision two four-core Intel SGX nodes from the Azure DCsv3 series. It also adds a node label to this node pool with the key intelvm and the value sgx. This key/value pair will be referenced in the Kubernetes nodeSelector to assign the Kubeflow pipeline pods to an Intel SGX node.

az aks nodepool add --name intelsgx \
--resource-group $RG \
--cluster-name $AKS \
--node-vm-size Standard_DC4s_v3 \
--node-count 2 \
--labels intelvm=sgx
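
To illustrate how the intelvm=sgx label is consumed by a nodeSelector, here is a minimal sketch that writes a throwaway pod manifest targeting that label. The pod name, file name, and busybox image are placeholders chosen for illustration and are not part of the original tutorial.

```shell
# Write a pod manifest that can only be scheduled on nodes labeled
# intelvm=sgx. The pod name, file name, and image are hypothetical.
cat > sgx-nodeselector-demo.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: sgx-nodeselector-demo
spec:
  nodeSelector:
    intelvm: sgx
  containers:
  - name: demo
    image: busybox
    command: ["sleep", "3600"]
EOF

# Once the cluster credentials are configured (the next step), you could run:
#   kubectl apply -f sgx-nodeselector-demo.yaml
#   kubectl get pod sgx-nodeselector-demo -o wide   # NODE should start with aks-intelsgx
#   kubectl delete pod sgx-nodeselector-demo
```

Note that a nodeSelector only controls where the pod is scheduled. It does not by itself make the container run inside an enclave.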

Once the confidential node pool has been set up, obtain the cluster access credentials and merge them into your local .kube/config file using the command below.

az aks get-credentials -n $AKS -g $RG

We can verify that the cluster credentials were set correctly by executing the command below. This should return the name of your AKS cluster.

kubectl config current-context

To ensure that the Intel SGX VM nodes were created successfully, run:

kubectl get nodes 

You should see two agent nodes in the Ready state whose names begin with aks-intelsgx.
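
If you would rather script this check than read the table by eye, the sketch below counts Ready nodes by name prefix. It assumes the default column layout of kubectl get nodes (NAME first, STATUS second). The helper name count_ready_nodes is hypothetical.

```shell
# Read `kubectl get nodes` output on stdin and print how many nodes whose
# name starts with the given prefix report a STATUS of exactly "Ready".
count_ready_nodes() {
  awk -v prefix="$1" '
    NR > 1 && index($1, prefix) == 1 && $2 == "Ready" { n++ }
    END { print n + 0 }
  '
}

# Usage against a live cluster:
#   kubectl get nodes | count_ready_nodes aks-intelsgx
# Filtering by the label added earlier is another option:
#   kubectl get nodes -l intelvm=sgx
```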

To ensure that the DaemonSet was created successfully, run:

kubectl get pods -A

In the kube-system namespace, you should see two pods running that begin with the name sgx-plugin. If you see the above pods and node pool running, this means that your AKS cluster is now ready to run confidential applications, and we can begin installing Kubeflow.

Install Kubeflow on an Azure Kubernetes Service (AKS) Cluster

To install Kubeflow on an AKS cluster, first clone the Kubeflow Manifests GitHub repository.

git clone https://github.com/kubeflow/manifests.git

Change the directory to the newly cloned manifests directory.

cd manifests

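As an optional step, you can change the default password used to access the Kubeflow Dashboard. First, generate a bcrypt hash of your new password with the command below, which requires the Python passlib package and prompts you for the password: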

python3 -c 'from passlib.hash import bcrypt; import getpass; print(bcrypt.using(rounds=12, ident="2y").hash(getpass.getpass()))'

Open config-map.yaml in the dex directory and paste the newly generated hash as the hash value, at around line 22.

nano common/dex/base/config-map.yaml 

    staticPasswords:
    - email: user@example.com
      hash: 
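
Editing the file by hand works fine. If you prefer to script the change, the following is a minimal sketch that rewrites the first hash: line in the file. It assumes the first staticPasswords entry is the one you want to update, and the function name set_dex_hash is made up for illustration.

```shell
# Replace the value of the first "hash:" line in a Dex config map.
# Assumes the first staticPasswords entry is the one to update.
set_dex_hash() {
  file=$1
  new_hash=$2
  tmp=$(mktemp) || return 1
  if awk -v h="$new_hash" '
    !replaced && /^[[:space:]]*hash:/ {
      sub(/hash:.*/, "")   # keep the indentation, drop the old value
      print $0 "hash: " h
      replaced = 1
      next
    }
    { print }
  ' "$file" > "$tmp"; then
    cat "$tmp" > "$file"   # overwrite in place, keeping file permissions
  fi
  rm -f "$tmp"
}

# Usage from the manifests directory. Single quotes stop the shell from
# expanding the "$" characters that bcrypt hashes contain:
#   set_dex_hash common/dex/base/config-map.yaml '<generated hash>'
```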

Next, change the Istio* ingress gateway from a ClusterIP to a LoadBalancer. This will configure an external IP address that we can use to access the dashboard from our browser.

Navigate to common/istio-1-16/istio-install/base/patches/service.yaml and change the specification type to LoadBalancer at around line 7.

apiVersion: v1
kind: Service
metadata:
  name: istio-ingressgateway
  namespace: istio-system
spec:
  type: LoadBalancer

For AKS clusters, we also need to disable the AKS admission enforcer for the Istio webhook. Open the Istio install.yaml and add the following annotation at around line 2694.

nano common/istio-1-16/istio-install/base/install.yaml

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: istio-sidecar-injector
  annotations:
    admissions.enforcer/disabled: 'true'
  labels:

Next, we will update the Istio gateway to configure the Transport Layer Security (TLS) Protocol. This will allow us to access the dashboard over HTTPS. Navigate to the kf-istio-resources.yaml and at the end of the file, at around line 14, paste the following contents:

nano common/istio-1-16/kubeflow-istio-resources/base/kf-istio-resources.yaml

    tls:
      httpsRedirect: true
  - port:
      number: 443
      name: https
      protocol: HTTPS
    hosts:
    - "*"
    tls:
      mode: SIMPLE
      privateKey: /etc/istio/ingressgateway-certs/tls.key
      serverCertificate: /etc/istio/ingressgateway-certs/tls.crt

Now we are ready to install Kubeflow. We will use kustomize to install the components with a single command. You can also install the components individually.

while ! kustomize build example | awk '!/well-defined/' | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done

Note: It may take several minutes for all components to be installed, and some may fail on the first try. This is inherent to how Kubernetes and kubectl work (e.g., custom resources must be created after CustomResourceDefinitions become ready). The solution is to simply re-run the command until it succeeds.
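
The loop above retries indefinitely, so a persistent error (for example, a typo in one of the edited manifests) would keep it running forever. As a hedged alternative, here is a sketch of a bounded retry. The helper names retry and apply_kubeflow, and the attempt count and delay in the usage line, are assumptions rather than part of the Kubeflow instructions.

```shell
# Run a command until it succeeds, at most max_attempts times,
# sleeping delay seconds between attempts.
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Giving up after $attempt attempts" >&2
      return 1
    fi
    echo "Attempt $attempt failed; retrying in ${delay}s" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Same pipeline as the install command in this tutorial.
apply_kubeflow() {
  kustomize build example | awk '!/well-defined/' | kubectl apply -f -
}

# Usage from the manifests directory (20 attempts, 10 seconds apart):
#   retry 20 10 apply_kubeflow
```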

Once you have installed the components, verify that all of the pods are running by using:

kubectl get pods -A
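
With the full Kubeflow install the pod list is long, so it can help to print only the pods that still need attention. The sketch below assumes the default column layout of kubectl get pods -A (NAMESPACE, NAME, READY, STATUS, ...). The helper name list_unhealthy_pods is hypothetical.

```shell
# Read `kubectl get pods -A` output on stdin and print
# "namespace/name ready status" for every pod that is not Completed and is
# either not Running or does not have all of its containers ready.
list_unhealthy_pods() {
  awk 'NR > 1 && $4 != "Completed" {
    split($3, ready, "/")
    if ($4 != "Running" || ready[1] != ready[2]) print $1 "/" $2, $3, $4
  }'
}

# Usage against a live cluster (no output means everything is up):
#   kubectl get pods -A | list_unhealthy_pods
```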

Optional: If you created a new password for Kubeflow, restart the dex pod to ensure it is using the updated password.

kubectl rollout restart deployment dex -n auth

Finally, create a self-signed certificate for the TLS Protocol using the external IP address from the Istio load balancer. To get the external IP address, use the following command:

kubectl get svc -n istio-system

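Create a file named certificate.yaml and paste in the contents below, replacing <Istio IP address> with the external IP address of the istio-ingressgateway service: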

nano certificate.yaml  

apiVersion: cert-manager.io/v1 
kind: Certificate 
metadata: 
  name: istio-ingressgateway-certs 
  namespace: istio-system 
spec: 
  secretName: istio-ingressgateway-certs 
  ipAddresses: 
    - <Istio IP address>  
  isCA: true 
  issuerRef: 
    name: kubeflow-self-signing-issuer 
    kind: ClusterIssuer 
    group: cert-manager.io 
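
As an alternative to editing the file by hand, the sketch below generates the same manifest from an IP address passed as an argument. The function name write_istio_certificate and its simple IPv4 check are assumptions for illustration. The YAML body mirrors the certificate shown above.

```shell
# Write the Istio certificate manifest for the given external IP address.
# $1 = IP address, $2 = output file (defaults to certificate.yaml).
write_istio_certificate() {
  ip=$1
  out=${2:-certificate.yaml}
  # Rough sanity check: reject empty values and anything that is not
  # digits and dots (such as an unreplaced placeholder).
  case $ip in
    ''|*[!0-9.]*)
      echo "expected an IPv4 address, got: '$ip'" >&2
      return 1
      ;;
  esac
  cat > "$out" <<EOF
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: istio-ingressgateway-certs
  namespace: istio-system
spec:
  secretName: istio-ingressgateway-certs
  ipAddresses:
    - $ip
  isCA: true
  issuerRef:
    name: kubeflow-self-signing-issuer
    kind: ClusterIssuer
    group: cert-manager.io
EOF
}

# Usage against a live cluster:
#   ISTIO_IP=$(kubectl get svc istio-ingressgateway -n istio-system \
#     -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
#   write_istio_certificate "$ISTIO_IP"
```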

Then, apply the certificate:

kubectl apply -f certificate.yaml

Verify that the certificate was created successfully:

kubectl get certificate -n istio-system

Now we are ready to launch the Kubeflow Dashboard. To log in to the dashboard, type the Istio IP address into your browser. When you first access the dashboard, your browser may show a security warning because we are using a self-signed certificate. You can replace it with a certificate issued by a trusted certificate authority (CA) if you have one, or click on Advanced and Proceed to the website. The Dex login screen should appear. Enter your username and password. The default username for Kubeflow is user@example.com and the default password is 12341234.

Summary

In this tutorial, we went over how to install Kubeflow on an Azure Kubernetes Service (AKS) cluster with a confidential computing node pool. You are now ready to build and deploy scalable machine learning pipelines on Kubeflow. In the next tutorial, we will go over how to set up your Kubeflow Pipelines to ensure the pods are scheduled onto an Intel SGX VM node.

Next Steps