Intel® Trust Domain Extensions (Intel® TDX) Performance Analysis Reference Documentation

ID 828833
Updated 7/30/2024
Version 1.0
Public

Authors

  • Shiny Sebastian

  • Chao Gao

  • Farrah Chen


Introduction

This document should be treated as a reference for reporting the performance impact of Intel® Trust Domain Extensions (Intel® TDX) on Kernel-based Virtual Machine (KVM) and should be used in conjunction with other official performance recommendations for KVM/Linux. Most of the recommendations and tunings in this document are generic to KVM, and hence can also be applied to legacy guests with confidential computing disabled.

The intent and methodology behind performance analysis of any workload or use case vary depending on the scenario of interest. The scenario this document considers for performance comparison is the maximum throughput achievable by the workload in an operating environment (for example, the guest for a guest-based test, or the platform for a host-based test) while meeting the service level agreement (SLA) for the workload (for example, response time <1 ms for Redis). This generally implies that resource utilization of the operating environment is maximized: all or the respective CPUs are saturated, I/O devices like network cards or disks are saturated, or memory bandwidth is the bottleneck.

Performance impact is generally quantified as the impact on workload throughput and/or latency, together with the impact on system resource utilization and characteristics. In this document we provide basic guidelines for optimal configuration of the operating environment, basic sanity checks for Intel® TDX performance, and metrics for reporting performance impact.

How to use this document

We recommend following all steps for tests evaluating host and guest performance (for example, a standalone benchtop server or a server in a private cloud). We recommend following steps 4 and 5 for tests evaluating only guest performance (for example, a guest in a cloud environment).
 

  1. Configure the host to enable Intel TDX

  2. Configure and start the guest

  3. Tuning the host for optimal performance:

    a. If the guest is small enough to be contained within a NUMA* node, choose a physical NUMA node to run the guest on.

    b. Setup all guest devices, including disk and network, on the above NUMA node. (Section 3.3)

    c. Configure the memory allocation of each guest from the same NUMA node. (Section 3.2)

    d. Pin Virtual CPUs (vCPUs) and processes of the guest and device interrupts on separate physical CPUs on the same NUMA node to avoid any resource contention. Each can be configured to one physical CPU or a subset of CPUs. For example: vCPUs of the guest (Section 3.1), iothreads of the guest for disk (Section 3.4), vhosts of the guest for network (Section 3.6), interrupts of devices (Section 3.5), are each pinned to a separate physical CPU or a subset of physical CPUs for each type to ensure optimal resource utilization and minimum resource contention.

  4. Tuning the guest for optimal performance:

    a. Pin the processes and interrupts within the guest on separate vCPUs (or a subset) for optimal performance. For example, application processes (Section 3.7) and interrupts or softirqs (Section 3.5).

    b. However, if the guest is large and spans multiple NUMA nodes, configure NUMA locality of the application and interrupts inside the guest and the device/iothreads/vhosts in the host for optimal performance.

    Pinning processes and interrupts to separate CPUs improves performance by reducing resource contention and context-switching overhead. However, reduced cache locality between related processes, for instance network interrupts and a network-intensive application, may cost some performance.

  5. Basic sanity checks for Intel TDX performance:

    a. Run Intel® Memory Latency Checker (Intel® MLC) within the guest (Section 4.1).

    b. Measure Intel® MLC idle latencies and compare them across the configurations in Section 5. Ensure they meet the expected latencies. If not, repeat steps 3 and 4 above.

  6. To comprehend CPU overhead due to Intel TDX and for debug purposes, collect system metrics both in the guest and host as required (Section 6) while running the workload. This is not required for Intel MLC tool (Section 4.1).

  7. Post process the data and report the results (Section 7 and 8).

When reporting experiment results, please include results from at least 3 runs to quantify run-to-run (r2r) variation and ensure consistency.
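For example, the mean and run-to-run variation across three runs can be computed with a short awk one-liner. This is a minimal sketch; the three throughput values are placeholders to be replaced with real measurements:

```shell
#!/bin/bash
# Throughput results (e.g., requests/sec) from 3 runs -- placeholder values.
runs="105000 101000 103000"

# Mean, and run-to-run variation computed as (max - min) / mean * 100.
stats=$(echo "$runs" | awk '{
    min = $1; max = $1; sum = 0
    for (i = 1; i <= NF; i++) {
        sum += $i
        if ($i < min) min = $i
        if ($i > max) max = $i
    }
    mean = sum / NF
    printf "%.0f %.2f", mean, (max - min) / mean * 100
}')
echo "mean r2r%: $stats"
```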

Please consult with Intel for any further guidance.

1.0 Host Configuration for Reference


Enable Intel TDX-related BIOS settings, install Intel TDX-supported kernel & QEMU* on host and setup an Intel TDX guest image.

For instance:
 

Generally, the QEMU binary is qemu-system-x86_64, but in CentOS*, it is /usr/libexec/qemu-kvm. We use qemu-system-x86_64 in the examples in this document.

2.0 Guest configuration


Please consult with Intel for any QEMU command line modifications below.

For the rest of the document, we have considered a sample guest size of 8 vCPUs and 32GB memory.

2.1 Legacy VM

"Total Memory Encryption Bypass" option in the BIOS enables running a legacy non-confidential guest on the same host as an Intel TDX guest, minimizing performance impact. However, an administrator can disable all Intel TDX-related BIOS knobs mentioned in Step 3 for a fully non-confidential compute environment.

Below is a sample reference command to run from the host. The guest is allocated on NUMA node 0; debug-threads are enabled to help identify and affinitize qemu/vhost processes to physical CPUs; the PMU is disabled; cache is set to "none" to prevent guest I/O from being cached on the host; an iothread is assigned for each disk; and AIO is set to "native" to use kernel asynchronous I/O. Several of these parameters are recommended for performance reasons. Please set "img" to the absolute path of the guest image on the test platform, and change other parameters of the command below as appropriate.

#!/bin/bash
img=/home/tdx/tdx_guest.qcow2
numactl -N 0 -m 0 \
    qemu-system-x86_64 \
    -accel kvm \
    -name vmx,process=vmx,debug-threads=on \
    -cpu host,pmu=off -smp 8 \
    -m 32G \
    -object memory-backend-ram,prealloc=on,size=32G,id=ram1 \
    -object iothread,id=iothread0 \
    -drive file=$img,if=none,id=virtio-disk0,format=qcow2,cache=none,aio=native \
    -device virtio-blk-pci,drive=virtio-disk0,bootindex=0,iothread=iothread0 \
    -device virtio-net-pci,netdev=nic0 \
    -netdev user,id=nic0,hostfwd=tcp::10022-:22 \
    -bios /usr/share/qemu/OVMF.fd \
    -serial file:/tmp/vmx_serial.log \
    -nographic -vga none -nodefaults \
    -daemonize

2.2 Intel TDX guest

Below is a sample reference command to run from the host to start an Intel TDX guest. The guest is allocated on NUMA node 0; debug-threads are enabled to help identify and affinitize qemu/vhost processes to physical CPUs; the PMU is disabled; cache is set to "none" to prevent guest I/O from being cached on the host; an iothread is assigned for each disk; and AIO is set to "native" to use kernel asynchronous I/O. To enable Intel TDX, memory-encryption is set to "tdx".

As in the previous section, several of these parameters are recommended for performance and Intel TDX configuration reasons. Please set img to the absolute path of the guest image on the test platform and change other parameters of the command below as appropriate.

#!/bin/bash
img=/home/tdx/tdx_guest.qcow2
numactl -N 0 -m 0 \
    qemu-system-x86_64 \
    -accel kvm \
    -name tdxvm,process=tdxvm,debug-threads=on \
    -object tdx-guest,id=tdx \
    -cpu host,pmu=off -smp 8 \
    -m 32G \
    -object memory-backend-ram,size=32G,prealloc=on,id=ram1,private=on \
    -machine q35,hpet=off,kernel_irqchip=split,memory-encryption=tdx,memory-backend=ram1 \
    -object iothread,id=iothread0 \
    -drive file=$img,if=none,id=virtio-disk0,format=qcow2,cache=none,aio=native \
    -device virtio-blk-pci,drive=virtio-disk0,iothread=iothread0 \
    -device virtio-net-pci,netdev=nic0 \
    -netdev user,id=nic0,hostfwd=tcp::10022-:22 \
    -bios /usr/share/qemu/OVMF.fd \
    -serial file:/tmp/tdvm_serial.log  \
    -nographic -vga none -nodefaults \
    -daemonize

Note that future QEMU versions may use a different cmdline to launch Intel TDX guests. If the above cmdline fails with "Invalid parameter 'private'", please replace the 2 lines:

-object memory-backend-ram,size=32G,prealloc=on,id=ram1,private=on \
-machine q35,hpet=off,kernel_irqchip=split,memory-encryption=tdx,memory-backend=ram1 \

with:

-object memory-backend-ram,size=32G,prealloc=on,id=ram1 \
-machine q35,hpet=off,kernel_irqchip=split,confidential-guest-support=tdx,memory-backend=ram1 \
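Once the guest boots, one quick way to confirm it is actually running as an Intel TDX guest is to grep the guest kernel log. This is a minimal sketch; the dmesg line below is an illustrative sample, and the exact wording can vary by guest kernel version:

```shell
#!/bin/bash
# Illustrative check that a guest booted as an Intel TDX guest. The dmesg line
# below is a made-up sample; the exact message varies by kernel version.
sample="[    0.000000] Memory Encryption Features active: Intel TDX"

detected=$(echo "$sample" | grep -ci "tdx")
if [ "$detected" -ge 1 ]; then
    echo "TDX guest detected"
fi
# Inside a real guest, run: dmesg | grep -i tdx
```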

3.0 Host/Guest Tunings for Reference


We recommend the following for better and consistent performance.
 

  • To prevent CPU resource contention, we recommend pinning processes and interrupts on separate (or a subset of) CPUs, either physical CPUs (pCPUs) or virtual CPUs (vCPUs) in host vs guest respectively, depending on your environment.
  • For optimal performance it is recommended to have NUMA locality for all entities like Guest vCPUs, guest memory, IO device, device interrupts, IO threads and vhost threads. This helps prevent performance loss due to cross-socket increase in latencies and any bandwidth limitations.
  • Physical CPU0 (i.e., core or thread 0) is usually used by default for interrupt handling by several devices. Hence, we recommend that process or interrupt pinning avoid CPU0; for instance, start from physical CPU1 (set the CPU index according to NUMA locality).
  • To reduce potential run-to-run variations, we recommend turning off sleep states (C-states) and power states (P-states) in the BIOS. However, if there is a need to run at higher CPU frequencies, users can enable Turbo in the BIOS and, for more consistent results, set the CPU frequency using tools like cpupower on the host. For example, to set the CPU frequency to 3GHz:
cpupower frequency-set -d 3G -u 3G

Then use tools/commands like "turbostat" to confirm the CPU frequency is running at the value previously set.
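As a minimal sketch of such a confirmation, the busy-frequency column of turbostat output can be averaged with awk. The turbostat sample below is trimmed and made up; on a real host, pipe in live output instead (for example, turbostat --quiet --show Core,Bzy_MHz sleep 5):

```shell
#!/bin/bash
# Check that cores run near the configured 3GHz target. The turbostat output
# below is an illustrative, made-up sample.
sample="Core Bzy_MHz
0    2998
1    3001
2    2997
3    3004"

# Average the busy-frequency column, skipping the header row.
avg_mhz=$(echo "$sample" | awk 'NR > 1 { sum += $2; n++ } END { printf "%.0f", sum / n }')
echo "average busy frequency: ${avg_mhz} MHz"
```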

In the rest of this section, we share sample reference commands to bind each process and interrupt to an individual pCPU or vCPU, depending on the operating environment. Users can modify the commands to bind to a subset of pCPUs or vCPUs instead, depending on resource availability, to minimize resource contention.

3.1 vCPU Pinning

(applicable for host)

Sample command to pin QEMU process is shown below. Please set <cpu_index> and <vm_name> appropriately, where <cpu_index> stands for pCPU from which binding starts, and <vm_name> is the name assigned to the VM in qemu command with "-name xxx" as mentioned in Section 2.1 and 2.2.

#!/bin/bash
n=<cpu_index>
vm=<vm_name>
gpid=`ps -ef | grep qemu | grep $vm | awk '{print $2}'`
/usr/bin/taskset -cp $n $gpid
n=`expr $n + 1`
vcpu=`ps -ef|grep qemu|grep $vm|awk -F "smp" '{print $2}'|awk '{print $1}'`
vcpu=`expr $vcpu - 1`
for i in $(seq 0 1 $vcpu); do
    pid=`ps -T -p $gpid | grep "CPU $i" | grep -v grep | awk '{print $2}'`
    echo "Pin thread id ${pid}: on cpu $n "
    /usr/bin/taskset -cp $n $pid
    n=`expr $n + 1`
done

3.2 Memory Pinning

(applicable for both host and guest if multi-NUMA)

Allocate the memory of the guest on the same physical NUMA node as the vCPUs of the guest.

For instance, use "numactl -N 0 -m 0 <cmdline>" for the QEMU command line to start the guest and its vCPUs on NUMA node 0 and to also allocate its memory from the same node. Refer to Section 2.1 and 2.2 for a sample qemu command.

In a multi-NUMA environment where the guest spans over multiple CPU sockets, above command can be used inside the guest to allocate memory for the application from the appropriate/optimal NUMA node.
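To verify that the guest's memory actually landed on the intended node, the per-node totals from numastat -p can be checked. This is a sketch: the numastat output below is an illustrative, made-up sample; on a real host, feed in live output (for example, numastat -p $(pgrep -f qemu-system-x86_64)):

```shell
#!/bin/bash
# Illustrative check of guest memory NUMA locality. The numastat output below
# is a made-up sample for a guest intended to live on node 0.
sample="Per-node process memory usage (in MBs) for PID 4242 (qemu-system-x86)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Total                    32768.00            2.00        32770.00"

# Fraction of guest memory resident on node 0 (should be close to 100%).
pct_node0=$(echo "$sample" | awk '/^Total/ { printf "%.1f", $2 / $4 * 100 }')
echo "memory on node 0: ${pct_node0}%"
```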

3.3 Device Locality Check

(applicable for host and guest if multi-NUMA)

For device intensive performance tests, that use disk or network IO for instance, the guest should reside on the same NUMA node as the device for optimal performance. Sample command to identify the NUMA node of a network device "ens3f0np0":

cat /sys/class/net/ens3f0np0/device/numa_node


3.4 IOthread pinning

(applicable for both host and guest)

If the guest is configured with IOthreads, refer to sample command below to pin IOthreads. Please set <cpu_index> and <vm_name> appropriately, where <cpu_index> stands for pCPU from which the binding starts, and <vm_name> is the name assigned to the VM in qemu command with “-name xxx” as mentioned in section 2.1 and 2.2. Sample script:

#!/bin/bash
n=<cpu_index>
vm=<vm_name>
gpid=`ps -ef | grep qemu | grep $vm | awk '{print $2}'`
iothreadnum=`ps -T -p $gpid | grep iothread | grep -v grep | wc -l`
iothreadnum=`expr $iothreadnum - 1`
for i in $(seq 0 1 $iothreadnum); do
    pid=`ps -T -p $gpid | grep "iothread$i" | grep -v grep | awk '{print $2}'`
    /usr/bin/taskset -cp $n ${pid}
    n=`expr $n + 1`
done

3.5 Interrupt pinning

(applicable for both host and guest)

For tests that are device intensive, like disk or network, we recommend binding the interrupts of the device to separate pCPUs (or a subset of pCPUs) on the host and, similarly, to separate vCPUs in the guest. For example, below is a sample snippet for pinning interrupts of the network device ens3f0np0 starting from pCPU <cpu_index>. Set <cpu_index> appropriately to minimize resource contention. On the host we don't usually recommend a cpu_index of 0, as mentioned at the beginning of Section 3:

#!/bin/bash
cpu=<cpu_index>

for x in `cat /proc/interrupts | grep ens3f0np0 | awk '{ print $1 }' | cut -d':' -f 1`
do
    echo $cpu > /proc/irq/$x/smp_affinity_list
    cpu=$(($cpu+1))
done
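Note that /proc/irq/<n>/smp_affinity_list takes decimal CPU numbers, while its sibling /proc/irq/<n>/smp_affinity expects a hexadecimal CPU bitmask. If the bitmask form is preferred, a minimal sketch of deriving the mask for a single CPU index (valid for CPU indices 0-62 with 64-bit shell arithmetic):

```shell
#!/bin/bash
# Convert a CPU index into the hex bitmask format that smp_affinity expects.
# Valid for CPU indices 0-62 with 64-bit shell arithmetic.
cpu_to_mask() {
    printf '%x' $((1 << $1))
}

mask=$(cpu_to_mask 5)           # CPU 5 -> bit 5 -> hex 20
echo "CPU 5 mask: $mask"
# Usage on a real system (requires root), for IRQ number $x:
#   echo $(cpu_to_mask $cpu) > /proc/irq/$x/smp_affinity
```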

3.6 vhost pinning

(applicable for host)

If a guest is configured with vhost, refer to the sample command below to pin vhost processes. Please set <cpu_index> and <vm_name> appropriately, where <cpu_index> stands for the pCPU from which the binding starts, and <vm_name> is the name assigned to the VM in the qemu command with "-name xxx" as mentioned in Section 2.1 and 2.2. Sample script:

#!/bin/bash
n=<cpu_index> 
vm=<vm_name>
gpid=`ps -ef | grep qemu | grep $vm | awk '{print $2}'`
vhostnum=`ps -eLf | grep vhost-$gpid | grep -v grep | wc -l`

for i in $(seq 1 $vhostnum); do

        vhost1=`ps -eLf | grep vhost-$gpid | grep -v grep | awk '{print $2}' | awk "NR==$i"`
        taskset -pc $n $vhost1
        n=$((n+1))

done

3.7 Application process pinning

(applicable for both host and guest)

We recommend pinning the application to individual CPUs, on the host or in the guest, that are not used by IRQs, softirqs, or other processes. There are multiple ways to achieve this: for example, use the application's inherent parameters, or the taskset/numactl commands in Linux.
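A process's current affinity can also be inspected without extra tools by reading /proc/<pid>/status, which is a quick way to confirm pinning took effect. A minimal sketch (the taskset/numactl lines in the comments are illustrative usage, not commands from this document's test setup):

```shell
#!/bin/bash
# Inspect the CPU affinity of the current shell by reading /proc/self/status;
# Cpus_allowed_list is maintained by the kernel for every task.
affinity=$(awk '/Cpus_allowed_list/ { print $2 }' /proc/self/status)
echo "current allowed CPUs: $affinity"

# Typical pinning commands (illustrative):
#   taskset -c 2,3 ./app        # pin ./app to CPUs 2 and 3
#   numactl -C 2-3 -m 0 ./app   # pin to CPUs 2-3, memory from node 0
```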

4.0 Workload Commands and Download Sources


In this section, we discuss a few sample microbenchmarks, Intel MLC, Flexible IO Tester (FIO), and iPerf*, that can be used to stress memory, disk, and network respectively inside Intel TDX and legacy guests to assess Intel TDX performance.

4.1 Memory Latency Checker

Intel Memory Latency Checker (Intel MLC) is a tool used to measure memory latencies and bandwidth, and how they change with increasing load on the system. It also provides several options for finer-grained investigation, where bandwidth and latencies from a specific set of cores to caches or memory can be measured.
 

Run: Clean cache before executing Intel MLC each time:

sync; echo 3 > /proc/sys/vm/drop_caches

Run mlc without any parameters for information regarding bandwidth and loaded latencies of the operating environment:

./mlc

Metrics of interest are, "Peak Injection Memory Bandwidths" and "Loaded Latencies".

To get idle latency run the following:

    ./mlc --idle_latency -b4g -e -r -l128 -c0 -j0

Note: Ignore the below prompt about msr on a VM:

*** Unable to modify prefetchers (try executing 'modprobe msr') ***
So, enabling random access for latency measurements

To get consistent idle memory latency durations, we recommend turning OFF C/P states in BIOS and stabilizing frequencies and minimizing context switches by using affinizations as discussed in previous sections.

Data Sharing

For comparison, we recommend computing the ratio for the memory bandwidth items in the below table (units of MB/s) and the differences for the latency items (units of ns).

HIB: Higher is better
LIB: Lower is better

(Table: Intel MLC bandwidth and latency comparison metrics, annotated HIB/LIB)

4.2 FIO -- Flexible IO Tester

Flexible IO tester (FIO) is a microbenchmark used to issue different types of read/write operations to characterize disk devices.

First, install the tool:

    apt install fio #on Ubuntu
    dnf install fio #on CentOS

Next, for a disk stress test, we recommend assigning a separate virtual/hard disk for the test. Sample instructions to create and assign a new disk to the guest are below.

  1. Create a new virtual disk with the qemu-img tool:
    #qemu-img create -f qcow2 perf_test.qcow2 10G

This creates a disk file of size 10GB called perf_test.qcow2 in the current directory.

  2. Add the below parameters to the qemu command line to assign the virtual disk image to the guest:
-object iothread,id=iothread1 -drive file=perf_test.qcow2,if=none,id=virtio-disk1,format=qcow2,cache=none,aio=native -device virtio-blk-pci,drive=virtio-disk1,iothread=iothread1

  3. Check that the disks are visible in the guest after it boots up. You will find a virtual disk corresponding to the qcow2 file attached in the previous step, for instance /dev/vdb:
# ls /dev/vd*
/dev/vda    /dev/vda1   /dev/vda2   /dev/vda3   /dev/vdb

Refer to the sample commands below to run FIO inside an Intel TDX guest or legacy guest. Results are stored in the directory /home/Analysis/runs_data; the workload is pinned to CPU <cpu_index> and issues read/write operations to the /dev/vdb disk, using an IO size of 64KB for sequential operations and 4KB for random operations, asynchronous I/O with an iodepth of 1024 (sequential) or 120 (random), a 10-second ramp-up, and a 290-second runtime.

Set <cpu_index> to an appropriate vcpu number that reduces resource contention.

#!/bin/bash
datarun_dir='/home/Analysis/runs_data'
mkdir -p $datarun_dir

#sequential read 
cpu=<cpu_index>
numactl -C $cpu fio --name=job --filename=/dev/vdb --rw=read --bs=64k --ioengine=libaio --direct=1 --iodepth=1024 --numjobs=1 --clocksource=cpu  --norandommap --time_based --invalidate=1  --ramp_time=10 --runtime=290 > $datarun_dir/fio_seqread.txt

#sequential write
numactl -C $cpu fio --name=job --filename=/dev/vdb --rw=write --bs=64k --ioengine=libaio --direct=1 --iodepth=1024 --numjobs=1 --clocksource=cpu  --norandommap --time_based --invalidate=1  --ramp_time=10 --runtime=290 > $datarun_dir/fio_seqwrite.txt

#random read
numactl -C $cpu fio --name=job --filename=/dev/vdb --rw=randread --bs=4k --ioengine=libaio --direct=1 --iodepth=120 --numjobs=1 --clocksource=cpu  --norandommap --time_based --invalidate=1  --ramp_time=10 --runtime=290 > $datarun_dir/fio_randread.txt

#random write
numactl -C $cpu fio --name=job --filename=/dev/vdb --rw=randwrite --bs=4k --ioengine=libaio --direct=1 --iodepth=120 --numjobs=1 --clocksource=cpu  --norandommap --time_based --invalidate=1  --ramp_time=10 --runtime=290 > $datarun_dir/fio_randwrite.txt

Please configure these commands appropriately for the test environment.

Report average latency and throughput metrics from FIO. For example, refer the highlighted metrics for a sample Read operation:

[root@b49691a74ae8 runs_data]# numactl -C 4 fio --name=job --filename=/dev/vdb --rw=read --bs=64k --ioengine=libaio --direct=1 --iodepth=1024 --numjobs=1 --clocksource=cpu  --norandommap --time_based --invalidate=1  --ramp_time=10 --runtime=10
job: (g=0): rw=read, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=libaio, iodepth=1024
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=3208MiB/s][r=51.3k IOPS][eta 00m:00s]
job: (groupid=0, jobs=1): err= 0: pid=416222: Mon Feb 26 21:28:35 2024
  read: IOPS=50.2k, BW=3141MiB/s (3294MB/s)(30.8GiB/10024msec)
    slat (nsec): min=1290, max=954198, avg=19497.25, stdev=49570.03
    clat (usec): min=16391, max=43700, avg=20377.48, stdev=1864.71
     lat (usec): min=16394, max=43855, avg=20396.98, stdev=1865.32
    clat percentiles (usec):
     |  1.00th=[17171],  5.00th=[17433], 10.00th=[19268], 20.00th=[19792],
     | 30.00th=[19792], 40.00th=[19792], 50.00th=[19792], 60.00th=[20055],
     | 70.00th=[20055], 80.00th=[21103], 90.00th=[22414], 95.00th=[23987],
     | 99.00th=[26870], 99.50th=[27132], 99.90th=[33817], 99.95th=[38536],
     | 99.99th=[42730]
   bw (  MiB/s): min= 2661, max= 3225, per=100.00%, avg=3143.56, stdev=133.85, samples=20
   iops        : min=42585, max=51611, avg=50296.90, stdev=2141.63, samples=20
  lat (msec)   : 20=65.93%, 50=34.28%
  cpu          : usr=1.82%, sys=14.46%, ctx=56652, majf=0, minf=36
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=502803,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1024

Run status group 0 (all jobs):
   READ: bw=3141MiB/s (3294MB/s), 3141MiB/s-3141MiB/s (3294MB/s-3294MB/s), io=30.8GiB (33.0GB), run=10024-10024msec

Disk stats (read/write):
  vdb: ios=922364/400, merge=0/1559, ticks=20404170/2264, in_queue=20406433, util=99.78%
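The highlighted metrics can also be pulled out of saved FIO logs mechanically. A minimal sketch, using the two relevant lines from the sample output above as input; in practice, point it at a stored log such as $datarun_dir/fio_seqread.txt:

```shell
#!/bin/bash
# Extract average bandwidth and completion latency from a saved fio log.
# The two lines below are copied from the sample fio output above.
sample='  read: IOPS=50.2k, BW=3141MiB/s (3294MB/s)(30.8GiB/10024msec)
    clat (usec): min=16391, max=43700, avg=20377.48, stdev=1864.71'

# Bandwidth: the token following "BW=" on the IOPS line.
bw=$(echo "$sample" | awk -F'BW=' '/IOPS/ { split($2, a, " "); print a[1] }')
# Average completion latency: the value following "avg=" on the clat line.
avg_clat=$(echo "$sample" | awk -F'avg=' '/clat/ { split($2, a, ","); print a[1] }')
echo "bandwidth: $bw, avg completion latency: ${avg_clat} usec"
```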

4.3 iPerf3

iPerf3 is a tool for network performance measurement and tuning.

First, install the tool:

apt install iperf3  #on Ubuntu
dnf install iperf3  #on CentOS

Next, refer to the sample iPerf3 commands below, where multiple instances of the server and client run on individual CPUs on two different systems, starting from port number 53600 on both, for a duration of 300 seconds. Set <cpu_index> to an appropriate CPU number and <server_ip> to the IP address used by the iPerf3 server:

#Sample commands to run iperf3 server
#!/bin/bash
cpu=<cpu_index>
server_ip=<server_IP>
instances=<number_of_iperf3_instance> # 3 or 4 instances can saturate 100G NIC

netport=53600

for i in $(seq 0 $((instances-1))); do
iperf3 -B $server_ip -s -p $((netport+i)) -A $((cpu+i)) >iperf_host_server$i.txt 2>&1 &
done


#Sample commands for iperf3 client
#!/bin/bash
cpu=<cpu_index>
server_ip=<server_IP>
instances=<number_of_iperf3_instance> # 3 or 4 instances can saturate 100G NIC
receive=<0|1> # 0 for send test; 1 for receive test

if [ $receive -eq 1 ]; then
    mode="-R"
else
    mode=""
fi

netport=53600
dur=300

for i in $(seq 0 $((instances-1))); do
    iperf3 -c $server_ip -P 1 -t $dur -p $((netport+i)) -A $((cpu+i)) $mode >iperf_client_$i.txt 2>&1 &
done

sleep $((dur+3)) # wait for the test to finish

cat iperf_client_*.txt | awk '/receiver/{if($8=="Mbits/sec"){a+=$7/1000}else{a+=$7}}END{print "Accumulated recv BW (GBits/s): ", a}'
cat iperf_client_*.txt | awk '/sender/{if($8=="Mbits/sec"){a+=$7/1000}else{a+=$7}}END{print "Accumulated send BW (GBits/s): ", a}'    

Please configure these commands appropriately for the test environment.

Report "Bitrate" metric highlighted below. For example:

# iperf3 -c 150.150.1.10 -P 1 -t 10 -p 53600 -A 9
Connecting to host 150.150.1.10, port 53600
[  5] local 150.150.1.10 port 46648 connected to 150.150.1.20 port 53600
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  7.85 GBytes  67.4 Gbits/sec    0   1.94 MBytes
[  5]   1.00-2.00   sec  7.96 GBytes  68.3 Gbits/sec    0   2.44 MBytes
[  5]   2.00-3.00   sec  7.97 GBytes  68.5 Gbits/sec    0   2.87 MBytes
[  5]   3.00-4.00   sec  7.91 GBytes  68.0 Gbits/sec    0   2.87 MBytes
[  5]   4.00-5.00   sec  7.93 GBytes  68.1 Gbits/sec    0   3.00 MBytes
[  5]   5.00-6.00   sec  7.95 GBytes  68.3 Gbits/sec    0   3.00 MBytes
[  5]   6.00-7.00   sec  8.07 GBytes  69.3 Gbits/sec    0   3.00 MBytes
[  5]   7.00-8.00   sec  8.12 GBytes  69.8 Gbits/sec    0   4.68 MBytes
[  5]   8.00-9.00   sec  8.08 GBytes  69.4 Gbits/sec    0   4.68 MBytes
[  5]   9.00-10.00  sec  8.07 GBytes  69.3 Gbits/sec    0   4.68 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  79.9 GBytes  68.7 Gbits/sec    0             sender
[  5]   0.00-10.04  sec  79.9 GBytes  68.4 Gbits/sec                  receiver

5.0 Sanity check


To sanity check the test bed, compare the memory access idle latencies using Intel Memory Latency Checker (Intel MLC) for the following in the guest:

  • A = non-confidential legacy VM with both Intel TME and Intel TME bypass disabled in BIOS
  • B = non-confidential legacy VM with Intel TME enabled and Intel TME bypass disabled in BIOS
  • C = non-confidential legacy VM with Intel TME enabled and Intel TME bypass enabled in BIOS

Note: Intel TME bypass enabled with Intel TME disabled isn't a valid configuration. Check with the platform vendor.

Please report the following to Intel:
 

  • Intel TME overhead: B-A.
  • Intel TME bypass overhead: C-A.

If the latencies or overheads do not meet expectations, please ensure the platform has the latest BIOS/firmware and kernel from the vendor. Also ensure all the right BIOS knobs are enabled for performance, as recommended by the platform vendor.
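The overhead arithmetic above is a simple difference of idle latencies. A minimal sketch with hypothetical Intel MLC idle latencies (the three values below are placeholders, not measured results):

```shell
#!/bin/bash
# Hypothetical Intel MLC idle latencies (ns) for the three configurations.
lat_A=108.0   # TME disabled,  TME bypass disabled
lat_B=111.5   # TME enabled,   TME bypass disabled
lat_C=108.4   # TME enabled,   TME bypass enabled

# Intel TME overhead = B - A; Intel TME bypass overhead = C - A.
tme_overhead=$(awk -v a=$lat_A -v b=$lat_B 'BEGIN { printf "%.1f", b - a }')
bypass_overhead=$(awk -v a=$lat_A -v c=$lat_C 'BEGIN { printf "%.1f", c - a }')
echo "Intel TME overhead: ${tme_overhead} ns; Intel TME bypass overhead: ${bypass_overhead} ns"
```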

6.0 System metrics data collection and Analysis for debugging and tuning purposes


Collect below system metrics in host and guest while running workloads and post process them to confirm that the workload is tuned for better performance.

6.1 Data Collection in Host

We recommend collecting CPU, disk, network, memory, and interrupt statistics on the host while the workload is running. This is applicable for all workloads except Intel MLC.

Install sysstat tool:

apt install sysstat # on Ubuntu.
dnf install sysstat # on Centos

Collect system statistics:

#!/bin/bash
#flush the cache
sync; echo 3 > /proc/sys/vm/drop_caches 
#kill any previously running sar/iostat/mpstat instances
pkill sar; pkill iostat; pkill mpstat

#create a directory to store data from the tools
mkdir -p /home/Analysis/runs_data

#move to data directory
cd /home/Analysis/runs_data

#collect sar for CPU, memory and other stats and iostat for disk IO stats and mpstat for interrupt rate stats. sar.out is the output file for stats from sar.
# below commands gather system stats once every 10 seconds, 30 times. i.e overall 300 seconds. User can change it to any meaningful values. If the workload has steady and uniform characteristics, as in the case of fio/iperf, short durations are ample enough to give representative data.

sar -A -o sar.out 10 30 > /dev/null 2>&1 & iostat -d -k -x -y 10 30 > io.txt & mpstat -I CPU 10 30 > mpstat.txt &

You can also capture guest exit statistics on the host using kvm_stats.

Install kvm_stat from kernel tools:

apt install linux-tools       #On Ubuntu
dnf install kernel-tools      #On Centos

Capture guest exit statistics:

#identify the process ID, i.e PID of the qemu guest process.
qemu_pid=`ps -ef | grep qemu | grep -v grep | awk '{ print $2}'`

# collect guest exit stats of the corresponding PID.
kvm_stat -p $qemu_pid -l > kvm_stat_guest.txt 2>&1 &

#kvm_stat is collected for 20 seconds. If the workload has steady and uniform characteristics, as in the case of fio/iperf, short durations are ample enough to give representative data.
sleep 20

#at the end of sleep duration, kill kvm_stat to stop data collection

pid=`ps -ef | grep kvm_stat | grep -v grep | awk '{print $2}'`
kill -9 $pid

To identify hotspots in the host:

#captures CPU hotspot profiles on all CPUs along with call stacks for 10 seconds.
perf record -a -g -- sleep 10

#post process the output perf.data from the above command
perf report -i perf.data --no-children

6.2 Data collection in guest

Similarly, we recommend collecting CPU, disk, network, memory and interrupt statistics in the guest while the workload is running, except for Intel MLC.

Install sysstat tool:

apt install sysstat # on Ubuntu.
dnf install sysstat # on Centos

Collect system statistics:

#flush the cache
sync; echo 3 > /proc/sys/vm/drop_caches 
#kill any previously running sar/iostat/mpstat instances
pkill sar; pkill iostat; pkill mpstat

#create a directory to store data from the tools
mkdir -p /home/Analysis/runs_data

#move to data directory
cd /home/Analysis/runs_data

#collect sar for CPU, memory and other stats and iostat for disk IO stats and mpstat for interrupt rate stats. sar.out is the output file for stats from sar.
# below commands gather system stats once every 10 seconds, 25 times. i.e overall 250 seconds. User can change it to any meaningful values. If the workload has steady and uniform characteristics, as in the case of fio/iperf, short durations are ample enough to give representative data.

sar -A -o sar.out 10 25 > /dev/null 2>&1 & iostat -d -k -x -y 10 25 > io.txt & mpstat -I CPU 10 25 > mpstat.txt &

For data analysis, copy all the data to host.

First create a directory on the host:

mkdir /home/Analysis/runs_data/guest

Next, copy the data from the guest into the host, where <host_ip> is the IP address of the host:

scp /home/Analysis/runs_data/* <host_ip>:/home/Analysis/runs_data/guest

To identify hotspots in the guest:

#captures CPU hotspot profiles on all CPUs along with call stacks for 10 seconds.
perf record -a -g -- sleep 10

#post process the output perf.data from the above command
perf report -i perf.data --no-children

6.3 Data post processing on host and guest

Post process the output from sysstat to show CPU, memory, network, and disk utilization on the host and guest.

#move to directory location with data files collected in above sections
cd /home/Analysis/runs_data;

#extract disk stats from sar.out
sadf -d sar.out -- -d -p > d.csv; sed -i 's/;/,/g' d.csv;

#extract memory stats from sar.out
sadf -d sar.out -- -r > r.csv; sed -i 's/;/,/g' r.csv;

#extract network device stats from sar.out
sadf -d sar.out -- -n DEV > n.csv; sed -i 's/;/,/g' n.csv;

#extract CPU stats from sar.out
sadf -d sar.out -- -u ALL -P ALL > CPU.csv; sed -i 's/;/,/g' CPU.csv;

#extract all stats from sar.out
sar -A -f sar.out > sar_all.dat;

#extract IO stats from io.txt 
echo "Device,r/s,rkB/s,rrqm/s,%rrqm,r_await,rareq-sz,w/s,wkB/s,wrqm/s,%wrqm,w_await,wareq-sz,d/s,dkB/s,drqm/s,%drqm,d_await,dareq-sz,f/s,f_await,aqu-sz,%util" > io.csv
awk '!/^Linux/' io.txt | awk NF | awk '!/^Device/' | tr -s ' ' ',' >> io.csv

#capture other system logs
dmesg > dmesg.txt
lscpu > lscpu.txt
cat /proc/meminfo > proc_memory.txt

7.0 Format for Data Reporting to Intel


Please share the following with Intel:
 

a. Workload performance results from individual runs, average and run to run variations.

b. Average or Cumulative CPU statistics on the host and guest.

c. Average memory statistics on the host and guest

d. Average network statistics on host and guest.

e. Average disk statistics on host and guest.

f. Average interrupt statistics on host and guest.

g. Average or Cumulative kvm exit statistics on host.

h. Reports from perf tool for hot spot analysis on host and guest.

i. Intel MLC results with Intel TME bypass vs. Intel TME enabled (Intel TME overhead): idle latency and memory bandwidth comparisons.

Further details are welcome.
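One way to package everything listed above for sharing is a single archive of the data directory. The sketch below is illustrative only: the document collects data under /home/Analysis/runs_data, but DATA_DIR defaults to a relative path here so the sketch runs anywhere, and the archive name is an assumption, not a required format.

```shell
# Sketch: bundle the collected data files into one archive for sharing.
# DATA_DIR defaults to a relative path for illustration; point it at the
# real data directory (e.g. /home/Analysis/runs_data) before running.
DATA_DIR=${DATA_DIR:-runs_data}
mkdir -p "$DATA_DIR"
OUT=tdx_perf_data_$(date +%Y%m%d).tar.gz
tar czf "$OUT" "$DATA_DIR"
ls -lh "$OUT"
```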

8.0 Data Post Processing Sample Scripts


Run the commands below to extract average metrics from the data collected in Section 6.3.

#!/bin/bash

############################################################################################################################################
#Sample script from Intel for summarizing post processed data for reporting.
############################################################################################################################################

echo "##########################################"
echo "HOST REPORTING"
echo "##########################################"

#data directory (first script argument)
dir_loc=$1

echo "Average     CPU      %usr     %nice      %sys   %iowait    %steal      %irq     %soft    %guest    %gnice     %idle"
cat $dir_loc/sar_all.dat | grep Average | grep all | head -n 1

echo
echo "TotalIntrs/sec"
#sum the interrupt columns on each per-CPU line, then total across all CPUs
#(adjust 'tail -n 64' to the number of CPUs on the system)
cat $dir_loc/mpstat.txt | tail -n 64 | awk '{sum = 0; for (i = 4; i <= NF; i++){ sum += $i }; print sum;}' | awk '{ total += $1 } END { print total }'

echo 
cat $dir_loc/io.txt | grep Device | head -n 1
#average each iostat column for the workload disk (replace nvme0n1 with your disk device)
cat $dir_loc/io.txt | grep nvme0n1 | awk ' FNR==1 { nf=NF} {  for(i=1; i<=NF; i++)    arr[i]+=$i ; fnr=FNR } END {  FS="\t";  for( i=1; i<=nf; i++)    printf("%.3f%s", arr[i] / fnr, (i==nf) ? "\n" : FS) }'

echo
echo -e "Network\tinterval\ttimestamp\tIFACE\t rxpck/s \t txpck/s \t rxkB/s \t txkB/s \t rxcmp/s \t txcmp/s \t rxmcst/s \t %ifutil"
#average each network column (replace eno8303 with the host network interface)
cat $dir_loc/n.csv | grep eno8303 | awk ' BEGIN { FS = "," } ; FNR==1 { nf=NF} {  for(i=1; i<=NF; i++)    arr[i]+=$i ; fnr=FNR } END { FS="\t";   for( i=1; i<=nf; i++)    printf("%.3f%s", arr[i] / fnr, (i==nf) ? "\n" : FS) }'

echo 
#average the kvm exit statistics collected on the host
cat $dir_loc/kvm_stat_guest.txt | head -n 1
cat $dir_loc/kvm_stat_guest.txt | awk '(NR>1)' | awk ' FNR==1 { nf=NF} {  for(i=1; i<=NF; i++) arr[i]+=$i ; fnr=FNR } END {   for( i=3; i<=nf; i++)    printf("%.3f%s", arr[i] / fnr, (i==nf) ? "\n" : FS) }'

echo

echo "##########################################"
echo "GUEST REPORTING"
echo "##########################################"

#guest data collected under the same root directory
dir_loc="$1/guest"


echo "Average     CPU      %usr     %nice      %sys   %iowait    %steal      %irq     %soft    %guest    %gnice     %idle"
cat $dir_loc/sar_all.dat | grep Average | grep all | head -n 1

echo
echo "TotalIntrs/sec"
#(adjust 'tail -n 64' to the number of guest vCPUs)
cat $dir_loc/mpstat.txt | tail -n 64 | awk '{sum = 0; for (i = 4; i <= NF; i++){ sum += $i }; print sum;}' | awk '{ total += $1 } END { print total }'

echo
cat $dir_loc/io.txt | grep Device | head -n 1
#average iostat columns per guest disk (replace vdb/vdc/vdd with the guest's disk devices)
cat $dir_loc/io.txt | grep vdb | awk ' FNR==1 { nf=NF} {  for(i=1; i<=NF; i++)    arr[i]+=$i ; fnr=FNR } END {  FS="\t";  for( i=1; i<=nf; i++)    printf("%.3f%s", arr[i] / fnr, (i==nf) ? "\n" : FS) }'
cat $dir_loc/io.txt | grep vdc | awk ' FNR==1 { nf=NF} {  for(i=1; i<=NF; i++)    arr[i]+=$i ; fnr=FNR } END {  FS="\t";  for( i=1; i<=nf; i++)    printf("%.3f%s", arr[i] / fnr, (i==nf) ? "\n" : FS) }'
cat $dir_loc/io.txt | grep vdd | awk ' FNR==1 { nf=NF} {  for(i=1; i<=NF; i++)    arr[i]+=$i ; fnr=FNR } END {  FS="\t";  for( i=1; i<=nf; i++)    printf("%.3f%s", arr[i] / fnr, (i==nf) ? "\n" : FS) }'

echo
echo -e "Network\tinterval\ttimestamp\tIFACE\t rxpck/s \t txpck/s \t rxkB/s \t txkB/s \t rxcmp/s \t txcmp/s \t rxmcst/s \t %ifutil"
#average each network column (replace ens9 with the guest network interface)
cat $dir_loc/n.csv | grep ens9 | awk ' BEGIN { FS = "," } ; FNR==1 { nf=NF} {  for(i=1; i<=NF; i++)    arr[i]+=$i ; fnr=FNR } END { FS="\t";   for( i=1; i<=nf; i++)    printf("%.3f%s", arr[i] / fnr, (i==nf) ? "\n" : FS) }'

echo
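The column-averaging awk idiom the script applies to n.csv and io.txt can be seen in isolation on a hypothetical two-sample CSV (the interface name and values are invented for the example): each numeric column is accumulated across rows and divided by the row count.

```shell
# Two hypothetical samples for one interface; each numeric column gets averaged.
printf 'eno8303,10.0,20.0\neno8303,30.0,40.0\n' > /tmp/n_sample.csv

# Same shape as the script's one-liners: accumulate per column, divide by the
# row count (starting at column 2 here to skip the non-numeric interface name).
grep eno8303 /tmp/n_sample.csv | awk 'BEGIN { FS = "," } FNR==1 { nf = NF } { for (i = 2; i <= NF; i++) arr[i] += $i; fnr = FNR } END { for (i = 2; i <= nf; i++) printf("%.3f%s", arr[i] / fnr, (i == nf) ? "\n" : "\t") }'
# -> 20.000	30.000
```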

References, Notices and Disclaimers


Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that a. you may publish an unmodified copy and b. code included in this document is licensed subject to the Zero-Clause BSD open source license (0BSD), Open Source Initiative. You may create software implementations based on this document and in compliance with the foregoing that are intended to run on the Intel product(s) referenced in this document.