
Performance Optimizations for Gaming on Virtual Machines with KVM/crosvm

Compared to native execution, Steam gaming workloads experience a performance gap when executed in a VM environment. Here's what we did to fix this:

  • Experimented with interprocedural optimization (IPO) through link-time optimization (LTO).

  • Tuned the swappiness value in the kernel.

  • Our implementation achieved an FPS gain of about 5% from LTO and 4% to 9% from swappiness tuning, along with smoother frame times and no spikes.


Introduction

Games running in a virtual environment show a performance regression compared to native execution. In the Borealis VM environment, the average performance gap was between 5% and 35% across various Steam gaming workloads. In this article, we present the performance gains achieved by supporting interprocedural optimization (IPO) and an optimal swappiness [1] configuration.

Our implementation achieved a 5% to 9% performance gain in games and gaming benchmarks. With these optimizations, the overall VM performance gap relative to native was reduced by 5% to 10%.

Description

A virtual machine manager (VMM), or “hypervisor,” is one of many hardware virtualization techniques that allow multiple operating systems, called guests, to run concurrently on a host computer. It is so named because it is conceptually one level higher than a supervisory program. The hypervisor presents a virtual operating platform to the guest operating systems and manages their execution. Multiple instances of different operating systems may share the virtualized hardware resources.

Figure 1: VM High Level Architecture – VMM Layer

 

There are two types of hypervisors: Type 1, which runs directly on bare metal, and Type 2, which runs on top of a host operating system. As shown in Figure 1, crosvm is a VMM, or a Type 2 hypervisor, that runs entirely in user space and works in tandem with KVM’s kernel component to set up and execute VMs. crosvm defines the virtual hardware platform and memory layout of a VM, manages the execution lifecycle of the VM, and most importantly, implements the logic required to handle the device accesses that KVM cannot handle.

Borealis VM [3]

Borealis is the Google code name for the VM solution used for running Steam games. Steam on ChromeOS leverages the work that has been done for several years with Crostini and ChromeOS’s virtual machine monitor (VMM). Running Steam and its games within a VM is the best way to deliver on ChromeOS’s goals of speed, security, and simplicity. Having a clear security boundary allows users to keep their data safe. Games aren’t allowed direct access to a user’s files, host GPU, or host kernel. Instead, separation is maintained by our proven VMM used for Crostini, Android, and Parallels. This provides another layer of security above normal Linux systems.

Building on Valve’s great work to make games compatible with the Steam Deck, the gaming VM runs a modified version of Arch Linux designed specifically for gaming. This image, codenamed “Borealis”, is automatically kept up to date and updated with each ChromeOS release, ensuring users don’t need to deal with updating drivers or libraries themselves. Steam and all its games are kept within a single VM image; if a user ever needs to uninstall it, they know everything will be removed in a single step, in keeping with ChromeOS’s goal of simplicity.

Figure 2: Borealis VM Architecture

 

As shown in Figure 2, to guarantee high levels of performance, the VM is invisible to the user from both an operational and a performance perspective. crosvm relies upon the Linux KVM hypervisor and uses paravirtualized virtio-based devices instead of emulation. To play the latest Vulkan games, the VM includes the Venus virtual driver, a very low overhead Vulkan virtualization driver that offers near-native performance.

 

Interprocedural optimization:

Background

Interprocedural optimization (IPO) [4] is an automatic, multi-step process that allows the compiler to analyze your code to determine where you can benefit from specific optimizations. IPO is a collection of compiler techniques used to improve performance in programs containing many frequently used functions of small or medium length. IPO differs from other compiler optimizations because it analyzes the entire program; other optimizations look at only a single function, or even a single block of code.

IPO seeks to reduce or eliminate duplicate calculations and inefficient use of memory, and to simplify iterative sequences, such as loops. If a call to another routine occurs within a loop, IPO analysis may determine that it is best to inline that routine. Additionally, IPO may reorder the routines for better memory layout and locality.

IPO may also include typical compiler optimizations on a whole-program level, such as dead code elimination (DCE), which removes code that is never executed. To accomplish this, the compiler tests for branches that are never taken and removes the code in those branches. IPO also tries to ensure better use of constants. Modern compilers offer IPO as an option at compile time. The actual IPO process may occur at any step between the human-readable source code and the finished executable binary program.

  • Effective IPO across translation units (module files) requires knowledge of the "entry points" of the program so that a whole program optimization (WPO) can be run.
  • In many cases, this is implemented as a link-time optimization (LTO) pass, because the whole program is visible to the linker.

LLVM features powerful intermodular optimizations that can be used at link time. Link time optimization (LTO) is another name for intermodular optimization that is performed during the link stage.

Figure 3: Link Time Optimization Flow
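
The two-stage flow can be sketched as a small command generator. This is a minimal illustration, not the build configuration used for libvirglrenderer: it assumes a clang or GCC style driver that accepts the -flto flag, and the function name and file names are hypothetical.

```python
from pathlib import PurePosixPath


def lto_build_commands(sources, output, cc="clang", opt_level="-O2"):
    """Return the compile and link commands for an LTO build.

    With -flto, each compile step emits an object file that carries the
    compiler's intermediate representation, and the final link step
    optimizes across all of those objects at once.
    """
    commands = []
    objects = []
    for src in sources:
        obj = str(PurePosixPath(src).with_suffix(".o"))
        # Compile step: -flto defers cross-module optimization to link time.
        commands.append([cc, opt_level, "-flto", "-c", src, "-o", obj])
        objects.append(obj)
    # Link step: -flto is passed again so the LTO pass runs during linking.
    commands.append([cc, opt_level, "-flto", *objects, "-o", output])
    return commands


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for cmd in lto_build_commands(["renderer.c", "decode.c"], "example_app"):
        print(" ".join(cmd))
```

Passing the flag at both stages matters: without it at link time, the linker would treat the objects as ordinary inputs and the whole-program view would be lost.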

 

“libvirglrenderer” is a library used by crosvm to implement 3D GPU support for the virtio GPU. As shown in Figure 4, libvirglrenderer is a critical module that consumes a significant share of CPU cycles. This module can be optimized by enabling LTO compiler options. The compiler then views the module as a whole and applies optimizations such as loop optimization and inlining. This eliminates redundant functions and optimizes the remaining ones to minimize CPU utilization, which in turn saves CPU cycles on the transitions between the guest kernel virtio device frontend driver and the VMM virtio device backend driver. It also helped improve FPS for gaming workloads.

Figure 4: CPU Utilization

 

The performance results of enabling LTO in the libvirglrenderer module are presented in the Results and discussion section.

 

Swappiness:

Background - ZRAM:

Swap functionality is analogous to the swap file on Windows. When main memory usage reaches its threshold, data is written to the hard disk. When that data is needed again, it is read back from the disk and processed. This transition has latency, and even the fastest SSD reads and writes are slower than RAM. With ZRAM, also known as swap in memory, pages that are not needed immediately are compressed and moved to a reserved area of RAM instead. The key advantage of this feature is speed: reading a swap area in RAM is much faster than reading a swap partition on a hard drive.

| Storage path | Relative speed |
| --- | --- |
| Swap on disk (swap partition or file) | Slow |
| Swap in memory (zram) | Faster than swap on disk |
| RAM access | Fastest |

 

In addition, ZRAM uses little RAM because the data is stored in compressed form, which is advantageous for devices with less RAM. One disadvantage of ZRAM is the CPU time taken to compress and decompress the data, which in turn consumes power.
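
The memory saved by compression can be estimated from the statistics that the zram driver exposes. The sketch below assumes the mm_stat file under /sys/block/zramN/, whose first three fields are the original data size, the compressed data size, and the total memory used, all in bytes. The function name and the sample values are hypothetical.

```python
def zram_compression_ratio(mm_stat_text):
    """Parse the contents of /sys/block/zramN/mm_stat.

    The first three whitespace-separated fields are, in bytes:
    orig_data_size, compr_data_size, mem_used_total.
    Returns (ratio, bytes_saved), or (None, 0) if nothing is stored yet.
    """
    fields = mm_stat_text.split()
    orig_size, _compr_size, mem_used = (int(v) for v in fields[:3])
    if mem_used == 0:
        return None, 0
    # Ratio of uncompressed data to the RAM actually consumed by zram.
    return orig_size / mem_used, orig_size - mem_used


if __name__ == "__main__":
    # Hypothetical sample: 1 GiB of pages stored in 288 MiB of RAM.
    sample = "1073741824 268435456 301989888 0 301989888 0 0 0"
    ratio, saved = zram_compression_ratio(sample)
    print(f"compression ratio {ratio:.2f}, {saved / 2**20:.0f} MiB saved")
```

The ratio is computed against total memory used rather than the compressed size alone, so allocator overhead counts against the savings.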

Swappiness:

Swappiness [5] is a property of the Linux kernel that controls the balance between swapping out runtime memory and dropping pages from the system page cache. Swappiness can be set to values between 0 and 100, inclusive (newer kernels accept values up to 200 [1]). A low value means the kernel tries to avoid swapping as much as possible, whereas a higher value makes the kernel aggressively try to use swap space.

As shown in Figure 5, gaming workloads spent a significant amount of CPU time in ZRAM, compressing and storing data. lzo_compress.ko is the kernel module behind the ZRAM compression stream. This shows that there is potential for gains from tuning swappiness.
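
As a rough illustration, the current value can be read from /proc/sys/vm/swappiness, and a persistent setting is normally expressed as a vm.swappiness line in a sysctl configuration file. The helper names below are hypothetical, and the range check assumes the 0 to 100 range described above.

```python
from pathlib import Path

SWAPPINESS_PATH = Path("/proc/sys/vm/swappiness")


def read_swappiness(path=SWAPPINESS_PATH):
    """Return the current swappiness value as an integer."""
    return int(Path(path).read_text().strip())


def sysctl_line(value, max_value=100):
    """Return a sysctl configuration line for the given swappiness value.

    max_value is an assumption taken from the range described in the text.
    """
    if not 0 <= value <= max_value:
        raise ValueError(f"swappiness must be between 0 and {max_value}")
    return f"vm.swappiness = {value}"


if __name__ == "__main__":
    # The proc file only exists on Linux, so check before reading it.
    if SWAPPINESS_PATH.exists():
        print("current swappiness:", read_swappiness())
    print(sysctl_line(10))
```

Writing the value back requires root privileges, which is why this sketch only reads the live setting and formats the configuration line.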

Figure 5: ZRAM CPU Utilization

 

The performance results of setting an optimal swappiness value are presented in the Results and discussion section.

Results and discussion

Note: For all performance and power assessments, a median of 3 iterations is used with variance removed.
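
One way to turn such runs into a percentage gain is sketched below. The FPS values are hypothetical placeholders, not measurements from this study, and the function name is an assumption.

```python
from statistics import median


def percent_gain(baseline_runs, optimized_runs):
    """Percentage change of the median optimized FPS over the median baseline."""
    base = median(baseline_runs)
    opt = median(optimized_runs)
    return (opt - base) / base * 100.0


if __name__ == "__main__":
    # Hypothetical average-FPS results from three iterations each.
    baseline = [58.0, 60.0, 61.0]
    optimized = [62.0, 63.0, 65.0]
    print(f"gain: {percent_gain(baseline, optimized):.1f}%")
```

Using the median of three iterations means a single outlier run does not shift the reported figure.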

In our experiments, we selected the games/benchmarks shown in Table 1. Details of graphics settings and resolution are also specified in the table.

Table 1: Games and Gaming Benchmarks Settings

| Game | Graphics API | Resolution | Graphics Setting | Type |
| --- | --- | --- | --- | --- |
| XCOM2 | OpenGL | 1920x1080 | Low | Game |
| Metro 2033 Redux | OpenGL | 1920x1080 | Low | Game |
| Dota2 - OpenGL | OpenGL | 1920x1080 | Low | Game |
| Spaceship | OpenGL | 1920x1080 | Low | Gaming Benchmark |
| Tomb Raider | OpenGL | 1920x1080 | Low | Gaming Benchmark |

 


Link time optimization results

 

Figure 6 shows the code layout and function optimizations that result from enabling link time optimization in the libvirglrenderer module. As shown in Figure 7, we observed an average 5% gain in average FPS for the gaming titles.

Figure 6: Code layout with and without LTO

 

Figure 7: Borealis – R105:  FPS Comparison Base vs LTO

 

As shown in Figure 8, with link time optimization enabled, we observed lower frame times, which in turn make the game display smoother.

Figure 8: Borealis R105: Frame Time Distribution: Base vs LTO

 

As shown in Figure 9, with link time optimization enabled, we observed that frame time percentage increased.

Figure 9: Borealis R105: Frame Time % Distribution:  Base vs LTO
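
Frame time comparisons like these can be summarized with a few statistics. The sketch below is an assumption about how such a summary might be computed, not the tooling used in this study: the spike threshold of twice the median frame time and the sample values are hypothetical.

```python
import math
from statistics import mean, median


def frame_time_summary(frame_times_ms, spike_factor=2.0):
    """Summarize a list of per-frame times given in milliseconds.

    A frame counts as a spike when it takes longer than spike_factor times
    the median frame time (an illustrative threshold, not a standard one).
    """
    ordered = sorted(frame_times_ms)
    med = median(ordered)
    # Nearest-rank 99th percentile.
    rank = max(1, math.ceil(0.99 * len(ordered)))
    p99 = ordered[rank - 1]
    spikes = sum(1 for t in ordered if t > spike_factor * med)
    return {
        "avg_fps": 1000.0 / mean(ordered),
        "median_ms": med,
        "p99_ms": p99,
        "spikes": spikes,
    }


if __name__ == "__main__":
    # Hypothetical capture: nine steady frames and one slow frame.
    print(frame_time_summary([16.0] * 9 + [40.0]))
```

A lower median and 99th percentile with fewer spikes corresponds to the smoother display described above, even when average FPS changes only slightly.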

Swappiness optimization results:

Setting the optimal value for the swappiness kernel parameter reduces the CPU time spent on the compression stream, which saves CPU cycles. The freed CPU time is then available to the Borealis VM, which achieved higher FPS in games.

Refer to Figure 10 for the difference in CPU time when the swappiness kernel value is set to 10. The default value is 60.

Figure 10:  Compression Module CPU Utilization Reduction

 

As shown in Figure 11, with the optimal value of Swappiness (10), we observed an average 6% gain in minimum FPS for the gaming titles.

Figure 11: FPS Comparison Base - Swappiness (60) vs Swappiness (10)

 

As shown in Figure 12, with the optimal value of Swappiness (10), we observed lower frame times, which in turn make the game display smoother.

Figure 12: Frame Time Distribution Swappiness Default (60) vs Swappiness (10)

 

As shown in Figure 13, with the optimal value of Swappiness (10), we observed that the frame time percentage increased.

Figure 13: Frame Time % Distribution Swappiness (60) vs Swappiness (10)

 

Summary

In this article, we outlined the performance gap that Steam gaming workloads experience when executed in a VM environment compared to native. Experiments were carried out to support interprocedural optimization (LTO) and to set the swappiness value in the kernel. Our implementation achieved an average 5% improvement from LTO and an average 4% to 9% improvement from swappiness tuning in the FPS of games and gaming benchmarks. In addition, frame times are smoother, and no spikes were observed. These optimizations were accepted by Google and are in the process of being integrated into ChromeOS.

Test configuration

Software: Arch Linux, OpenGL ES 3.1 Mesa Support

Hardware: Asus Chromebook, Intel® Core™ i7-1165G7 processor, 4 cores, 8 threads, 16 GB RAM

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be secure. Check with your system manufacturer or retailer or learn more at www.intel.com.

 

References

 

[1] [Online]. Available: https://docs.kernel.org/admin-guide/sysctl/vm.html.

[2] "RUST - A language empowering everyone," [Online]. Available: https://www.rust-lang.org/.

[3] "Borealis VM," [Online]. Available: https://chromeos.dev/en/posts/bringing-steam-to-chromeos.

[4] "Inter Procedural Optimization," [Online]. Available: https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/optimization-and-programming/interprocedural-optimization.html.

[5] "Swappiness," [Online]. Available: https://access.redhat.com/solutions/103833.

[6] Prilik, "Para virtualized Devices in crosvm - a Performance Panacea for Modern VMs," [Online]. Available: https://prilik.com/blog/post/crosvm-paravirt/.