Running “Large” Software on Wind River® Simics® Virtual Platforms, Then and Now

 
This is the first of a series of posts that will look at the past, present, and future of Wind River® Simics® virtual platforms. The year 2018 marks 20 years since the commercial launch of Simics as a product. Back in 1998, Simics was marketed by a startup called Virtutech, which Wind River, an Intel company, acquired in 2010. 
To run real-world software on the virtual platform, the Simics team has always strived for target scalability and simulation speed. Real server workloads already ran on Simics 20 years ago. But what counted as “large” software then and what that means today are two different things. Let’s take a look back and then make a comparison to what Simics runs today. 

Then…

A seminal paper published 20 years ago described how Simics would boot unmodified Sun* Solaris* 2.6 and Linux* 2.0.30 operating systems (OSs) on a quad-processor virtual sun4m-architecture-based SPARCstation* using disk images from a real machine. The setup was used to run the Mozilla* 5.0 web browser, as well as the Transaction Processing Performance Council’s TPC-D database benchmark software, using a PostgreSQL* database. It was remarkable at the time: full-system simulation and virtual platforms were still new. Just taking a software stack from a real machine and running it inside a simulator quickly enough for interactive usage was practically unheard of. 
The simulated sun4m architecture was used in workstations and servers that employed a variety of 32-bit SPARC V8 RISC processors. These platforms supported up to 512 MB of RAM and up to four processors running at clock frequencies of up to 200 MHz.  
Most actual hardware systems were more pedestrian, featuring sub-100 MHz clocks and sub-100 MB memories. A large software setup would have used a few hundred megabytes of RAM—today Microsoft* Word is using that much RAM on my laptop as I write this blog. 
Simics back then used a pure interpreter in its core instruction-set simulation engine. That made it between 25 and 100 times slower than actual hardware. Still, it was fast enough to get through the billion target instructions needed to boot Solaris in a reasonable amount of time. At the time, this was state-of-the-art. 

…and now

Today, a high-end system can have 512 GB of RAM (1000x the SPARCstation above), 40+ cores (10x), and processor core clock frequencies of two to five GHz (25x). Thanks to micro-architectural improvements, new instructions, and other innovations, we probably run code about 100x faster on today’s systems compared to 20 years ago. 
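The scaling factors quoted above follow directly from the figures in the text. A minimal sketch of the arithmetic, assuming the sun4m maxima (512 MB, four processors, 200 MHz) as the 1998 baseline and taking 40 cores and 5 GHz as the present-day values:

```python
# Figures quoted in the text; the dictionary keys are illustrative names only.
MB = 1
GB = 1024 * MB

then = {"ram_mb": 512 * MB, "cores": 4, "clock_mhz": 200}
now = {"ram_mb": 512 * GB, "cores": 40, "clock_mhz": 5000}

# Ratio of today's value to the 1998 value for each resource.
ratios = {key: now[key] / then[key] for key in then}

for key, ratio in ratios.items():
    print(f"{key}: {ratio:g}x")
```

With binary units the memory ratio comes out as 1024, which the text rounds to 1000x.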
Simics has also improved to keep up with the times. The interpreter mode that was used in 1998 is rarely used these days. Instead, Simics relies on VMP, which uses Intel® Virtualization Technology for IA-32, Intel® 64 and Intel® Architecture (VT-x) to run Intel® architecture-based target code directly on the host, and on just-in-time (JIT) compiler technology to convert target code to host code. These techniques make the slowdown of the virtual machine as low as 1x when running code. Thus, Simics can run large workloads on models of contemporary hardware. 
To see how workloads and the scope of virtual platforms have changed, let’s look at some examples of software we have seen running on Simics over the years. We start with some Java examples—in 1998, most server software was native, compiled for a particular processor architecture and OS. But Java was getting started on the client side, and a few years later, it jumped over to the server side. That provided a mostly host-independent environment for building business applications, thanks to the use of a Java Virtual Machine (JVM) to run byte code rather than native code. Thus, running JVMs on Simics is pretty common today. 

Just-in-Time running just in time

Java*-based benchmarks running on Simics provide an interesting example of stacked computing layers. On the top level, Java code runs on top of a Java Virtual Machine (JVM). The JVM uses a JIT compiler to translate the JVM byte code to target system code for execution. Next, that target code runs on a Simics virtual platform that contains its own JIT. The Simics JIT converts target system code to host system code. The host system code runs on the Simics host—see the diagram below:
Simics host diagram 1
 
The stacking of virtual platforms and virtual machines works just fine!
Note that in addition to the JIT, Simics “VMP” technology runs Intel architecture (IA) target code directly on the host by using Intel® Virtualization Technology (Intel® VT) for IA. This makes Simics performance similar to that of a typical virtual machine (VM) or hypervisor running on actual hardware. 

SpecJEnterprise on Simics

Our first Java example is SpecJEnterprise 2010, a benchmark from the Standard Performance Evaluation Corporation (SPEC) consortium that mimics a business workload with an application server written in Java*, talking to a back-end database. The setup contains a driver utility that sends stimuli to the application server, which in turn uses the database. SpecJEnterprise measures system performance, including the hardware, JVM, database engine, networking, and other components. We run SpecJEnterprise on Simics as a platform test case because it is a good stimulus to test the integration of the OS, Unified Extensible Firmware Interface (UEFI), and the hardware platform.
SpecJEnterprise requires at least two machines to run: one for the database and one for the application server. The driver utility can run on the same machine as the application server. The Simics setup is illustrated below, with two target systems inside a single Simics process. This provides neat encapsulation that does not depend on running external software or coordinating multiple simulation programs. In practice, the two Linux distributions used are slightly different, since each software stack comes with its own recommended Linux OS. 
Wind River Simics Diagram 2
 
The two target machines have the same hardware configuration, and they are connected using 10 Gbps Ethernet. Each target has four processor cores split over two sockets, and 192 GB of simulated RAM (96 GB attached to each socket). As a whole, the Simics setup simulates 384 GB of target RAM. Four processor cores per target system is a small configuration that is sufficient to run the benchmark setup; the Simics platform can support many more cores than that. As in 1998, Simics supports configurations all the way to the limits of the physical platform, and beyond.
When run, this configuration uses between 300 and 400 GB of host machine physical RAM—most of the simulated target RAM ends up being used and thus represented in Simics. As noted in a previous blog post, Simics can simulate very large target memories without using host RAM if the memory is not used for active data.
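The idea of simulating a large target memory without spending host RAM on untouched regions can be sketched with demand allocation. This is a generic illustration and not Simics's actual implementation: the class name, the 4 KiB page size, and the choice to read untouched memory as zero are all assumptions.

```python
PAGE_SIZE = 4096  # assumed page granularity for this sketch


class SparseMemory:
    """Target memory that allocates host storage only for pages that are written."""

    def __init__(self, size_bytes: int) -> None:
        self.size_bytes = size_bytes
        self._pages = {}  # page number -> bytearray holding that page

    def _check(self, addr: int) -> None:
        if not 0 <= addr < self.size_bytes:
            raise IndexError(f"address {addr:#x} outside target memory")

    def read_byte(self, addr: int) -> int:
        self._check(addr)
        page = self._pages.get(addr // PAGE_SIZE)
        # Pages never written are not stored; they read as zero here.
        return 0 if page is None else page[addr % PAGE_SIZE]

    def write_byte(self, addr: int, value: int) -> None:
        self._check(addr)
        number = addr // PAGE_SIZE
        page = self._pages.get(number)
        if page is None:
            page = self._pages[number] = bytearray(PAGE_SIZE)
        page[addr % PAGE_SIZE] = value

    def host_bytes_used(self) -> int:
        return len(self._pages) * PAGE_SIZE


GIB = 2**30
mem = SparseMemory(192 * GIB)       # one target's simulated RAM
mem.write_byte(0, 0xAB)
mem.write_byte(100 * GIB, 0xCD)
print(mem.host_bytes_used())        # two pages of host storage
```

A workload that touches most of its memory, as the benchmark above does, ends up with most pages allocated, which is why host RAM use approaches the simulated total.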
The two systems have their own simulated disks, and each disk has its own image. This image is a full bootable disk, basically the same as you would use on a physical system. Doing full-stack software development and execution independent of hardware and before the hardware appears are key benefits of the virtual platform. 
Each benchmark test runs for about 3.5 hours of virtual time on the target machines and takes between 12 and 14 hours of real-world (wall-clock) time, depending on the load on the server. That works out to a slowdown of around 4x, which is good for this kind of workload and an order of magnitude better than in 1998!

SpecJBB on Simics

SpecJBB 2015 is called a “Java server business benchmark.” It measures the performance of Java virtual machines and consists of a three-tier “business application.” SpecJBB 2015 can use a varying number of JVM instances. In our setup, we use a single JVM. The benchmark runs on a single target machine, both in the real world and on Simics.
Wind River Simics Diagram 3
 
The setup in Simics is shown above. It contains a single server target system with 384 GB of simulated RAM and four processor cores split across two sockets.  The software stack runs on Linux, just like SpecJEnterprise.  The target system server boots using a real UEFI from the real platform being modeled, and the server model is a full model of a server platform with all the details and peculiarities of a particular hardware platform, including the processor cores, uncore, and Platform Controller Hub (PCH). It is not a “generic system” but a rather specific model. 
Each run of SpecJBB on Simics currently covers about 3.5 hours of virtual time and takes about 12 real-world hours to complete (depending on other loads on the server running Simics). This is a slowdown of less than 4x, similar to SpecJEnterprise. 

HammerDB on Simics

HammerDB is an open-source database load testing and benchmarking tool. It is not a database in its own right, but a tool to “hammer” databases with transactions to test their performance under load.  To run HammerDB, you use two separate machines, one with the database and one running the HammerDB tool. This setup is replicated in Simics by putting two server machines into a single Simics instance, just as we did with SpecJEnterprise.
Wind River Simics Diagram 4
 
HammerDB requires “only” 128 GB of RAM in each simulated target machine, with 64 GB attached to each processor socket. Depending on the load from other software running on the server at the same time, HammerDB needs up to 20 hours of host time to run through 1.5 to 2 hours of virtual time, a slowdown of roughly a factor of 10. I consider that quite reasonable for virtual platforms, especially considering the scale of the system. It is quite a bit better than the slowdowns observed in 1998. 
HammerDB features a graphical user interface (GUI) to execute tests and check results. The GUI displays on a console attached to the Simics model.  Most of the other tests run with serial consoles, as the server workloads are designed to be headless. Simics scripting is used to automate test execution and ensure run-by-run reproducibility.

HHVM oss-performance

Our final example is a web-focused workload that uses another virtual machine system, one that targets the PHP and Hack languages, not Java. HHVM (“HipHop Virtual Machine”) is an open-source virtual machine for PHP and Hack, offering a high-performance way to run many web applications and frameworks. The HHVM oss-performance benchmark uses nginx as the web server, underneath HHVM.  To generate traffic to the web server and the application running on it, oss-performance uses the siege benchmark tool, which “besieges” web servers to test their traffic handling ability. 
This entire set of software is run on a single Simics target machine, inside a single Linux OS instance: 
Wind River Simics Diagram 5

 

The hardware resources needed to run the oss-performance benchmark are more modest than the software stacks discussed above. 24 GB of target RAM is sufficient. That illustrates the variation in scale that different software stacks exhibit. For instance, SpecJBB requires an order of magnitude more memory to run, and thus stresses the target system software and hardware in a very different way. 

The oss-performance benchmark is a suite of web application tests that run in sequence during a single run. Each component benchmark starts by launching nginx and HHVM, and then starts the web application to test on top of HHVM. Once the web application is up and running, the siege benchmarking engine starts, and an application-specific benchmark program runs. Siege connects to nginx using network sockets within the same target machine. Each component benchmark is started from a Simics script. The script watches the serial port of the target system for output strings and issues commands to the serial port. This is a good example of how Simics can automate a long, complex sequence of operations on the target system from the outside. 
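The watch-for-output, then-send-command pattern described above can be sketched in plain Python. This does not use the Simics scripting interface: `FakeSerialConsole`, `wait_for`, `run_sequence`, the prompts, and the `./run-benchmark.sh` command are all hypothetical stand-ins, with the fake console playing the role of the target.

```python
from collections import deque


class FakeSerialConsole:
    """Stand-in for a target serial console; scripted replies simulate the target."""

    def __init__(self, replies):
        self._replies = dict(replies)  # command -> text the target prints in response
        self._pending = deque(["boot complete\nlogin: "])
        self.sent = []

    def read(self) -> str:
        """Return the next chunk of console output, or '' if none is pending."""
        return self._pending.popleft() if self._pending else ""

    def write(self, line: str) -> None:
        """Send a command; the fake target answers and prints a shell prompt."""
        self.sent.append(line)
        self._pending.append(self._replies.get(line, "") + "$ ")


def wait_for(console, pattern: str, max_reads: int = 1000) -> str:
    """Accumulate console output until the pattern appears, or give up."""
    seen = ""
    for _ in range(max_reads):
        seen += console.read()
        if pattern in seen:
            return seen
    raise TimeoutError(f"never saw {pattern!r} on the console")


def run_sequence(console, steps):
    """steps: list of (string to wait for, command to send once it appears)."""
    for pattern, command in steps:
        wait_for(console, pattern)
        console.write(command)


console = FakeSerialConsole({"./run-benchmark.sh": "benchmark finished\n"})
run_sequence(console, [("login: ", "root"), ("$ ", "./run-benchmark.sh")])
final_output = wait_for(console, "benchmark finished")
```

Because every step is gated on a specific output string rather than on a timer, the same sequence replays identically regardless of how fast the host runs the simulation.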
Depending on the host and the other loads, each run of the oss-performance benchmark takes about one hour of virtual time, taking six to eight real-world hours to complete. Thus, the slowdown is less than a factor of ten.  
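The slowdown figures quoted for the four workloads are simply host (wall-clock) time divided by virtual time. A small sketch using the numbers from the text; pairing HammerDB's 20-hour maximum with its 2-hour virtual run is an assumption, since the text gives only an upper bound on host time.

```python
# (virtual hours, host hours) pairs taken from the figures quoted in the text.
RUNS = {
    "SpecJEnterprise": [(3.5, 12), (3.5, 14)],
    "SpecJBB": [(3.5, 12)],
    "HammerDB": [(2.0, 20)],  # assumption: the 20-hour maximum pairs with the 2-hour run
    "oss-performance": [(1.0, 6), (1.0, 8)],
}


def slowdown(virtual_hours: float, host_hours: float) -> float:
    """Slowdown = wall-clock time on the host / simulated time on the target."""
    return host_hours / virtual_hours


slowdowns = {
    name: [slowdown(v, h) for v, h in pairs] for name, pairs in RUNS.items()
}

for name, values in slowdowns.items():
    print(f"{name}: {min(values):.1f}x to {max(values):.1f}x")
```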

Twenty years of virtual platforms

What we see from current use cases is that Simics does the same thing it did when the technology was new, 20 years ago: it runs software from the real world on virtual platforms modeling the latest hardware, and it runs fast enough that you can use full-size standard software on the virtual platform for software validation, integration testing, and development. 
Workloads now use a few orders of magnitude more memory and processor clock cycles on the virtual platform than they did in 1998. Depending on the simulated processor frequency, a workload that takes an hour to run in virtual time will churn through some 40 to 60 teracycles of target time and tens of trillions of instructions. An OS boot in 1998 required one billion instructions, but today’s workloads run through tens of trillions.
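As a back-of-the-envelope check on the scale involved, the number of target clock cycles in a run is virtual time multiplied by clock frequency and core count. The sketch below assumes a four-core target, as in the setups described earlier, and sweeps the two to five GHz range mentioned above; the function name is illustrative.

```python
SECONDS_PER_HOUR = 3600


def target_cycles(virtual_hours: int, clock_hz: int, cores: int) -> int:
    """Total clock cycles summed over all cores for the given virtual time."""
    return virtual_hours * SECONDS_PER_HOUR * clock_hz * cores


# One hour of virtual time on an assumed four-core target.
results = {ghz: target_cycles(1, ghz * 10**9, 4) for ghz in (2, 3, 4, 5)}

for ghz, cycles in results.items():
    print(f"{ghz} GHz x 4 cores: {cycles:.2e} cycles")
```

Each core contributes 3,600 seconds times its clock rate, so one hour on four cores lands in the range of tens of trillions of cycles.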
Thankfully, host machines and Simics have scaled up along with the workloads. Just as Simics could run a contemporary server workload in 1998, it can run a contemporary server workload in 2018. In practice, the gap between the simulated machines and the host machines has actually shrunk with the advent of JIT compilers, VMP technology, and multithreading. 
For more information, visit the Wind River Simics page.