OrcaFlex benchmarking and hardware recommendations

Hardware recommendations

We are unable to make specific recommendations about which hardware is best for running OrcaFlex because the available technology changes very quickly. At the current time, multi-core chips provide the best performance-to-price ratio. Beyond that, we simply recommend that you obtain the latest processor, chipset and memory in whatever family fits your other IT and purchasing requirements.

If you wish to compare the performance of machines when running OrcaFlex, we provide a benchmark program. You and your suppliers can use it to compare computer performance and determine the best machine to purchase.

OrcaFlex benchmark program

The OrcaFlex benchmark program allows you to measure how fast machines are at performing OrcaFlex calculations.

Downloading and running the OrcaFlex benchmark program

  1. Download OrcaFlexSpeed.zip (8.1 MB).
  2. Extract all the files from this zip file. We advise you to extract the files to some location on your network (see item 4).
  3. To perform a benchmark, double-click on either speed.exe (32 bit) or speed64.exe (64 bit) according to the architecture of your system.
  4. The program is a simple console program and it outputs results to the console. In addition, it records results in a file called speed.log. Results are appended to this file, so if you save the program to a network location you will be able to record results for a number of different machines in one file.

The latest version is 10.1a (released October 2016). Results are not directly comparable with older versions of the benchmark program. If you are comparing performance of two or more machines please ensure that you are using the same version of the benchmark program, preferably the latest version.

OrcaFlex benchmark program description

The program is compiled from the same sources as OrcaFlex and operates as follows:

  • Timings are performed for a number of different thread counts, tailored to the number of processors on the machine. For example, on a machine with 8 logical processors, timings are performed for thread count values of 1, 2, 4 and 8.
  • When timing with N threads, N OrcaFlex models are created. Each model is identical and has a single OrcaFlex line with 500 elements. All other data is left at default values.
  • The N threads then perform dynamic simulation of the N models. These simulations are run for 20s of wall clock time, and then paused.
  • The total amount of OrcaFlex simulation time is reported. For example, if thread 1 managed to calculate 90s of simulation, and thread 2 managed 80s, then the total simulation time reported is 170s.
  • This total simulation time is a measure of throughput. Larger values correspond to greater throughput of OrcaFlex analysis.
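
The steps above can be sketched in a few lines of Python. This is only an illustration, not the benchmark's actual code: the function names are hypothetical, and the rule used to pick thread counts (powers of two up to the processor count, plus the count itself) is an assumption that merely reproduces the 1, 2, 4, 8 example given for 8 logical processors.

```python
def thread_counts(logical_processors):
    """Return the thread counts to time (assumed rule, see lead-in)."""
    counts = []
    n = 1
    while n < logical_processors:
        counts.append(n)
        n *= 2
    # Always finish with one thread per logical processor.
    counts.append(logical_processors)
    return counts


def total_simulation_time(per_thread_seconds):
    """Sum the simulated seconds each thread completed in one timed run."""
    return sum(per_thread_seconds)


print(thread_counts(8))                      # [1, 2, 4, 8]
print(total_simulation_time([90.0, 80.0]))   # 170.0
```

The second call mirrors the two-thread example: 90s from one thread plus 80s from the other gives a reported total of 170s.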

To make these abstract concepts clearer, here is some sample output:

Program Version: 10.1a 64 bit nodelocalmm
Microsoft Windows 7 Professional  6.1.7601
CPU=8664 Level=6 Rev=1A05 Logical processors=8 Physical processor cores=4

Running 1 simulation using 1 thread...
Run 1:  119.1s of simulation in 20.0s of real time
Run 2:  119.4s of simulation in 20.0s of real time
Run 3:  118.0s of simulation in 20.0s of real time
-----------------------------------
Best time                  119.4s
Average time               118.8s

Running 2 simulations using 2 threads...
Run 1:  214.4s of simulation in 20.0s of real time
Run 2:  217.4s of simulation in 20.0s of real time
Run 3:  221.2s of simulation in 20.0s of real time
-----------------------------------
Best time                  221.2s
Average time               217.7s
Theoretical peak scaling     2.00
Actual scaling               1.85

Running 4 simulations using 4 threads...
Run 1:  384.8s of simulation in 20.0s of real time
Run 2:  391.1s of simulation in 20.0s of real time
Run 3:  395.1s of simulation in 20.0s of real time
-----------------------------------
Best time                  395.1s
Average time               390.3s
Theoretical peak scaling     4.00
Actual scaling               3.31

Running 8 simulations using 8 threads...
Run 1:  646.3s of simulation in 20.0s of real time
Run 2:  627.2s of simulation in 20.0s of real time
Run 3:  659.8s of simulation in 20.0s of real time
-----------------------------------
Best time                  659.8s
Average time               644.4s
Theoretical peak scaling     8.00
Actual scaling               5.53

-----------------------------------
Throughput for 8 cores     659.8s
-----------------------------------

Note that as thread count increases, the total simulation time, or throughput, also increases. In an ideal world, two threads would have twice the throughput of a single thread. That theoretical limit is known as linear scaling. However, due to multi-threading overheads, the scaling is not linear and the benchmark program reports the actual scaling that is achieved.
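
The scaling figures in the sample output are consistent with dividing the best N-thread throughput by the best single-thread throughput. The short Python sketch below reproduces them from the sample values. The function name is hypothetical, and the formula is inferred from the sample numbers rather than taken from the program's source.

```python
# Best throughput (seconds of simulation) per thread count, from the sample output.
best_throughput = {1: 119.4, 2: 221.2, 4: 395.1, 8: 659.8}


def actual_scaling(best):
    """Throughput at each thread count relative to the single-thread result."""
    baseline = best[1]
    return {threads: round(value / baseline, 2) for threads, value in best.items()}


for threads, scaling in sorted(actual_scaling(best_throughput).items()):
    # Theoretical peak scaling is simply the thread count.
    print(f"{threads} threads: theoretical {threads:.2f}, actual {scaling:.2f}")
```

For the sample machine this gives 1.85, 3.31 and 5.53 for 2, 4 and 8 threads, matching the reported values.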

The reporting of scaling can be used to diagnose a machine whose memory and chipset are holding back the processors. Higher specification chipsets will achieve higher levels of scaling. However, even the very best chipsets will not achieve linear scaling.

Another factor that can be observed with the reported scaling values is the performance of chips that have two logical processors per core. In our experience, such architectures are beneficial to throughput. Although the scaling level achieved is typically well short of linear, the throughput is increased when using these logical processors.

The bottom line figure from the benchmark program is the final reported value, the throughput for N cores, where N is the total number of logical processors. In the sample output above that is the value of 659.8s. When comparing one machine against another, it is this value that should be used in the comparison. This value is the best indicator of OrcaFlex analysis throughput.
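
If you capture the console output from several machines, the final figure can be extracted automatically. The following Python sketch assumes the captured text contains a line laid out like the "Throughput for 8 cores" line in the sample output. The function name and pattern are illustrative, and the layout of speed.log is not described here, so check it before applying the same pattern to that file.

```python
import re

# Matches lines such as "Throughput for 8 cores     659.8s".
THROUGHPUT_LINE = re.compile(r"Throughput for\s+(\d+)\s+cores\s+([\d.]+)s")


def final_throughput(output_text):
    """Return (logical processors, throughput in seconds) from the last match."""
    matches = THROUGHPUT_LINE.findall(output_text)
    if not matches:
        raise ValueError("no throughput line found")
    processors, seconds = matches[-1]
    return int(processors), float(seconds)


sample = """-----------------------------------
Throughput for 8 cores     659.8s
-----------------------------------"""
print(final_throughput(sample))  # (8, 659.8)
```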

So, for example, suppose you compared an 8 core machine against a 16 core machine. If the 8 core machine reported a throughput of 700s, and the 16 core machine had a throughput of 800s, the 16 core machine should be preferred. Continuing this hypothetical example, the 16 core machine with a throughput only about 14% greater than the 8 core machine is probably slower at running individual simulations than the 8 core machine. But because the 16 core machine will achieve a greater overall throughput, it is likely the better choice.
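
The arithmetic behind this comparison can be checked with a short Python sketch. The function name is hypothetical. Dividing throughput by the number of cores is only a rough proxy for the speed of an individual simulation under full load; the single-thread result reported by the benchmark is the direct measure.

```python
def compare_machines(cores_a, throughput_a, cores_b, throughput_b):
    """Return B's relative throughput gain over A and each machine's per-core throughput."""
    gain = throughput_b / throughput_a - 1.0
    per_core_a = throughput_a / cores_a
    per_core_b = throughput_b / cores_b
    return gain, per_core_a, per_core_b


gain, per_core_8, per_core_16 = compare_machines(8, 700.0, 16, 800.0)
print(f"Throughput gain of the 16 core machine: {gain:.1%}")   # 14.3%
print(f"Per-core throughput, 8 core machine:  {per_core_8}s")   # 87.5s
print(f"Per-core throughput, 16 core machine: {per_core_16}s")  # 50.0s
```

The 16 core machine delivers more total throughput, even though each of its cores contributes less than a core of the 8 core machine when both are fully loaded.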

The advice above is based on the assumption that your use of OrcaFlex involves running large numbers of simulations. If your typical use involves analysing one model at a time, then you should consider a diagnostic that more closely matches your usage scenario. That said, the overwhelming majority of intensive OrcaFlex use involves large numbers of distinct simulations, so we believe that the benchmark program will prove a representative measure.