OrcaFlex benchmark

The OrcaFlex benchmark program allows you to measure how fast machines are at performing OrcaFlex calculations.

Downloading and running the OrcaFlex benchmark program

  1. Download the following zip file: OrcaFlexSpeed.zip
  2. Extract all the files from this zip file. We advise you to extract the files to some location on your network (see item 4).
  3. To perform a benchmark, double-click on speed.exe.
  4. The program is a simple console program that outputs its results to the console. It also records results in a file called speed.log, appending to this file on each run: if you save the program to a network location, you can collect the results for a number of different machines in a single file (see the sketch below).
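
If you want to run the benchmark unattended, for instance from a script on each machine, it can be launched programmatically. The following is a minimal sketch in Python, not part of the benchmark distribution; the share path \\server\benchmarks\OrcaFlexSpeed is a hypothetical placeholder for wherever you extracted the files.

    import os
    import subprocess

    # Hypothetical network folder to which OrcaFlexSpeed.zip was extracted.
    benchmark_dir = r"\\server\benchmarks\OrcaFlexSpeed"

    # Run speed.exe with the shared folder as the working directory, so that
    # the appended speed.log accumulates results from every machine in one file.
    subprocess.run(
        [os.path.join(benchmark_dir, "speed.exe")],
        cwd=benchmark_dir,
        check=True,
    )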

The latest version is 11.4a (released November 2023). Results are not directly comparable with those from older versions of the benchmark program. If you are comparing the performance of two or more machines, please ensure that you use the same version of the benchmark program, preferably the latest.

OrcaFlex benchmark program description

The program is compiled from the same sources as OrcaFlex and operates as follows:

  • Timings are performed for a number of different thread counts, tailored to the number of processors on the machine. For example, on a machine with 8 logical processors, timings are performed for thread count values of 1, 2, 4 and 8.
  • When timing with N threads, N OrcaFlex models are created. Each model is identical and has a single OrcaFlex line with 500 elements. All other data is left at default values.
  • The N threads then perform dynamic simulation of the N models. These simulations are run for 20s of wall clock time, and then paused.
  • The total amount of OrcaFlex simulation time is reported. For example, if thread 1 managed to calculate 90s of simulation, and thread 2 managed 80s, then the total simulation time reported is 170s (this arithmetic is sketched after the list).
  • This total simulation time is a measure of throughput. Large values correspond to a greater throughput of OrcaFlex analysis.
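
In code form, the reported figure is just the sum of the simulation time achieved by each thread. A minimal sketch of that arithmetic in Python, using the two-thread example from the list above:

    # Simulated seconds achieved by each thread in 20s of wall clock time,
    # taken from the two-thread example above.
    per_thread_simulation_time = [90.0, 80.0]

    # The reported total is the sum across threads: a measure of throughput.
    total_simulation_time = sum(per_thread_simulation_time)
    print(f"{total_simulation_time:.1f}s of simulation in 20.0s of real time")  # 170.0s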

To make these abstract concepts clearer, here is some sample output:

Program Version: 11.4a 64 bit threadlocalmm
WS-EPYC: run at 24/11/2023 11:26:52
    Microsoft Windows 10 Pro 10.0.19045
    CPU=8664 Level=25 Rev=101 Logical processors=256 Physical processor cores=128

    Running 1 simulation using 1 thread...
    Run 1:  362.1s of simulation in 20.0s of real time
    Run 2:  368.2s of simulation in 20.0s of real time
    Run 3:  366.8s of simulation in 20.0s of real time
    -----------------------------------
    Best time                  368.2s
    Average time               365.7s


    Running 2 simulations using 2 threads...
    Run 1:  721.2s of simulation in 20.0s of real time
    Run 2:  731.4s of simulation in 20.0s of real time
    Run 3:  732.1s of simulation in 20.0s of real time
    -----------------------------------
    Best time                  732.1s
    Average time               728.2s
    Theoretical peak scaling     2.00
    Actual scaling               1.99


    Running 4 simulations using 4 threads...
    Run 1: 1313.3s of simulation in 20.0s of real time
    Run 2: 1315.0s of simulation in 20.0s of real time
    Run 3: 1312.2s of simulation in 20.0s of real time
    -----------------------------------
    Best time                 1315.0s
    Average time              1313.5s
    Theoretical peak scaling     4.00
    Actual scaling               3.57


    Running 8 simulations using 8 threads...
    Run 1: 2597.6s of simulation in 20.0s of real time
    Run 2: 2603.0s of simulation in 20.0s of real time
    Run 3: 2585.2s of simulation in 20.0s of real time
    -----------------------------------
    Best time                 2603.0s
    Average time              2595.3s
    Theoretical peak scaling     8.00
    Actual scaling               7.07


    Running 16 simulations using 16 threads...
    Run 1: 4751.3s of simulation in 20.0s of real time
    Run 2: 4794.8s of simulation in 20.0s of real time
    Run 3: 4722.9s of simulation in 20.0s of real time
    -----------------------------------
    Best time                 4794.8s
    Average time              4756.3s
    Theoretical peak scaling    16.00
    Actual scaling              13.02


    Running 32 simulations using 32 threads...
    Run 1: 8634.5s of simulation in 20.0s of real time
    Run 2: 8409.5s of simulation in 20.0s of real time
    Run 3: 8404.4s of simulation in 20.0s of real time
    -----------------------------------
    Best time                 8634.5s
    Average time              8482.8s
    Theoretical peak scaling    32.00
    Actual scaling              23.45


    Running 64 simulations using 64 threads...
    Run 1: 15929.9s of simulation in 20.0s of real time
    Run 2: 15925.7s of simulation in 20.0s of real time
    Run 3: 16028.1s of simulation in 20.0s of real time
    -----------------------------------
    Best time                 16028.1s
    Average time              15961.2s
    Theoretical peak scaling    64.00
    Actual scaling              43.53


    Running 128 simulations using 128 threads...
    Run 1: 24943.7s of simulation in 20.0s of real time
    Run 2: 24930.4s of simulation in 20.0s of real time
    Run 3: 24781.9s of simulation in 20.0s of real time
    -----------------------------------
    Best time                 24943.7s
    Average time              24885.3s
    Theoretical peak scaling   128.00
    Actual scaling              67.74


    Running 256 simulations using 256 threads...
    Run 1: 23166.7s of simulation in 20.0s of real time
    Run 2: 23193.7s of simulation in 20.0s of real time
    Run 3: 23171.5s of simulation in 20.0s of real time
    -----------------------------------
    Best time                 23193.7s
    Average time              23177.3s
    Theoretical peak scaling   256.00
    Actual scaling              62.99


    -----------------------------------
    Throughput for 256 cores   23193.7s
    -----------------------------------

Note that as the thread count increases, the total simulation time, or throughput, also increases. In an ideal world, two threads would have twice the throughput of a single thread. That theoretical limit is known as linear scaling. However, due to multi-threading overheads, the scaling achieved in practice is not linear, and the benchmark program reports the actual scaling achieved.
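
As a short sketch of how those scaling figures relate to the reported throughputs: the formula used here, best throughput at N threads divided by best throughput at 1 thread, is an assumption on our part, but it reproduces the actual scaling values in the sample output above.

    # Best throughput (simulated seconds in 20s of real time) for each thread
    # count, taken from the sample output above.
    best_throughput = {1: 368.2, 2: 732.1, 4: 1315.0, 8: 2603.0, 16: 4794.8}

    for threads, throughput in best_throughput.items():
        peak = float(threads)                     # theoretical peak (linear) scaling
        actual = throughput / best_throughput[1]  # actual scaling achieved
        print(f"{threads:3d} threads: peak {peak:6.2f}, actual {actual:6.2f}")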

The reported scaling values can be used to diagnose a machine whose memory and chipset are holding back the processors. Higher specification chipsets achieve higher levels of scaling. However, even the very best chipsets will not achieve linear scaling.

The reported scaling values also reveal the performance of processors that have two logical processors per physical core. These processors offer the possibility of increased throughput, but also the risk of increased competition for other computer resources, such as access to memory or storage. In our experience, using the extra logical processors is sometimes beneficial to throughput, although the scaling achieved is typically well short of linear. However, there are also machines where using the logical processors reduces throughput, as seen in the sample output above, where throughput falls between 128 and 256 threads.

The bottom line figure from the benchmark program is the final reported value, the throughput for N cores, where N is the total number of logical processors. In the sample output above that is the value of 23193.7s. When comparing one machine against another, it is this value that should be used. It is the best indicator of OrcaFlex analysis throughput.
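
If several machines have appended their results to a shared speed.log, a small script can pull out each machine's bottom line figure for comparison. The sketch below is illustrative, not part of the benchmark distribution, and assumes the log contains the same machine-name and "Throughput for N cores" lines as the console output shown above:

    import re

    # Collect each machine's bottom line throughput from a shared speed.log,
    # assuming the log format matches the console output shown above.
    machine = None
    throughputs = {}
    with open("speed.log") as log:
        for line in log:
            run_header = re.match(r"(\S+): run at ", line)
            if run_header:
                machine = run_header.group(1)
            bottom_line = re.search(r"Throughput for \d+ cores\s+([\d.]+)s", line)
            if bottom_line and machine:
                throughputs[machine] = float(bottom_line.group(1))

    # Higher is better: rank the machines by throughput.
    for name, value in sorted(throughputs.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {value:.1f}s")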

For example, suppose you compared an 8 core machine against a 16 core machine. If the 8 core machine reported a throughput of 700s, and the 16 core machine a throughput of 800s, the 16 core machine should be preferred. Continuing this hypothetical example, the 16 core machine, with a throughput only around 14% greater than the 8 core machine, is probably slower at running individual simulations. But because it achieves a greater overall throughput, it is likely the better choice.
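
The arithmetic behind that judgement, as a brief sketch using the hypothetical figures above:

    # Hypothetical bottom line throughputs from the example above.
    cores_a, throughput_a = 8, 700.0    # 8 core machine
    cores_b, throughput_b = 16, 800.0   # 16 core machine

    # Overall throughput favours the 16 core machine...
    print(f"throughput gain: {100 * (throughput_b / throughput_a - 1):.0f}%")  # 14%

    # ...but per core, and so per individual simulation, the 8 core machine is faster.
    print(f"8 core machine:  {throughput_a / cores_a:.1f}s of simulation per core")  # 87.5s
    print(f"16 core machine: {throughput_b / cores_b:.1f}s of simulation per core")  # 50.0s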

The advice above assumes that your use of OrcaFlex involves running large numbers of simulations. If your typical use involves analysing one model at a time, then you should consider a diagnostic that more closely matches that usage. That said, the overwhelming majority of intensive OrcaFlex use involves large numbers of distinct simulations, so we believe the benchmark program will prove useful.