Benchmarks: what are they? What are they for? History, types and tips

Benchmarks are an essential part of our daily hardware analysis: they allow us to offer you a scientifically comparable measurement between different components such as CPUs, graphics cards, storage drives and so on. Today we dedicate a few lines to their history, their types, how they work, what they measure, which metrics are most common, and some tips on how to run them and which ones to trust.
What we know today in the PC or mobile world as benchmarks are techniques inherited from industry, which, since the beginning of the computing revolution, have enabled decision-making based on comparable data gathered in a controlled environment.
Modern computing applies these techniques to almost all of its many domains, and home users have adopted them as a reliable way to learn about the performance and capabilities of their systems, as well as an important source of information for big decisions, such as the purchase of a new computer, mobile phone or graphics card.
Today we will look at the history of PC benchmarks, the types of benchmarks that exist, and which components of our system lend themselves to tests that go beyond raw performance.
History
A benchmark applies a controlled environment and recognizable metrics that are scientifically comparable and verifiable, and it has existed for as long as the computer itself. The benchmark has been democratized to the point that part of its fundamental essence has been lost: that it can be audited and verified by third parties. We now use it more as a quick performance comparison, and the traceability of its results by third parties has largely been lost.
The most classic benchmark methods have always measured the computing capacity of the system's CPU, although in recent times this has broadened to other components as they have gained importance within the computer.
The two most classic units of measurement that are still applied are Whetstones and Dhrystones. Both have become, in some way, the basis of all the synthetic benchmarks we know today.
The older of the two is Whetstone (named after a locality in the United Kingdom that housed the atomic energy division of a UK power-engineering company); Dhrystone came later, its name a play on the former's (wet and dry).
The first was designed in the 1970s and the second in the 1980s, and both underpin the comparative performance figures we have had in the years since. Whetstone, to simplify, offered an insight into a processor's computing power in floating-point operations, that is, operations on numbers with many decimal places.
Dhrystone is its counterpart, dedicated to basic integer instructions without decimals. Together they gave a clear picture of a processor's performance from two completely different but complementary angles. Whetstone and Dhrystone evolved into two concepts we use far more commonly today: MIPS and FLOPS.
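The float-versus-integer split that Whetstone and Dhrystone embody can be illustrated with a toy micro-benchmark. This is not the actual Whetstone or Dhrystone code, just a minimal sketch of the idea: time a fixed number of floating-point iterations and a fixed number of integer iterations separately (all function names are illustrative).

```python
import time

def whetstone_like(n: int) -> float:
    """Tiny floating-point workload, loosely in the spirit of Whetstone."""
    x = 1.0
    for _ in range(n):
        x = (x * 1.000001 + 0.5) / 1.000001  # float multiply, add, divide
    return x

def dhrystone_like(n: int) -> int:
    """Tiny integer workload, loosely in the spirit of Dhrystone."""
    total = 0
    for i in range(n):
        total = (total + i) % 65521  # integer add and modulo
    return total

def rate(fn, n: int) -> float:
    """Iterations per second for a fixed iteration count."""
    start = time.perf_counter()
    fn(n)
    return n / (time.perf_counter() - start)

if __name__ == "__main__":
    n = 1_000_000
    print(f"float loop: {rate(whetstone_like, n) / 1e6:.1f} M iterations/s")
    print(f"int loop  : {rate(dhrystone_like, n) / 1e6:.1f} M iterations/s")
```

Real implementations use carefully chosen instruction mixes so the result approximates the processor's general capability, but the principle is the same: a fixed, repeatable workload divided by the time it takes.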
After these measurements came others such as FLOPS (floating-point operations per second), which is, to a large extent, more important in a computer now than it has ever been, because floating-point arithmetic is the basis of advanced computation in many modern fields: artificial-intelligence algorithms, medical computing, weather forecasting, fuzzy logic, encryption and so on.
LINPACK, developed by Jack Dongarra and colleagues, continues to be used today to measure the floating-point computing capacity of all kinds of systems. There are currently versions optimized per architecture, per CPU manufacturer, and so on.
FLOPS figures fill our articles on graphics cards (single or double precision will surely sound familiar) and processors, and they are the basis for calculating power requirements and hardware development for any supercomputer in operation or in development.
FLOPS is today the most demanded performance unit in the industry, but it has always been paired with MIPS (millions of instructions per second), an interesting metric because it tells us how many basic arithmetic instructions a processor can perform per second, although it depends more on the processor architecture (ARM, RISC, x86, etc.) and on the programming language than other units of measurement do.
As performance has advanced, the prefixes have kept pace: we now measure the performance of home CPUs in GIPS and GFLOPS. The basis remains the same, classic arithmetic operations. SiSoft Sandra still offers this type of measurement in some of its synthetic benchmarks.
MIPS has remained mostly tied to the CPU as a classic metric, while FLOPS has spread to other thriving areas, such as the general-purpose compute capacity of processors that were once very task-specific, like the GPUs integrated into our CPUs or mounted on dedicated expansion cards.
To these basic concepts, time has added new units of measurement that are as important or more so in a modern computer or supercomputer. Data movement is one of them: it is measured in IOPS (input/output operations per second) and also as throughput, the amount of data (MB/GB/TB) moved from one point to another per unit of time (MB/s, megabytes per second).
AS SSD, for example, can measure the performance of a drive in both MB/s and IOPS.
Currently we also use the transfer rate, in its different multiples, as a way of expressing the speed of data transit between two points when, to send a given payload, we actually have to generate a little more data than the payload itself. How much more depends on the protocol used for the transfer.
A clear example, and one we use a lot, is the PCI Express interface. In its first two generations, for every 8 bits of payload we want to move (0s and 1s), 10 bits must actually be generated: the extra bits are line-coding overhead (the so-called 8b/10b encoding) that keeps the link synchronized and helps preserve data integrity.
Other well-known protocols also introduce this "loss" of usable data; IP, the one you are using to read this article, is why your 300 Mbps connection actually delivers a little less than 300 Mbps of real throughput.
Therefore, we use the gigatransfer, or the transfer in general, when we refer to the raw data sent over the interface, not to the data actually delivered to the receiver. A PCI Express 2.0 link at 5 GT/s with 8b/10b encoding actually delivers 4 Gbps (0.5 GB/s) of usable data per lane; PCI Express 3.0 raised the rate to 8 GT/s and switched to a far more efficient 128b/130b encoding, delivering roughly 0.985 GB/s per lane. Transfers have become very important with the adoption of the PCI Express protocol across all the main buses of home and professional computers.
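The conversion from raw transfer rate to usable bandwidth is a simple calculation: multiply the GT/s figure by the encoding's payload ratio and the lane count, then divide by 8 to go from bits to bytes. A small sketch (the function name is illustrative; the encoding ratios are 8b/10b for PCIe 1.x/2.0 and 128b/130b for PCIe 3.0 and later):

```python
def effective_bandwidth_gbps(transfer_rate_gt: float, payload_bits: int,
                             total_bits: int, lanes: int = 1) -> float:
    """Usable bandwidth in GB/s for a serial link, given the raw transfer
    rate in GT/s and the line-encoding ratio (payload bits / total bits)."""
    usable_gbit = transfer_rate_gt * payload_bits / total_bits * lanes
    return usable_gbit / 8  # bits to bytes

# PCIe 2.0: 5 GT/s with 8b/10b encoding -> 0.5 GB/s per lane
print(effective_bandwidth_gbps(5, 8, 10))
# PCIe 3.0: 8 GT/s with 128b/130b encoding -> ~0.985 GB/s per lane
print(effective_bandwidth_gbps(8, 128, 130))
# A full PCIe 3.0 x16 slot -> ~15.75 GB/s
print(effective_bandwidth_gbps(8, 128, 130, lanes=16))
```

The jump from 8b/10b (20% overhead) to 128b/130b (about 1.5% overhead) is why PCIe 3.0 nearly doubled usable bandwidth despite the raw rate rising only from 5 to 8 GT/s.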
More recently, we have also begun to combine metrics as a way of relating processing power to other very important factors in modern computing; power consumption is one of them, introduced as a comparative scale between two systems. Energy efficiency is as important today as raw processing power, if not more, so it is common to see benchmarks that compare processing power against the watts consumed by the component under test.
In fact, one of the major supercomputer rankings looks not at the machine's gross power across all its compute nodes but at the power it delivers relative to the energy the entire system consumes. The Green500 list (FLOPS per watt) is a clear example of how consumption is now basic to any self-respecting benchmark, although we all undoubtedly keep a close eye on the TOP500 list, which does not take this factor into account.
Types of benchmarks
Although we could talk about many more families or types of benchmarks, I will simplify the list to the two classes most familiar to us as more or less advanced users.
On the one hand, we have synthetic benchmarks, which largely produce the metrics we discussed above. Synthetic benchmarks are programs that run controlled tests with a more or less stable code base targeted at a specific platform and architecture. They perform very specific tests that may exercise one or more of our components, but the same test or tests are always carried out, without changes.
Image rendering has always been a good way to gauge a CPU's performance in a modern system, since it is a demanding task. Cinebench R15, for example, includes several tests, one for the GPU and two for the CPU, with which we can measure the performance of systems with multiple cores and threads.
They offer a controlled test environment, where nothing changes except between versions, and those changes are documented so that users know which versions can be compared with one another. These programs can test the different subsystems of our computer separately, with specific pieces of code for each kind of test, or in combination, where the result is affected by the performance of two or more components at once. The benchmark built into a game, or programs like Cinebench, SiSoft Sandra, SuperPI or 3DMark, are clear examples of synthetic benchmarks.
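The defining trait of a synthetic benchmark, a fixed workload repeated identically every run, can be sketched in a few lines. This is a toy harness, not any real benchmark's code; the workload and pass count are arbitrary illustrations:

```python
import statistics
import time

def fixed_workload() -> None:
    """The exact same work on every run: the defining trait of a synthetic test."""
    total = 0.0
    for i in range(1, 200_000):
        total += 1.0 / i  # deterministic floating-point loop
    assert total > 0

def run_benchmark(workload, passes: int = 5) -> float:
    """Run the workload several times and report the median time in ms;
    the median damps out one-off interruptions from the OS."""
    times = []
    for _ in range(passes):
        start = time.perf_counter()
        workload()
        times.append((time.perf_counter() - start) * 1000)
    return statistics.median(times)

if __name__ == "__main__":
    print(f"median: {run_benchmark(fixed_workload):.1f} ms")
```

Because the workload never changes, two results differ only because the hardware or its configuration differs, which is exactly what makes synthetic scores comparable across machines.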
Other synthetic benchmarks that we should not confuse with real ones are those that simulate the execution of real programs, or that run scripted actions inside real programs. They are still synthetic, since there is no randomness in the test; PCMark is a clear example of a synthetic benchmark that can be mistaken for a real one.
A real benchmark is a very different test method, because it accepts the randomness of using an actual program to measure its performance. Gamers are used to performing this kind of benchmark, or performance test, when adjusting a game's quality settings to the capabilities of their hardware.
Measuring the performance of a game while you play is a real benchmark.
When you bring up the game's FPS counter and tweak settings to hold a steady 60 FPS, you are performing a real benchmark. The same can be extrapolated to any other type of program; if you are a developer, when you optimize your program's code you are also running real benchmark tests, where what changes is your code, or the way it is executed, on stable or variable hardware.
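An FPS overlay is doing something like the following under the hood: recording per-frame render times and deriving average and worst-case figures from them. A minimal sketch with simulated frame times (the `fps_stats` helper and the sample numbers are illustrative, not taken from any real capture tool):

```python
import statistics

def fps_stats(frame_times_ms):
    """Average FPS and '1% low' FPS from a list of per-frame render times.
    The 1% low is the FPS implied by the slowest 1% of frames."""
    avg_ms = sum(frame_times_ms) / len(frame_times_ms)
    slowest = sorted(frame_times_ms, reverse=True)
    one_percent = slowest[:max(1, len(slowest) // 100)]
    return 1000 / avg_ms, 1000 / statistics.mean(one_percent)

# Simulated capture: mostly ~16.7 ms frames (60 FPS) with a few stutters
frames = [16.7] * 197 + [33.3, 40.0, 50.0]
avg_fps, low_fps = fps_stats(frames)
print(f"avg: {avg_fps:.1f} FPS, 1% low: {low_fps:.1f} FPS")
```

The "1% low" metric matters because an average near 60 FPS can hide stutters; here the few slow frames barely move the average but pull the 1% low far below it, which is exactly what you feel as stutter while playing.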
Both types of benchmark are important. The first let us compare our system with others in a controlled environment; the second are a way to optimize our own setup, adding two further factors: randomness of execution and the human factor. Both offer an additional point of view on the performance of the component or components we want to test.
Considerations when benchmarking
For a benchmark to be useful and effective, we have to take certain genuinely important factors into account. Comparing across different platforms and architectures introduces significant uncertainty, which is why benchmarks that claim to compare, say, iOS phones with x86 Windows computers have to be taken with a grain of salt: not only does the operating system kernel change, but the processor architectures are very different. The developers of such benchmarks (Geekbench, for example) introduce correction factors between their versions that are hard to control for.
Therefore, the first key to making a benchmark comparable across different hardware is that the test ecosystem be as similar as possible: platform, operating system, drivers and software version. There will certainly be elements we cannot homogenize, such as the graphics driver when testing AMD graphics against Nvidia graphics, but everything else should be kept as stable as possible. This includes the hardware itself: to compare graphics cards, the right approach is to use the same operating system, the same processor, the same memory and the same operating parameters, including the quality settings, resolution and test scene in the benchmark. The more stable our test ecosystem, the more reliable and comparable our results will be.
Another thing to bear in mind is that benchmarks normally stress the hardware under test, subjecting it to situations that will rarely occur in normal use of the system. Every benchmark we run on our drive, graphics card or processor pushes it into situations that can be risky for the hardware, so we must take appropriate measures so that the stress point does not become a breaking point, or a source of reduced performance: many components have protection systems that throttle their performance when, for example, temperatures go outside their operating range. Adequate cooling, rest periods between tests, correct power delivery to the components under test... everything should be in an ideal state for the test to run smoothly.
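The advice above, rest periods between passes and watching for thermally throttled results, can be captured in a simple test loop. A sketch under stated assumptions: the workload, the tolerance value and the rest interval are arbitrary illustrations, and a real harness would read temperature sensors rather than infer throttling from timing alone.

```python
import time

def timed_pass() -> float:
    """One benchmark pass; returns elapsed seconds for a fixed workload."""
    start = time.perf_counter()
    total = 0
    for i in range(500_000):
        total += i * i
    return time.perf_counter() - start

def stress_with_rests(passes: int = 5, rest_s: float = 1.0,
                      tolerance: float = 0.15):
    """Run repeated passes with a cooldown between them, and flag any pass
    noticeably slower than the first: a crude thermal-throttling hint."""
    results = []
    for n in range(passes):
        elapsed = timed_pass()
        results.append(elapsed)
        if elapsed > results[0] * (1 + tolerance):
            print(f"pass {n}: {elapsed:.3f}s -- possible throttling")
        time.sleep(rest_s)  # rest period so heat can dissipate
    return results

if __name__ == "__main__":
    for n, t in enumerate(stress_with_rests(passes=3, rest_s=0.5)):
        print(f"pass {n}: {t:.3f} s")
```

If later passes are consistently slower than the first, the component is likely protecting itself by reducing clocks, and the results should be discarded and the test repeated with better cooling or longer rests.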
On the other hand, we also use precisely these benchmarks to subject a system to stress and observe its stability. This is a different way of applying a benchmark: we are not only after the performance figure but also want to know whether the system is stable and, beyond that, whether it performs as it should under these stressful conditions.
Conclusion
For those of us who test computer hardware professionally, the benchmark is a working tool, and thanks to it users have a scientific, verifiable way of comparing and understanding the performance of their next computer in each of its subsystems, with a precision comparable to tools used at the industrial level.
A test bench, like the one you see in the image, seeks precisely to standardize the test method, so that the comparative benchmark is as reliable as possible and remains verifiable when variations that modify the results are introduced.
But like any “laboratory” test, for it to be reliable, the right conditions must be in place for it to be carried out, and even more so for it to be comparable between different systems.
Today we have covered a little of the history of these programs, their different types, how they work and how to get reliable information from them. They are useful, but for me they are just one more data point to keep in mind, and I would always place them behind personal experience and hands-on testing with the real programs we are going to use every day.
A benchmark is fine for establishing a baseline performance figure in our decision process, but it should not define those decisions. And, as a final tip, be wary of synthetic benchmarks that claim to compare performance across architectures, operating systems, and so on.