A benchmark is a documented procedure that will measure the time needed by a computer system to execute a well-defined computing task. It is assumed that this time is related to the performance of the computer system and that somehow the same procedure can be applied to other systems, so that comparisons can be made between different hardware/software configurations.
From the definition of a benchmark, one can easily deduce that there are two basic procedures for benchmarking:
If a single iteration of our test code takes a long time to execute, procedure 1 will be preferred. On the other hand, if the system being tested is able to execute thousands of iterations of our test code per second, procedure 2 should be chosen.
Both procedures 1 and 2 will yield final results in the form "seconds/iteration" or "iterations/second" (these two forms are interchangeable). One could imagine other algorithms, e.g. self-modifying code or measuring the time needed to reach a steady state of some sort, but this would increase the complexity of the code and produce results that would probably be next to impossible to analyze and compare.
Sometimes, figures obtained from standard benchmarks on a system being tested are compared with the results obtained on a reference machine. The reference machine's results are called the baseline results. If we divide the results of the system under examination by the baseline results, we obtain a performance index. Obviously, the performance index for the reference machine is 1.0. An index has no units, it is just a relative measurement.
The final result of any benchmarking procedure is always a set of numerical results which we can call speed or performance (for that particular aspect of our system effectively tested by the piece of code).
Under certain conditions we can combine results from similar tests or various indices into a single figure, and the term metric will be used to describe the "units" of performance for this benchmarking mix.
Time measurements for benchmarking purposes are usually taken by defining a starting time and an ending time, the difference between the two being the elapsed wall-clock time. Wall-clock means we are not considering just CPU time, but the "real" time usually provided by an internal asynchronous real-time clock source in the computer or an external clock source (your wrist-watch for example). Some tests, however, make use of CPU time: the time effectively spent by the CPU of the system being tested in running the specific benchmark, and not other OS routines.
Resolution and precision both measure the information provided by a data point, but should not be confused.
Resolution is the minimum time interval that can be (easily) measured on a given system. In Linux running on i386 architectures I believe this is 1/100 of a second, provided by the GNU C system library function
times (see /usr/include/time.h - not very clear, BTW). Another term used with the same meaning is "granularity". David C. Niemi has developed an interesting technique to lower granularity to very low (sub-millisecond) levels on Linux systems, I hope he will contribute an explanation of his algorithm in the next article.
Precision is a measure of the total variability in the results for any given benchmark. Computers are deterministic systems and should always provide the same, identical benchmark results if running under identical conditions. However, since Linux is a multi-tasking, multi-user system, some tasks will be running in the background and will eventually influence the benchmark results.
This "random" error can be expressed as a time measurement (e.g. 20 seconds + or - 0.2 s) or as a percentage of the figure obtained by the benchmark considered (e.g. 20 seconds + or - 1%). Other terms sometimes used to describe variations in results are "variance", "noise", or "jitter".
Note that whereas resolution is system dependent, precision is a characteristic of each benchmark. Ideally, a well-designed benchmark will have a precision smaller than or equal to the resolution of the system being tested. It is very important to identify the sources of noise for any particular benchmark, since this provides an indication of possibly erroneous results.
A program or program suite specifically designed to measure the performance of a subsystem (hardware, software, or a combination of both). Whetstone is an example of a synthetic benchmark.
A commonly executed application is chosen and the time to execute a given task with this application is used as a benchmark. Application benchmarks try to measure the performance of computer systems for some category of real-world computing task. Measuring the time your Linux box takes to compile the kernel can be considered as a sort of application benchmark.
A benchmark or its results are said to be irrelevant when they fail to effectively measure the performance characteristic the benchmark was designed for. Conversely, benchmark results are said to be relevant when they allow an accurate prediction of real-life performance or meaningful comparisons between different systems.