There are two basic approaches to benchmarking in the field of computing: the "scientific" or quantitative approach and the "benchmarketing" approach. Both approaches use exactly the same tools, but with slightly different methodologies and, of course, with widely diverging objectives and results.
The first approach is to think of benchmarking as a tool for experimentation. Benchmarking is a quantitative, experimental branch of Computer Science: it produces numbers which can then be mathematically processed and analyzed, and this analysis is later used to draw relevant conclusions about CPU architectures, compiler design, etc.
As with any scientific activity, experiments (benchmark runs and reporting) must follow some basic guidelines or rules:
Of course, this is an idealized view of the scientific community, but these are some of the basic rules for the experimental methods in all branches of Science.
I should stress that benchmarking results in documented quantitative data.
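As a minimal sketch of what "documented quantitative data" means in practice, the following hypothetical Python fragment times a stand-in workload over several runs and reports a mean and standard deviation. The workload, run count, and function names here are illustrative assumptions, not part of any standard benchmark suite:

```python
# Hypothetical sketch: turning raw benchmark runs into documented
# quantitative data, as the "scientific" approach requires.
import statistics
import time

def run_benchmark():
    """A stand-in CPU-bound workload (sum of squares); any real
    benchmark kernel would go here."""
    return sum(i * i for i in range(100_000))

def measure(runs=5):
    """Time several runs and report mean and standard deviation,
    so results can be documented, reproduced, and compared."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_benchmark()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)

mean_t, stdev_t = measure()
print(f"mean = {mean_t:.6f} s, stdev = {stdev_t:.6f} s")
```

Reporting a dispersion measure alongside the mean is what separates an experiment from a single anecdotal number: it lets a reader judge whether a difference between two configurations is larger than the run-to-run noise.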
The correct procedure for GNU/Linux benchmarking under this approach is:
This second approach is more popular than the first, as it serves commercial purposes and attracts more subsidies (i.e. grants, sponsorship, money, cash, dinero, l'argent, $). Benchmarketing has one basic objective: to prove that equipment/software A is better (faster, more powerful, better performing, or with a better price/performance ratio) than equipment/software B. The basic inspiration for this approach is the Greek philosophical current known as Sophism. Sophistry has had its adepts in all times and ages, but the Greeks made it into a veritable art, and benchmarketers have continued this tradition with varying success (also note that the first Sophists were lawyers (1); see my comment on Intel below). Of course, with this approach there is no hope of spiritual redemption... Quoting Larry Wall (of Perl fame), as often quoted by David Niemi:
"Down that path lies madness. On the other hand the road to Hell is paved with melting snowballs."
Benchmarketing results cover the entire range from outright lies to subtle fallacies. Sometimes an excessive amount of data is involved, and in other cases no quantitative data at all is provided; in both cases the task of proving benchmarketing wrong is made more arduous.
We already saw that the first widely used benchmark, Whetstone, originated as the result of research into computer architecture and compiler design. So the original Whetstone benchmark can be traced to the "scientific approach".
At the time Whetstone was written, computers were indeed rare and very expensive, and the fact that they executed tasks impossible for human beings was enough to justify their purchase by large organizations.
Very soon competition changed this. Foremost among the early benchmarketing drives was the need to justify the purchase of very expensive mainframes (at the time called supercomputers; these early machines would not even match my < $900 AMD K6 box). This gave rise to a good number of now-obsolete benchmarks, since, of course, each different architecture needed a new algorithm to justify its existence in commercial terms.
This supercomputer market issue is still not over, but two factors contributed to its relative decline:
Next in line was the workstation market issue. A nice side-effect of the various marketing initiatives on the part of some competitors (HP, Sun, IBM, SGI among others) is that it spawned the development of various Unix benchmarks that we can now use to benchmark our GNU/Linux boxes!
In parallel to the workstation market development, we saw fierce competition develop in the microprocessor market, with each manufacturer touting its architecture as the "superior" design. In terms of microprocessor architecture, an interesting development was the performance contest between CISC and RISC designs. In market terms the dominating architecture is Intel's x86 CISC design (cf. Computer Architecture: A Quantitative Approach, Hennessy and Patterson, 2nd edition, which has an excellent 25-page appendix on the x86 architecture).
Recently the demonstrably better-performing Alpha RISC architecture was almost wiped out by Intel lawyers: as a settlement of a complex legal battle over patent infringements, Intel bought Digital's microelectronics operation (which also produced the StrongARM (2) and Digital's highly successful line of Ethernet chips). Note, however, that Digital kept its Alpha design team, and the settlement allows Digital to have present and future Alpha chips manufactured by Intel.
The x86 market attracted Intel competitors AMD and, more recently, Cyrix, both of which created original x86 designs. AMD also bought a small startup called NexGen, which designed the precursor to the K6, and Cyrix had to grow under the umbrella of IBM and now National Semiconductor, but that's another story altogether. Intel is still the market leader, with 90% of the microprocessor market, even though both the AMD K6 and Cyrix 6x86MX architectures provide better Linux performance/MHz than Intel's best effort to date, the Pentium II (except for floating-point operations).
Lastly, we have the OS market issue. The Microsoft Windows (R) line of OSes is the overwhelming market leader as far as desktop applications are concerned, but in terms of performance/security/stability/flexibility it sometimes does not compare well with other OSes. Of course, inter-OS benchmarking is a risky business, and OS designers are aware of that.
Besides, comparing GNU/Linux to other OSes using benchmarks is almost always an exercise in futility: GNU/Linux is GPLed, whereas no other OS can be said to be free (in the GNU/GPL sense). Can you compare something that is free to something that is proprietary? (3) Does benchmarketing even apply to something that is free?
Comparing GNU/Linux to other OSes is also a good way to start a nice flame war on comp.os.linux.advocacy, especially when GNU/Linux is compared to the BSD Unices or Windows NT. Most debaters don't seem to realize that each OS had different design objectives!
These debates usually reach a steady state when both sides are convinced that they are "right" and that their opponents are "wrong". Sometimes benchmarking data is called in to prove or disprove an argument. But even then we see that this has more to do with benchmarketing than with benchmarking. My $0.02 of advice: avoid such debates like the plague.
The SPEC95 CPU benchmark suite (the CPU Integer and FP tests, which SPEC calls CINT95/CFP95) is an example of a promising Jedi that succumbed to the Dark side of the Force ;-).
SPEC (Standard Performance Evaluation Corporation) originated as a non-profit corporation with the explicit objective of creating a vendor-independent, objective, non-biased, industry-wide CPU benchmark suite. Founding members were some universities and various CPU and systems manufacturers, such as Intel, HP, Digital, IBM and Motorola.
However, some technical and philosophical issues have developed for historical reasons that make SPEC95 inadequate for Linux benchmarking:
Summarizing: if you absolutely must compare CPU performance for different configurations running GNU/Linux, SPEC95 is definitely not the recommended benchmark. On the other hand, it's a handy tool for benchmarketers.