There are two basic approaches to benchmarking in the field of computing: the "scientific" or quantitative approach and the "benchmarketing" approach. Both approaches use exactly the same tools, but with slightly different methodologies and, of course, with widely diverging objectives and results.
The first approach is to think of benchmarking as a tool for experimentation. Benchmarking is a quantitative, experimental branch of Computer Science: it produces numbers which can then be mathematically processed and analyzed, and this analysis is later used to draw relevant conclusions about CPU architectures, compiler design, etc.
As with any scientific activity, experiments (benchmark runs and reporting) must follow some basic guidelines or rules:
Of course, this is an idealized view of the scientific community, but these are some of the basic rules for the experimental methods in all branches of Science.
I should stress that benchmarking results in documented quantitative data.
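As a minimal sketch of what "documented quantitative data" means in practice, the following hypothetical Python fragment times a stand-in workload over several runs and reports a mean and standard deviation. The workload, run count, and function names here are illustrative assumptions, not part of any standard benchmark suite:

```python
# Hypothetical sketch: turning raw benchmark runs into documented
# quantitative data, as the "scientific" approach requires.
import statistics
import time

def run_benchmark():
    """A stand-in CPU-bound workload (sum of squares); any real
    benchmark kernel would go here."""
    return sum(i * i for i in range(100_000))

def measure(runs=5):
    """Time several runs and report mean and standard deviation,
    so results can be documented, reproduced, and compared."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_benchmark()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)

mean_t, stdev_t = measure()
print(f"mean = {mean_t:.6f} s, stdev = {stdev_t:.6f} s")
```

Reporting a dispersion measure alongside the mean is what separates an experiment from a single anecdotal number: it lets a reader judge whether a difference between two configurations is larger than the run-to-run noise.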
The correct procedure for GNU/Linux benchmarking under this approach is:
This second approach is more popular than the first, as it serves commercial purposes and attracts more subsidies (i.e. grants, sponsorship, money, cash, dinero, l'argent, $). Benchmarketing has one basic objective: to prove that equipment/software A is better (faster, more powerful, better performing, or with a better price/performance ratio) than equipment/software B. The basic inspiration for this approach is the Greek philosophical current known as Sophism. Sophistry has had its adepts in all times and ages, but the Greeks made it into a veritable art, and benchmarketers have continued this tradition with varying success (also note that the first Sophists were lawyers (1); see my comment on Intel below). Of course, with this approach there is no hope of spiritual redemption... Quoting Larry Wall (of Perl fame), as often quoted by David Niemi:
"Down that path lies madness. On the other hand the road to Hell is paved with melting snowballs."
Benchmarketing results cover the entire range from outright lies to subtle fallacies. Sometimes an excessive amount of data is involved, and in other cases no quantitative data at all is provided; in both cases the task of proving benchmarketing wrong is made more arduous.
We already saw that the first widely used benchmark, Whetstone, originated as the result of research into computer architecture and compiler design. So the original Whetstone benchmark can be traced to the "scientific approach".
At the time Whetstone was written, computers were indeed rare and very expensive, and the fact that they executed tasks impossible for human beings was enough to justify their purchase by large organizations.
Very soon competition changed this. Foremost among the early benchmarketing drives was the need to justify the purchase of very expensive mainframes (at the time called supercomputers; these early machines would not even match my < $900 AMD K6 box). This gave rise to a good number of now-obsolete benchmarks, since, of course, each different architecture needed a new algorithm to justify its existence in commercial terms.
This supercomputer market issue is still not over, but two factors contributed to its relative decline:
Next in line was the workstation market issue. A nice side-effect of the various marketing initiatives on the part of some competitors (HP, Sun, IBM, SGI among others) is that it spawned the development of various Unix benchmarks that we can now use to benchmark our GNU/Linux boxes!
In parallel to the workstation market development, we saw fierce competition develop in the microprocessor market, with each manufacturer touting its architecture as the "superior" design. In terms of microprocessor architecture, an interesting development was the performance contest between CISC and RISC designs. In market terms the dominating architecture is Intel's x86 CISC design (cf. Computer Architecture: A Quantitative Approach, Hennessy and Patterson, 2nd edition, which has an excellent 25-page appendix on the x86 architecture).
Recently the demonstrably better-performing Alpha RISC architecture was almost wiped out by Intel lawyers: as a settlement of a complex legal battle over patent infringements, Intel bought Digital's microelectronics operation (which also produced the StrongARM (2) and Digital's highly successful line of Ethernet chips). Note, however, that Digital kept its Alpha design team, and the settlement allows Digital to have present and future Alpha chips manufactured by Intel.
The x86 market attracted Intel competitors AMD and, more recently, Cyrix, both of which created original x86 designs. AMD also bought a small startup called NexGen, which designed the precursor to the K6, and Cyrix had to grow under the umbrella of IBM and now National Semiconductor, but that's another story altogether. Intel is still the market leader, with 90% of the microprocessor market, even though both the AMD K6 and Cyrix 6x86MX architectures provide better Linux performance/MHz than Intel's best effort to date, the Pentium II (except for floating-point operations).
Lastly, we have the OS market issue. The Microsoft Windows (R) line of OSes is the overwhelming market leader as far as desktop applications are concerned, but in terms of performance/security/stability/flexibility it sometimes does not compare well with other OSes. Of course, inter-OS benchmarking is a risky business, and OS designers are aware of that.
Besides, comparing GNU/Linux to other OSes using benchmarks is almost always an exercise in futility: GNU/Linux is GPLed, whereas no other OS can be said to be free (in the GNU/GPL sense). Can you compare something that is free to something that is proprietary? (3) Does benchmarketing even apply to something that is free?
Comparing GNU/Linux to other OSes is also a good way to start a nice flame war on comp.os.linux.advocacy, especially when GNU/Linux is compared to the BSD Unices or Windows NT. Most debaters don't seem to realize that each OS had different design objectives!
These debates usually reach a steady state when both sides are convinced that they are "right" and that their opponents are "wrong". Sometimes benchmarking data is called in to prove or disprove an argument. But even then we see that this has more to do with benchmarketing than with benchmarking. My $0.02 of advice: avoid such debates like the plague.
The SPEC95 CPU benchmark suite (the CPU Integer and FP tests, which SPEC calls CINT95/CFP95) is an example of a promising Jedi that succumbed to the Dark side of the Force ;-).
SPEC (Standard Performance Evaluation Corporation) originated as a non-profit corporation with the explicit objective of creating a vendor-independent, objective, non-biased, industry-wide CPU benchmark suite. Founding members were some universities and various CPU and systems manufacturers, such as Intel, HP, Digital, IBM and Motorola.
However, some technical and philosophical issues have developed for historical reasons that make SPEC95 inadequate for Linux benchmarking:
Summarizing: if you absolutely must compare CPU performance for different configurations running GNU/Linux, SPEC95 is definitely not the recommended benchmark. On the other hand, it's a handy tool for benchmarketers.