
Performance tuning


Cyrix/IBM CPUs provide very good performance under Linux by default. They also have many features that allow fine-tuning for optimum performance in specific cases. These can be tested very easily with set6x86 version 1.5, since the 6x86_reg utility makes it easy to visualize the 6x86 register settings.

Note, however, that no benchmark results are available to verify what real performance gains can be obtained using these special 6x86 features. If you want to explore these options, I suggest you take a look at my Linux Benchmarking HOWTO, and also check the Linux Benchmarking Project pages. You may also want to read my Linux Benchmarking articles in issues #22, #23 and #24 of the Linux Gazette.

On the other hand, Koen Gadeyne has shown that correctly setting an ARR for the video RAM will allow a much greater video memory bandwidth when accessing the Linear Address Video Memory Buffer, if you have a PCI Video Card. Yes, Doom will run faster under X with this feature turned on!

This is quite easy to verify, if you have XFree86 3.2.x or 3.3.x:

  1. Enter X (in 8 bit color depth mode) and call the program dga, found in /usr/X11R6/bin, from an xterm.
  2. Don't worry if your screen colors go wild; this is normal. Now type the letter "b", wait a second and type "q" to end the test.
  3. Your xterm will be displaying the measured processor/video RAM bandwidth.
  4. Repeat the test with and without setting an ARR. You should get anything from a 10 to 50% improvement with the ARR correctly set.
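
To compare the two readings obtained in step 4, the relative gain can be worked out as below. This is only a minimal sketch in C: the function name is made up for illustration and is not part of dga or set6x86, and both readings are assumed to be in the same unit, whatever dga reports.

```c
#include <assert.h>

/* Relative bandwidth gain, in percent, between two dga readings.
 * Both readings must use the same unit. The function name is
 * hypothetical; it is not part of dga or set6x86. */
double arr_gain_percent(double without_arr, double with_arr)
{
    assert(without_arr > 0.0);
    return (with_arr - without_arr) / without_arr * 100.0;
}
```

For example, readings of 40 and 60 (illustrative numbers, not measurements) give a gain of 50%, the upper end of the range quoted above.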

You may want to experiment with various parameters/configurations. As mentioned in the set6x86 README, if it breaks, you get to keep the pieces...

Gcc optimizations for 6x86 family CPUs

C source code compiled with gcc for 6x86 processors does not need any special switches for maximum performance. Usually, -O2 will provide performance within 1% of what can be obtained by adding a dozen or so specific switches.

This is due in part to the advanced architectural features found on 6x86 family CPUs: out-of-order execution, register renaming, branch prediction, speculative execution, superpipelining, etc.

Consequently, there is little point in using a Pentium-optimized compiler, or in trying to create a 6x86-optimized version of gcc. Few instruction pairing rules can be implemented for 6x86 CPUs, and these will provide only marginal (< 1%) performance gains in normal use.

On the other hand, there is an optimization sometimes used with 486 and Pentium CPUs that should NOT be used with 6x86 family CPUs: NOP padding to align code on 32-byte boundaries. This increases the size of the executable code and results in lower hit rates in the L1 and L2 caches.
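
To get a feel for the code-size cost of such padding, the sketch below computes how many filler bytes are needed to move a code offset up to the next aligned boundary. It is an illustration only: the function name is hypothetical, and it models plain address arithmetic, not what any particular gcc version actually emits.

```c
#include <assert.h>
#include <stddef.h>

/* Number of filler (NOP) bytes needed to move `offset` up to the next
 * multiple of `alignment`. `alignment` must be a power of two.
 * Hypothetical helper, for illustration only. */
size_t padding_bytes(size_t offset, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (alignment - (offset & (alignment - 1))) & (alignment - 1);
}
```

With 32-byte alignment, a target that falls one byte past a boundary costs 31 bytes of padding, so every aligned label can add up to 31 bytes that occupy cache space without doing useful work.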

6x86 vs. Pentium FPU performance

Although I consider the floating-point performance of 6x86 CPUs more than adequate, synthetic benchmarks have demonstrated that on pure floating-point instruction sequences Pentium CPUs are up to 50% faster than 6x86 CPUs at the same clock speed. In other words, a 6x86 PR-166+ clocked at 133 MHz will do about as well as a 90 MHz Pentium on these benchmarks. You will find a short description of the Whetstone floating-point benchmark in my article on Linux Benchmarking in the October issue of the Linux Gazette.
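
The 90 MHz figure follows from dividing the 6x86 clock by the Pentium's per-clock advantage. A minimal sketch of that arithmetic in C, assuming the "up to 50%" upper bound (a factor of 1.5) applies; the function name is made up for illustration:

```c
#include <assert.h>

/* Clock speed (MHz) at which a Pentium would match a 6x86 on pure FP
 * code, given the Pentium's per-clock advantage as a factor
 * (1.5 means "50% faster at the same clock"). Hypothetical helper. */
double equivalent_pentium_mhz(double cyrix_mhz, double pentium_advantage)
{
    assert(pentium_advantage > 0.0);
    return cyrix_mhz / pentium_advantage;
}
```

Calling equivalent_pentium_mhz(133.0, 1.5) gives roughly 88.7 MHz, which the text rounds to a 90 MHz Pentium.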

Now, of course, floating-point (FP) instructions represent, by my estimate, on average < 0.001% of the instructions executed on most GNU/Linux boxes.

Unless you are running FP computational code, very little of the code executed on a GNU/Linux system in day-to-day use touches the FPU. I think X uses FP code to scale fonts the first time it loads a new font, but that's about all.

As an example: in the RC-5 cracking effort, the algorithm used to search for the key uses only integer instructions, and the 6x86 chips easily beat their Pentium and Pentium MMX counterparts, thanks to a more efficient implementation of the ROTL instruction.

On the other hand, if you are into multimedia applications and wish to use one of the standard MPEG decoders available for Linux, then the 6x86 will clearly provide performance below that of a Pentium, because of its slower FPU. This shortcoming was pointed out to me by Orlando Andico, whom I am trying to coax into writing a multimedia benchmark :-). The 6x86MX should do much better, especially if the MPEG decoding code is MMX optimized, but it will still be slower than a Pentium MMX.

It is a pity that most CPU manufacturers dedicate such a large percentage of the CPU die area (approximately 11% for the original Pentium chip) and of their R&D expenses to sometimes buggy FPU implementations. What's more, the Linux kernel has excellent 387 emulation (actually the Linux FPU emulator was designed to match the FPU in a 486DX), which can be used for the occasional FPU code.

Note that this does not mean that the software emulation will provide anywhere near dedicated FPU performance. It may be up to 100 times slower!

A simple way to test for the occurrence of FP instructions in normal GNU/Linux execution flow, without resorting to expensive tools like In Circuit Emulators (an ICE is an expensive piece of hardware that emulates a processor using special discrete logic), is to use the internal performance monitoring systems found in the Pentium and 6x86MX chips. This mechanism allows simultaneous counting of two different events (like FPU instructions and total instructions executed) on two separate registers, plus keeping a count of total cycles executed on a Time Stamp Counter. So on GNU/Linux systems with these CPUs it would take only a small assembly language program to verify the percentage guessed above.
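
As a rough idea of what such a program would compute, here is a minimal sketch in C with GCC inline assembly. It is not a working counter driver: programming and reading the two event counters generally has to be done from privileged code, which is not shown, so the two counts are taken as plain inputs. Only the Time Stamp Counter read uses a real instruction (RDTSC); the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Percentage of executed instructions that were FP instructions,
 * given two raw event counts. How the counts are collected is
 * outside this sketch. Hypothetical helper name. */
double fp_share_percent(uint64_t fp_instructions, uint64_t total_instructions)
{
    assert(total_instructions > 0);
    return 100.0 * (double)fp_instructions / (double)total_instructions;
}

/* Read the Time Stamp Counter with RDTSC (GCC inline assembly).
 * Returns 0 when not compiled for an x86 target. */
uint64_t read_tsc(void)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    return 0;
#endif
}
```

With one FP instruction per 100,000 instructions executed, fp_share_percent returns about 0.001, which is the threshold guessed above; two calls to read_tsc around the measured interval would give the elapsed cycle count.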



Last updated on January 4, 1998.

Copyright 1997 Andrew D. Balsa