Scientific mathematics
This is an ancient web page. More recent notes are available here and here.
Whether modelling how air flows across a wing or testing the function of the human brain, mathematics plays an important role in our growing understanding of the world. Computers allow scientists to perform vast numbers of computations very quickly. However, even with the fastest computers currently available, it can take days to analyse brain activity.

Standard desktop PCs using AMD or Intel CPUs are very fast at integer mathematics (e.g. 3*4), but are relatively slow at real number mathematics (e.g. 3.01*4.1). Therefore, scientists have traditionally relied on Sun and Alpha workstations to do their number crunching. These expensive computers are typically much faster than desktop PCs at real number computations.

One way to increase the speed of real number processing is to use integers instead. For example, to compute 3.01*4.1, I could get a similar answer using the integers 3010*4100 (as long as I remember that I have scaled the input values). These integer maths techniques are very useful for computer graphics, where precision is not critical. However, doing real number maths with integers gives you either low precision or a small range, so this trick is not suitable for most scientific applications.

Recent desktop computers have included 'Single Instruction, Multiple Data' (SIMD) commands that allow computers to do several mathematical computations simultaneously. Motorola chips (used in the Macintosh series) have Altivec instructions, Intel chips support SSE and SSE2 instructions, while AMD chips support 3DNow! instructions. For example, the Intel SSE set allows you to multiply four numbers simultaneously. Programs that support these instructions can potentially run much more quickly. However, very few scientific programs initially supported SIMD. One problem is that it is generally difficult for programmers to create SIMD programs. A second problem is precision. Most scientific questions require 64-bit real numbers (a.k.a. 'double' precision, giving 15-16 significant digits). Unfortunately, the popular SIMD sets (Altivec, SSE, 3DNow!) only support 32-bit real numbers (a.k.a. 'single' precision, giving 7-8 significant digits). Fortunately, the recently released SSE2 supports double precision SIMD (albeit only two double-precision numbers at a time, instead of four single-precision numbers).

The SSE2 functions of the Pentium 4 processor offer great potential. Even so, most scientists were better served by an Athlon system than a Pentium 4. However, as my software below demonstrates, Pentium 4 optimization can be as easy as recompiling the software with specific compiler directives (e.g. quad-word byte alignment).

Scientists and other individuals interested in seeing the performance benefit of SIMD commands can try my Simdtest program, which runs on Windows computers. It measures the time to complete a large number of mathematical operations using either standard or SIMD commands. It also demonstrates the speed difference between single and double precision standard instructions. The project includes Delphi source code (requires Delphi 4+ and Stefano Tommesani's lovely free Exentia SIMD code). The code also demonstrates how to use the Windows 'QueryPerformanceCounter' function to measure time with a fair degree of accuracy.
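To give a flavour of the timing approach, here is a minimal Delphi sketch. It is not the actual Simdtest source; the loop body is just a stand-in for real floating-point work, and the only Windows calls it relies on are QueryPerformanceFrequency and QueryPerformanceCounter from the Windows unit.

```pascal
program TimingSketch;
{$APPTYPE CONSOLE}
uses
  Windows, SysUtils;

var
  freq, t0, t1: Int64;  // counter frequency (ticks per second) and start/stop ticks
  i: Integer;
  x: Double;
begin
  // The high-resolution counter frequency varies between machines,
  // so it must be queried before tick counts can be converted to time.
  if not QueryPerformanceFrequency(freq) then
    Halt(1);

  QueryPerformanceCounter(t0);       // start timestamp
  x := 1.0;
  for i := 1 to 10000000 do          // stand-in for the real work being timed
    x := x * 1.000001;
  QueryPerformanceCounter(t1);       // stop timestamp

  // Elapsed ticks divided by ticks-per-second gives seconds; x1000 for milliseconds.
  WriteLn(Format('Elapsed: %.2f ms (result %g)',
    [(t1 - t0) * 1000.0 / freq, x]));
end.
```

Printing x at the end keeps an aggressive compiler from discarding the timed loop entirely.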
This software is not designed as any kind of benchmark. However, I think it can illustrate a few simple principles for writing faster floating-point software. Here are some values from a few computers (all running Windows XP, except where noted).
Divides are much slower than multiplies. If you want to divide many numbers by the same value, it is often much quicker to compute that value's reciprocal (1/x) and then multiply. For example, instead of dividing a series of numbers by 4, it is much quicker to multiply them by 0.25 (i.e. 1/4). In fact, in my test program the multiplies are always the reciprocals of the values used in the divides, so the divides and multiplies are compared on the same data. Note that in some situations reciprocals can retain slightly less precision than divides, and that the radix-16 divider in newer (Penryn) CPUs reduces this effect.

For Windows: eight-byte-align your data. The Pentium 4 has a reputation for being slow with real numbers when running software that has not been optimized. In particular, there is a large cost for processing 8-byte double precision numbers that are not aligned in memory (aligned meaning that the memory address is evenly divisible by 8, a.k.a. "aligned on quad-word boundaries"). The red and green boxes above illustrate this: multiplies are ten times faster on the Pentium 4 when the data is aligned. It is also worth noting that other fast processors (e.g. the Pentium 3) benefit from alignment as well. With some compilers, alignment is a simple matter of setting a compiler directive and recompiling the software (Delphi does not appear to allow this: even the {$A8} directive does NOT align memory allocated with GetMem; a manual workaround is sketched at the end of this page). Fortunately, the memory managers in modern operating systems help automatically: under Linux, GetMem returns addresses evenly divisible by 8, while under Windows XP GetMem returns addresses evenly divisible by 4. Therefore, users who upgrade to Windows XP will automatically see better floating-point performance (more so for single precision than for double precision, as the addresses are always divisible by four but not necessarily by eight); the software does not need to be recompiled.

NaN multiplication is slow on the Pentium 4. My test program also has a button called 'NaN Test' that measures the time required to multiply real numbers (specifically 1x1) versus "Not a Number" values (specifically 1xNaN). With an AMD Athlon there is no difference (41 ms for real vs 41 ms for NaN on a 1800 MHz Athlon XP 2200; 33:33 for a 2200 MHz Athlon64 3400+). The Pentium 3 is about 14 times slower for NaN calculations (126:1795 ms on an 800 MHz system), and the Banias Pentium M is also considerably slower for NaNs (94:2256 on a 1 GHz system). Finally, the Pentium 4 is about 135 times slower (40:5425 for a 2000 MHz Northwood P4; 61:3225 for a 3000 MHz Xeon). The test uses double precision floats and was run as 10,000 repetitions over a 1024-entry array (10,240,000 calculations); a rough sketch of such a test appears at the end of this page. One solution is to use SSE/SSE2 instructions, as SSE NaN calculations incur no penalty (i.e. NaNs are computed just as fast as real numbers). For further information, see this excellent article at Cygnus-Software.
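To make the alignment and reciprocal points concrete, here is a small Delphi sketch. It is not taken from Simdtest or Exentia; the GetMemAligned8 helper is a hypothetical workaround that simply over-allocates with GetMem and rounds the address up to the next 8-byte boundary, and it assumes a 32-bit compiler (a 64-bit build would need a pointer-sized integer for the cast).

```pascal
program AlignAndReciprocal;
{$APPTYPE CONSOLE}
uses
  SysUtils;

const
  N = 1024;

type
  TDoubleArray = array[0..N - 1] of Double;
  PDoubleArray = ^TDoubleArray;

// Hypothetical helper (not part of Delphi, Exentia or Simdtest): over-allocate
// by 8 bytes and round the address up so the doubles start on an 8-byte
// (quad-word) boundary. Assumes 32-bit pointers; free rawPtr, not the result.
function GetMemAligned8(bytes: Integer; var rawPtr: Pointer): Pointer;
begin
  GetMem(rawPtr, bytes + 8);
  Result := Pointer((Integer(rawPtr) + 7) and not 7);
end;

var
  raw: Pointer;
  data: PDoubleArray;
  i: Integer;
  divisor, reciprocal: Double;
begin
  data := GetMemAligned8(SizeOf(TDoubleArray), raw);
  for i := 0 to N - 1 do
    data^[i] := i + 1;

  divisor := 4.0;
  reciprocal := 1.0 / divisor;          // one divide up front...

  // ...then N cheap multiplies instead of N expensive divides.
  for i := 0 to N - 1 do
    data^[i] := data^[i] * reciprocal;  // same result as data^[i] / divisor

  WriteLn(Format('data[0]=%g, data[%d]=%g', [data^[0], N - 1, data^[N - 1]]));
  FreeMem(raw);                         // free the original, unaligned block
end.
```

Note that the caller must free the original pointer returned by GetMem, not the aligned one, and that precision-sensitive code should remember that multiplying by a reciprocal can differ from a true divide in the last bit or so.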
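Finally, for the NaN point above, here is a rough sketch of that kind of test. It is not the Simdtest 'NaN Test' itself; it builds a quiet NaN directly from its bit pattern and times repeated multiplications by 1.0 versus by NaN, using the same QueryPerformanceCounter approach as the first sketch.

```pascal
program NaNTimingSketch;
{$APPTYPE CONSOLE}
uses
  Windows, SysUtils;

const
  REPS = 10000;   // repetitions, as in the test described above
  N    = 1024;    // multiplications per repetition

var
  Sink: Double;   // written to so the compiler cannot discard the loops

// Time REPS*N multiplications by 'value' and return the elapsed milliseconds.
function TimeMultiplies(value: Double): Double;
var
  freq, t0, t1: Int64;
  r, i: Integer;
  acc: Double;
begin
  QueryPerformanceFrequency(freq);
  QueryPerformanceCounter(t0);
  acc := 1.0;
  for r := 1 to REPS do
    for i := 1 to N do
      acc := acc * value;   // quiet NaNs propagate here without raising exceptions
  QueryPerformanceCounter(t1);
  Sink := acc;
  Result := (t1 - t0) * 1000.0 / freq;
end;

var
  nan: Double;
  nanBits: Int64 absolute nan;  // the same 8 bytes viewed as an integer
begin
  // Build a quiet NaN by writing its IEEE-754 bit pattern directly;
  // computing 0/0 would instead raise an exception under Delphi's
  // default floating-point control word.
  nanBits := $7FF8000000000000;

  WriteLn(Format('1 x 1.0 : %.0f ms', [TimeMultiplies(1.0)]));
  WriteLn(Format('1 x NaN : %.0f ms', [TimeMultiplies(nan)]));
end.
```

On processors with a NaN penalty (e.g. the Pentium 4's x87 unit), the second line should report a much larger time than the first; with SSE2 code, or on an Athlon, the two times should be similar.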