Scientific mathematics
  This is an ancient web page. More recent notes are available here and here

Whether modelling how air flows across a wing or testing the function of the human brain, mathematics plays an important role in our growing understanding of the world. Computers allow scientists to perform vast numbers of computations very quickly. However, even with the fastest computers currently available, it can require days for a computer to do an analysis of brain activity.

Standard desktop PCs using AMD or Intel CPUs are very fast at integer mathematics (e.g. 3*4), but are relatively slow at real number mathematics (e.g. 3.01*4.1). Therefore, scientists have traditionally relied on Sun and Alpha workstations to do their number crunching. These expensive computers are typically much faster than desktop PCs at real number computations.

One solution to increasing the speed of real number processing is to use integers instead. For example, to compute 3.01*4.1, I could get a similar answer using the integers 3010*4100 (as long as I remember that I have scaled the input values). These integer maths techniques are very useful for computer graphics, where precision is not critical. However, representing real numbers with integers forces a trade-off: either low precision or a small range. Therefore, this trick is not suitable for most scientific applications.

Recent desktop computers have included 'Single Instruction, Multiple Data' (SIMD) commands that allow computers to do several mathematical computations simultaneously. Motorola chips (used in the Macintosh series) have AltiVec instructions, Intel chips support SSE and SSE2 instructions, while AMD chips support 3DNow! instructions. For example, the Intel SSE set allows you to multiply four numbers simultaneously. Programs that support these instructions can potentially run much more quickly.

However, very few scientific programs initially supported SIMD. One problem is that it is generally difficult for programmers to create SIMD programs. A second problem is precision. Most scientific questions require 64-bit precision real numbers (a.k.a. 'double' precision, giving 15-16 significant digits of precision). Unfortunately, the popular SIMD sets (Altivec, SSE, 3DNow!) only support 32-bit real numbers (a.k.a. single precision, giving 7-8 significant digits of precision). Fortunately, the recently released SSE2 supports double precision SIMD (albeit, only two double-precision numbers at a time, instead of four single-precision numbers).

The SSE2 functions of the Pentium 4 processor offer great potential. Nevertheless, most scientists were better served by an Athlon system than a Pentium 4. However, as my software below demonstrates, Pentium 4 optimization can be as easy as recompiling the software with specific compiler directives (e.g. quad-word byte alignment).

Scientists and other individuals interested in seeing the performance benefit of SIMD commands can try my Simdtest program, which runs on Windows computers. It measures the time to complete a large number of mathematical operations using either standard or SIMD commands. It also demonstrates the speed difference between single and double precision standard instructions. The project includes Delphi source code (requires Delphi 4+ and Stefano Tommesani's lovely free Exentia SIMD code). The code also demonstrates how to use the Windows 'QueryPerformanceCounter' function to measure time with a fair degree of accuracy.

My program allows you to set the number of computations by adjusting the Array Size and the number of repetitions (cycles). You can also set whether the data uses 'Default byte alignment' (whatever memory address the operating system returns), is forced into alignment (addresses evenly divisible by 16), or is forced into poor alignment (when addresses are divided by 16, there is always a remainder of 1). When you press 'Compute' you will see the processing time as well as information on the byte alignment. In this example, when addresses are divided by 16 the remainders are 8, 12, 0, 12, 0 and 4 (shown as 's8s12s0 d12d0d4' for the three single precision arrays and the three double precision arrays). All of these addresses are divisible by 4, but not all by 8: single precision multiplication will be fast, but double precision will be slow. When forced alignment is chosen, this label will report 's0s0s0 d0d0d0'; when data is forced to have poor alignment, the label will read 's1s1s1 d1d1d1'. Note that computation times will vary randomly when 'Default byte alignment' is chosen, as the arrays will happen to be better aligned on some runs than others.

This software is not designed as any kind of benchmark. However, I think it can illustrate a few simple principles for writing faster floating point software. Here are some values from a few computers (all running Windows XP, except where noted).

Multiply/Divide Time (ms) — each cell shows multiply time / divide time

System (OS)                  single      singleSIMD  double      doubleALIGNED  doubleSIMD
Pentium 233 (Win98)          1070/4307   -           1590/4740   1450/4630      -
K6-2 333 (Win98)             930/3290    100/310     1470/3700   1300/3520      -
Celeron 300 (Win98)          370/2580    -           600/2570    480/2570       -
Pentium III 800 (Win2000)    130/950     35/220      450/1300    170/950        -
Pentium M 1000 (Banias)      158/769     26/176      363/1045    83/765         52/640
Pentium 4 1300 (Win98)       100/620     35/155      1030/1518   100/620        70/550
Athlon 1400 (Linux)          136/307     25/111      *           83/414         -
Pentium 4 2500               68/359      18/90       *           61/451         35/296
Athlon XP 2500 (1870)        317/234     17/84       *           80/234         -
Athlon 64 3400+ (2200)       89/198      13/73       *           39/200         26/163
dual Xeon 3000               54/287      -           *           40/287         26/230
Core Duo 2160                166/526     10/81       *           38/353         21/296
Core 2 Duo 2000              51/395      6/44        *           35/407         14/180
Core 2 Duo 2500 (Penryn)     38/186      4/23        *           24/168         10/77

* Default allocations on these systems were always 8-byte aligned (under Linux, on the Athlon 1400) or always 4-byte aligned (under Windows XP).

Divides are much slower than multiplies. If you want to divide many numbers by the same value, it is often much quicker to compute the value's reciprocal (1/x) and then perform multiplication. For example, instead of dividing a series of numbers by 4, it is much quicker to multiply by 0.25 (i.e. 1/4). In fact, in my test program the multiplies always use the reciprocals of the values used in the divides, illustrating the difference between divides and multiplies on the same data. Note that in some situations reciprocals can retain slightly less precision than divides. Note also that the Penryn's Radix-16 divider reduces this effect.

For Windows: eight-byte-align data. The Pentium 4 has a reputation for being slow with real numbers when running software that has not been optimized. In particular, there is a large cost for processing 8-byte double precision numbers that are not aligned in memory (aligned meaning that the memory address is evenly divisible by 8, a.k.a. "aligned on quad-word boundaries"). The figures above illustrate this: multiplies are ten times faster on the Pentium 4 when the data is byte-aligned. It is also worth noting that other fast processors (e.g. the Pentium III) also benefit from byte alignment. With some compilers, byte alignment is a simple matter of setting a compiler directive and recompiling the software (Delphi does not appear to allow this: even the {$A8} directive does NOT align memory allocated with GetMem). Fortunately, the memory managers in modern operating systems do this automatically (e.g. under Linux GetMem returns addresses evenly divisible by 8, while under Windows XP GetMem returns addresses evenly divisible by 4). Therefore, users who upgrade to Windows XP will automatically notice better floating-point performance (more so for single precision than double precision, as the addresses are always divisible by four but not necessarily by eight); the software does not need to be recompiled.

NaN multiplication is slow with the Pentium 4. My test program also has a button called 'NaN Test' that compares the processing time required to multiply real numbers (specifically 1x1) and "Not A Number" values (specifically 1xNaN). With an AMD Athlon, there is no difference (41 ms for real vs 41 ms for NaN on a 1800 MHz Athlon XP 2200; 33:33 for a 2200 MHz Athlon 64 3400+). The Pentium III is 14 times slower for NaN calculations (126:1795 ms with an 800 MHz system), with the Banias Pentium M also somewhat slower for NaNs (94:2256 for a 1 GHz system). Finally, the Pentium 4 is 135 times slower (40:5425 for a 2000 MHz Northwood P4; 61:3225 for a 3000 MHz Xeon). This test uses double precision floats, and my tests were conducted with 10,000 repetitions over a 1024-entry array (10,240,000 calculations). One solution is to use SSE/SSE2 instructions, as SSE NaN calculations do not incur a penalty (i.e. NaNs are computed just as fast as real numbers). For further information, see this excellent article at Cygnus-Software.

For more detailed information on mathematical functions:

  • Stefano Tommesani's excellent site.