Scientific mathematics
This is an ancient web page. More recent notes are available here and here.
Whether modelling how air flows across a wing or testing the function of the human brain, mathematics plays an important role in our growing understanding of the world. Computers allow scientists to perform vast numbers of computations very quickly. However, even with the fastest computers currently available, it can take days to analyse brain activity.

Standard desktop PCs using AMD or Intel CPUs are very fast at integer mathematics (e.g. 3*4), but are relatively slow at real number mathematics (e.g. 3.01*4.1). Therefore, scientists have traditionally relied on Sun and Alpha workstations to do their number crunching. These expensive computers are typically much faster than desktop PCs at real number computations.

One way to increase the speed of real number processing is to use integers instead. For example, to compute 3.01*4.1, I could get a similar answer using the integers 3010*4100 (as long as I remember that I have scaled the input values). These integer maths techniques are very useful for computer graphics, where precision is not critical. However, doing real number maths with integers gives you either low precision or a small range, so this trick is not suitable for most scientific applications.

Recent desktop computers have included 'Single Instruction, Multiple Data' (SIMD) commands that allow computers to do several mathematical computations simultaneously. Motorola chips (used in the Macintosh series) have Altivec instructions, Intel chips support SSE and SSE2 instructions, while AMD chips support 3DNow! instructions. For example, the Intel SSE set allows you to multiply four numbers simultaneously. Programs that support these instructions can potentially run much more quickly. However, very few scientific programs initially supported SIMD. One problem is that it is generally difficult for programmers to create SIMD programs. A second problem is precision. Most scientific questions require 64-bit real numbers (a.k.a. 'double' precision, giving 15-16 significant digits). Unfortunately, the popular SIMD sets (Altivec, SSE, 3DNow!) only support 32-bit real numbers (a.k.a. 'single' precision, giving 7-8 significant digits). Fortunately, the recently released SSE2 supports double precision SIMD (albeit only two double-precision numbers at a time, instead of four single-precision numbers).

The SSE2 functions of the Pentium 4 processor offer great potential. Even so, most scientists were better served by an Athlon system than a Pentium 4. However, as my software below demonstrates, Pentium 4 optimization can be as easy as recompiling the software with specific compiler directives (e.g. quad-word byte alignment).

Scientists and other individuals interested in seeing the performance benefit of SIMD commands can try my Simdtest program, which runs on Windows computers. It measures the time to complete a large number of mathematical operations using either standard or SIMD commands. It also demonstrates the speed difference between single and double precision standard instructions. The project includes Delphi source code (requires Delphi 4+ and Stefano Tommesani's lovely free Exentia SIMD code). The code also demonstrates how to use the Windows 'QueryPerformanceCounter' function to measure time with a fair degree of accuracy.
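To give a flavour of the timing approach, here is a minimal Delphi sketch. It is not the actual Simdtest source; the loop body is just a stand-in for real floating-point work, and the only Windows calls it relies on are QueryPerformanceFrequency and QueryPerformanceCounter from the Windows unit.

```pascal
program TimingSketch;
{$APPTYPE CONSOLE}
uses
  Windows, SysUtils;

var
  freq, t0, t1: Int64;  // counter frequency (ticks per second) and start/stop ticks
  i: Integer;
  x: Double;
begin
  // The high-resolution counter frequency varies between machines,
  // so it must be queried before tick counts can be converted to time.
  if not QueryPerformanceFrequency(freq) then
    Halt(1);

  QueryPerformanceCounter(t0);       // start timestamp
  x := 1.0;
  for i := 1 to 10000000 do          // stand-in for the real work being timed
    x := x * 1.000001;
  QueryPerformanceCounter(t1);       // stop timestamp

  // Elapsed ticks divided by ticks-per-second gives seconds; x1000 for milliseconds.
  WriteLn(Format('Elapsed: %.2f ms (result %g)',
    [(t1 - t0) * 1000.0 / freq, x]));
end.
```

Printing x at the end keeps an aggressive compiler from discarding the timed loop entirely.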
This software is not designed as any kind of benchmark. However, I think it can illustrate a few simple principles for writing faster floating-point software. Here are some values from a few computers (all running Windows XP, except where noted).
Divides are much slower than multiplies. If you want to divide many numbers by the same value, it is often much quicker to compute that value's reciprocal (1/x) and then multiply. For example, instead of dividing a series of numbers by 4, it is much quicker to multiply them by 0.25 (i.e. 1/4). In fact, in my test program the multiplies are always the reciprocals of the values used in the divides, so the divides and multiplies are compared on the same data. Note that in some situations reciprocals can retain slightly less precision than divides, and that the radix-16 divider in newer (Penryn) CPUs reduces this effect.

For Windows: eight-byte-align your data. The Pentium 4 has a reputation for being slow with real numbers when running software that has not been optimized. In particular, there is a large cost for processing 8-byte double precision numbers that are not aligned in memory (aligned meaning that the memory address is evenly divisible by 8, a.k.a. "aligned on quad-word boundaries"). The red and green boxes above illustrate this: multiplies are ten times faster on the Pentium 4 when the data is aligned. It is also worth noting that other fast processors (e.g. the Pentium 3) benefit from alignment as well. With some compilers, alignment is a simple matter of setting a compiler directive and recompiling the software (Delphi does not appear to allow this: even the {$A8} directive does NOT align memory allocated with GetMem; a manual workaround is sketched at the end of this page). Fortunately, the memory managers in modern operating systems help automatically: under Linux, GetMem returns addresses evenly divisible by 8, while under Windows XP GetMem returns addresses evenly divisible by 4. Therefore, users who upgrade to Windows XP will automatically see better floating-point performance (more so for single precision than for double precision, as the addresses are always divisible by four but not necessarily by eight); the software does not need to be recompiled.

NaN multiplication is slow on the Pentium 4. My test program also has a button called 'NaN Test' that measures the time required to multiply real numbers (specifically 1x1) versus "Not a Number" values (specifically 1xNaN). With an AMD Athlon there is no difference (41 ms for real vs 41 ms for NaN on a 1800 MHz Athlon XP 2200; 33:33 for a 2200 MHz Athlon64 3400+). The Pentium 3 is about 14 times slower for NaN calculations (126:1795 ms on an 800 MHz system), and the Banias Pentium M is also considerably slower for NaNs (94:2256 on a 1 GHz system). Finally, the Pentium 4 is about 135 times slower (40:5425 for a 2000 MHz Northwood P4; 61:3225 for a 3000 MHz Xeon). The test uses double precision floats and was run as 10,000 repetitions over a 1024-entry array (10,240,000 calculations); a rough sketch of such a test appears at the end of this page. One solution is to use SSE/SSE2 instructions, as SSE NaN calculations incur no penalty (i.e. NaNs are computed just as fast as real numbers). For further information, see this excellent article at Cygnus-Software.
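To make the alignment and reciprocal points concrete, here is a small Delphi sketch. It is not taken from Simdtest or Exentia; the GetMemAligned8 helper is a hypothetical workaround that simply over-allocates with GetMem and rounds the address up to the next 8-byte boundary, and it assumes a 32-bit compiler (a 64-bit build would need a pointer-sized integer for the cast).

```pascal
program AlignAndReciprocal;
{$APPTYPE CONSOLE}
uses
  SysUtils;

const
  N = 1024;

type
  TDoubleArray = array[0..N - 1] of Double;
  PDoubleArray = ^TDoubleArray;

// Hypothetical helper (not part of Delphi, Exentia or Simdtest): over-allocate
// by 8 bytes and round the address up so the doubles start on an 8-byte
// (quad-word) boundary. Assumes 32-bit pointers; free rawPtr, not the result.
function GetMemAligned8(bytes: Integer; var rawPtr: Pointer): Pointer;
begin
  GetMem(rawPtr, bytes + 8);
  Result := Pointer((Integer(rawPtr) + 7) and not 7);
end;

var
  raw: Pointer;
  data: PDoubleArray;
  i: Integer;
  divisor, reciprocal: Double;
begin
  data := GetMemAligned8(SizeOf(TDoubleArray), raw);
  for i := 0 to N - 1 do
    data^[i] := i + 1;

  divisor := 4.0;
  reciprocal := 1.0 / divisor;          // one divide up front...

  // ...then N cheap multiplies instead of N expensive divides.
  for i := 0 to N - 1 do
    data^[i] := data^[i] * reciprocal;  // same result as data^[i] / divisor

  WriteLn(Format('data[0]=%g, data[%d]=%g', [data^[0], N - 1, data^[N - 1]]));
  FreeMem(raw);                         // free the original, unaligned block
end.
```

Note that the caller must free the original pointer returned by GetMem, not the aligned one, and that precision-sensitive code should remember that multiplying by a reciprocal can differ from a true divide in the last bit or so.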
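Finally, for the NaN point above, here is a rough sketch of that kind of test. It is not the Simdtest 'NaN Test' itself; it builds a quiet NaN directly from its bit pattern and times repeated multiplications by 1.0 versus by NaN, using the same QueryPerformanceCounter approach as the first sketch.

```pascal
program NaNTimingSketch;
{$APPTYPE CONSOLE}
uses
  Windows, SysUtils;

const
  REPS = 10000;   // repetitions, as in the test described above
  N    = 1024;    // multiplications per repetition

var
  Sink: Double;   // written to so the compiler cannot discard the loops

// Time REPS*N multiplications by 'value' and return the elapsed milliseconds.
function TimeMultiplies(value: Double): Double;
var
  freq, t0, t1: Int64;
  r, i: Integer;
  acc: Double;
begin
  QueryPerformanceFrequency(freq);
  QueryPerformanceCounter(t0);
  acc := 1.0;
  for r := 1 to REPS do
    for i := 1 to N do
      acc := acc * value;   // quiet NaNs propagate here without raising exceptions
  QueryPerformanceCounter(t1);
  Sink := acc;
  Result := (t1 - t0) * 1000.0 / freq;
end;

var
  nan: Double;
  nanBits: Int64 absolute nan;  // the same 8 bytes viewed as an integer
begin
  // Build a quiet NaN by writing its IEEE-754 bit pattern directly;
  // computing 0/0 would instead raise an exception under Delphi's
  // default floating-point control word.
  nanBits := $7FF8000000000000;

  WriteLn(Format('1 x 1.0 : %.0f ms', [TimeMultiplies(1.0)]));
  WriteLn(Format('1 x NaN : %.0f ms', [TimeMultiplies(nan)]));
end.
```

On processors with a NaN penalty (e.g. the Pentium 4's x87 unit), the second line should report a much larger time than the first; with SSE2 code, or on an Athlon, the two times should be similar.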