From cbloodworth@juno.com Mon Jun 21 12:30:55 1999

On May 23, Dominique Delande (Paris, France) started a 1 billion digit pi run using v2.3 of Carey Bloodworth's (Arkansas, USA) desktop pi program ('billion' here means 2 to the power of 30, in the usual computer sense). The computation took 758,058 seconds.

A second run was then started with a different formula and a different multiplication module. This run took 711,552 seconds. Of the 1,073,741,824 digits computed, only the last 143 differed. Therefore, it can be safely assumed that 1,073,741,500 digits are correct.

Dominique used a Dell Workstation 610 with two Pentium II Xeon 450 MHz processors and 1 GByte of RAM. The OS was Linux (Red Hat 5.2 with a 2.2.6 kernel), and the latest version (egcs-1.1.2) of the GNU C compiler was used. Although the program could be parallelized, this has not yet been done, so only one processor was used. About 512 MBytes of memory and approximately 8 GBytes of disk were used.

Although the hardware cost approximately US$8000 at the time it was purchased, a comparably configured computer should be much cheaper now, especially since only a single processor was used, and only about 600 MBytes of the 1 GByte of available memory was actually used. Although computer hardware prices fluctuate drastically, a comparably configured current computer could probably be built for under US$3000. That works out to roughly 358,000 digits per dollar spent. Needless to say, that's far more "bang for the buck" than with a supercomputer.

The program was originally designed by Carey Bloodworth to compute just one million digits in one day on old, slow computers, such as 386 or 68000 systems. A very limited goal. The program evolved until the goal was to compute 32 million digits on a limited 486/66 system, and to do so faster than Dr. David Bailey did with a Cray-2 in 1986. In spite of the massive difference in computing power (well over 8 times, even allowing for the realities of programming), this goal was reached with a total run time of 36.5 hours, including 8.4 hours of disk I/O.

Minor modifications since then have increased the limit to 1 billion digits, as well as improving performance on Pentium class processors.

The program is designed to be extremely portable to a wide range of systems, processors, compilers and memory amounts. It is specifically designed for _desktop_ systems, with the speed, disk, and memory limitations they all share, rather than for large workstations, mainframes, or supercomputers.

The program also includes a large number of options and features for hobbyist pi computers who only want a few million digits, or for people with older, slower computers. It's designed to satisfy people who want to benchmark various pi programs at just 1 million digits, all the way up to people who want to compute hundreds of millions of digits.

The program is written for a 32 bit processor, in standard ANSI/ISO portable C. It also includes hand written assembly modules for some key routines for those of us using GNU C and an Intel x86 processor.

It is distributed in full source code with extensive documentation and tutorials. This allows everybody to experiment with a state of the art single processor pi program, no matter who they are or what computer system they have.

The program implements two different Arithmetic Geometric Mean (AGM) formulas, along with the Borwein Quartic formula. The AGMs include 5 different implementations, to allow for varying disk capacities and user preference. The primary formula is a basic Salamin-Brent-Gauss AGM modified to remove the need for the two value multiplication, needing only a single squaring operation, a square root, and a handful of other simple operations. The AGM implementation also has rudimentary abilities to run self checks on the computation, detecting many forms of hardware and software errors.
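
For readers who want to see the shape of the iteration, here is a minimal double precision sketch of the underlying Salamin-Brent (Gauss-Legendre) AGM. It only illustrates the formula, not the program's arbitrary-precision code, but it shows why only squarings and a square root are needed: sqrt(a*b) equals sqrt(((a+b)/2)^2 - ((a-b)/2)^2).

    /* Minimal double precision sketch of the Salamin-Brent (Gauss-Legendre)
       AGM iteration for pi.  Illustration only; the actual program runs the
       same steps on huge fixed point numbers stored on disk. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double a = 1.0, b = 1.0 / sqrt(2.0), t = 0.25, p = 1.0;
        int i;

        for (i = 0; i < 4; i++)            /* precision roughly doubles each pass */
        {
            double a_new = 0.5 * (a + b);
            double c     = 0.5 * (a - b);
            b = sqrt(a * b);               /* equals sqrt(a_new^2 - c^2), so only */
                                           /* squarings and one square root       */
            t -= p * c * c;
            p *= 2.0;
            a  = a_new;
        }
        printf("pi ~= %.15f\n", (a + b) * (a + b) / (4.0 * t));
        return 0;
    }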

All multiplications are done using 'half' sized numbers, meaning the 1 billion digit computation only multiplied 512 million digit numbers. This significantly reduces the memory and disk required.

The basic data format is a simple fixed point number format, using explicit disk based numbers, avoiding all use of virtual memory.

The square root and reciprocation are standard Newton routines that utilize Fast Fourier Transform (FFT) caching and the techniques that Alan Karp and Peter Markstein published in 1993. Under optimal memory conditions, this results in the square root and division taking about 3.6 and 3.9 times the cost of a squaring multiply, or 2.7 and 2.9 times the cost of a two value multiply.
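
Here is a minimal double precision sketch of the two self-correcting Newton iterations involved (the Karp/Markstein refinement, which folds the final step into a single full precision multiply, is not shown):

    /* Minimal double precision sketch of the Newton iterations used for
       the reciprocal and the square root.  The real routines run these on
       huge fixed point numbers, doubling the working precision each pass. */
    #include <math.h>
    #include <stdio.h>

    /* reciprocal:  x <- x * (2 - a*x)  converges to 1/a */
    static double recip(double a)
    {
        double x = 1.0f / (float)a;            /* low precision starting guess */
        int i;
        for (i = 0; i < 3; i++)
            x = x * (2.0 - a * x);
        return x;
    }

    /* square root via the reciprocal square root:
       y <- y * (3 - a*y*y) / 2  converges to 1/sqrt(a), then sqrt(a) = a*y */
    static double square_root(double a)
    {
        double y = 1.0f / (float)sqrt(a);      /* low precision starting guess */
        int i;
        for (i = 0; i < 3; i++)
            y = y * (3.0 - a * y * y) * 0.5;
        return a * y;
    }

    int main(void)
    {
        printf("1/3     ~= %.15f\n", recip(3.0));
        printf("sqrt(2) ~= %.15f\n", square_root(2.0));
        return 0;
    }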

The multiplication is done using Number Theoretic Transforms, rather than a floating point FFT. Simple modular arithmetic is used to spread 32 decimal digits into 8 'small' separate NTTs, and then later a large 32 byte (256 bit) Chinese Remainder Theorem is used to combine the 8 parts back into the final result. The reduced size of the NTTs allows each NTT to be done fully in memory, with the data explicitly paged to disk outside of the transform. Due to the inherently segmented nature of using 8 separate NTTs, and the low memory requirements of each, a regular disk based transform is not required or even included.
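
To make the CRT step concrete, here is a minimal two prime sketch of recombining one coefficient from its residues. The primes are small, illustrative NTT friendly primes (of the form k*2^n + 1), not the ones the program actually uses, and the real CRT combines 8 residues into a 256 bit value.

    /* Minimal sketch of recombining residues from two NTT primes with the
       Chinese Remainder Theorem.  The real program uses 8 primes and a
       256 bit CRT; the primes here are illustrative only. */
    #include <stdint.h>
    #include <stdio.h>

    #define P1 40961u               /* 5 * 2^13 + 1 */
    #define P2 65537u               /* 2^16 + 1     */

    /* modular inverse of a mod m, by brute force (fine for a demo) */
    static uint32_t inv_mod(uint32_t a, uint32_t m)
    {
        uint32_t x;
        for (x = 1; x < m; x++)
            if ((uint64_t)a * x % m == 1)
                return x;
        return 0;
    }

    /* given r1 = v mod P1 and r2 = v mod P2, recover v < P1*P2 */
    static uint64_t crt2(uint32_t r1, uint32_t r2)
    {
        uint32_t inv = inv_mod(P1 % P2, P2);
        uint32_t d   = (uint32_t)(((uint64_t)(r2 + P2 - r1 % P2) * inv) % P2);
        return (uint64_t)r1 + (uint64_t)P1 * d;
    }

    int main(void)
    {
        uint64_t v = 1234567890ull;          /* some convolution coefficient */
        printf("recovered: %llu\n",
               (unsigned long long)crt2((uint32_t)(v % P1), (uint32_t)(v % P2)));
        return 0;
    }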

The program can also use standard Karatsuba multiplication in case even the small size of the individual NTTs will not fit into memory. This is really a remnant of earlier programs, but it can still be useful in an 'emergency'. The 1 billion digit run with the Pentium-specific version had to use one level, for the reasons explained below. The other versions do not need it.
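
For anyone unfamiliar with it, here is a minimal sketch of a single level of Karatsuba splitting, shown on 32 bit machine words rather than on the program's huge disk based numbers: one full sized product is traded for three half sized products plus a few additions.

    /* One level of Karatsuba splitting, illustrated on 32 bit words
       (the program applies the same idea to huge disk based numbers). */
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t karatsuba32(uint32_t a, uint32_t b)
    {
        uint32_t a1 = a >> 16, a0 = a & 0xFFFFu;   /* split into halves */
        uint32_t b1 = b >> 16, b0 = b & 0xFFFFu;

        uint64_t z2 = (uint64_t)a1 * b1;           /* high * high       */
        uint64_t z0 = (uint64_t)a0 * b0;           /* low  * low        */
        uint64_t z1 = (uint64_t)(a1 + a0) * (b1 + b0) - z2 - z0;  /* cross */

        return (z2 << 32) + (z1 << 16) + z0;
    }

    int main(void)
    {
        uint32_t a = 123456789u, b = 987654321u;
        printf("%llu (expect %llu)\n",
               (unsigned long long)karatsuba32(a, b),
               (unsigned long long)((uint64_t)a * b));
        return 0;
    }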

The structure of the NTT transform itself is a recursive / iterative combination. The recursive part breaks the transform into cache size chunks, which greatly improves the performance with a minimum of programming effort. The Pentium (586) version of the NTT is based upon some clever assembly code written by Jason Stratos Papadopoulos at the University of Maryland. He observed that the Pentium's FPU is significantly faster than the integer unit, so he wrote some FPU assembly to do 4 modular multiplications at a time. Since the Pentium's FPU can only easily deal with 32 bit signed data (31 bits of unsigned information), this does reduce the multiplication size to only 256 million digit numbers, rather than the 1 billion digit multiplication that the regular version can do.

Montgomery style modular multiplication is not used, due to the limited number of registers on Intel's x86 architecture. (Although this could have been added if there had been a need.) The regular modular multiplication is done as a normal multiply & divide operation. The Pentium version, of course, uses the FPU to multiply by the reciprocal of the prime.
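
A scalar C sketch of the reciprocal idea is given below. The real code is x87 FPU assembly that processes 4 values at a time; the prime here is just a commonly used NTT prime under 2^31, not necessarily one of the program's.

    /* Scalar C sketch of modular multiplication by a precomputed floating
       point reciprocal of the prime, the idea behind the Pentium FPU code.
       The prime is illustrative only. */
    #include <stdint.h>
    #include <stdio.h>

    #define P 2013265921u                  /* 15 * 2^27 + 1, fits in 31 bits */

    static uint32_t mulmod(uint32_t a, uint32_t b)
    {
        static const double recip = 1.0 / (double)P;

        /* estimate the quotient with floating point ... */
        uint64_t q = (uint64_t)((double)a * (double)b * recip);

        /* ... then compute the remainder exactly and fix up the
           (at most off-by-one) rounding error of the estimate. */
        int64_t r = (int64_t)((uint64_t)a * b - q * (uint64_t)P);
        if (r < 0)                r += P;
        else if (r >= (int64_t)P) r -= P;
        return (uint32_t)r;
    }

    int main(void)
    {
        printf("%u\n", mulmod(123456789u, 987654321u));   /* (a*b) mod P */
        return 0;
    }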

A 512 million digit pi computation requires as little as 64 MBytes of memory for a single NTT transform, although it would prefer 1 GByte for a full in-memory multiplication. The total disk space preferred is approximately 8*Digits bytes, although that can be reduced to as little as 4*Digits bytes with reduced performance. Those amounts are well within what a typical home desktop computer can supply.

The program has a current limit of 1 billion digits. A few minor modifications might allow the non-586 version to reach 2 billion digits of pi, but that has not yet been attempted. It's doubtful that the increase in run time would be worth the effort to go beyond that. Few current desktop systems would have the disk space, the memory, or the speed to compute 4 billion digits, much less try to take the world record (64g digits) away from Professor Kanada (Tokyo, Japan). The 1 (or 2) billion digit limitation is expected to stand.

It's difficult to calculate the exact number of operations that the program requires. Very roughly, the basic number of operations (integer and floating point) grows as C*N*Log2(N)^2, where N is the number of digits. For the Pentium version, the constant C is about 106. For the 486 version, it is about 55.5.

For the Pentium version, this implies that computing 1 million digits of pi takes 4.4e10 individual instructions, while the 1 billion digits of pi computation executed approximately 1.0e14 individual instructions. The check run, which used the 486 version, executed approximately 5.4e13 instructions.
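
Those figures follow directly from the formula; a tiny sketch that reproduces them, using the 'computer' sizes 2^20 and 2^30 for a million and a billion:

    /* Reproduces the rough instruction count estimates from
       ops ~= C * N * Log2(N)^2, with N in 'computer' units (2^20, 2^30). */
    #include <math.h>
    #include <stdio.h>

    static double ops(double c, double n)
    {
        double l = log(n) / log(2.0);
        return c * n * l * l;
    }

    int main(void)
    {
        printf("Pentium, 1 million digits: %.1e\n", ops(106.0, 1048576.0));
        printf("Pentium, 1 billion digits: %.1e\n", ops(106.0, 1073741824.0));
        printf("486,     1 billion digits: %.1e\n", ops( 55.5, 1073741824.0));
        return 0;
    }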

However, that's *very* misleading. Because the program is designed for the desktop, the amount of memory available and the disk access time are also important. The more digits you compute, the more the disk will be needed, and the additional I/O time raises the total growth above the purely computational growth given by the formula above.

The program actually has a number of pivot points where the amount of disk I/O increases, even though the number of operations changes only slightly. Depending upon the amount of memory available and the number of digits, the program may be able to do the full multiplication in memory, or it may have to start paging the 8 separate NTTs to disk. That extra disk I/O time will result in an increase in run time even though a system with more memory would require a comparable number of operations.

The total amount of time consumed by disk I/O is typically 20%-30%. Of the computation time, the eight NTT transforms take around 50-60% of the total. The other multiplication steps (in order of cost), such as the large CRT, releasing the carries, loading the NTTs, and the convolution, bring the full multiplication to about 95% of the total computation. Since those other steps grow nearly linearly, their relative importance will slowly decrease as the multiplication lengths increase, with the transforms taking a greater percentage of the computation.

The current version is v2.3 and is distributed with full C source under a 'copyrighted, but free for any use provided credit is given' policy. It can be freely used, but if you use it (or any ideas from it) in your own program, credit must be given.

It should be available at:

http://www.geocities.com/EnchantedForest/5815/program.html

-or-

http://www.isr.umd.edu/~jasonp/pipage.html

-or-

http://home.istar.ca/~lyster/pi.html

The archive contains a variety of multiplication modules for experimentation, and includes a pi computation tutorial and extensive documentation that describes much of the program's history. It developed from Carey's hobby rather than any explicit goal of computing 1 billion digits of pi.

Anybody can write a pi program when they have enough system resources. That's nearly trivial. The challenge is to do it on systems that shouldn't be capable of it! Sometimes it means being creative or even sacrificing performance (better to do it slowly than not at all), but it's the result that counts! It's FUN! Who cares what the 987,654,321st digit of pi is??! It's the challenge that's fun! This program may not be the fastest or most efficient or the most creative, or the best at anything, but it's the only desktop program we know of that's reached a billion digits!

June 21, 1999

Carey Bloodworth cbloodworth@juno.com
Dominique Delande delande@spectro.jussieu.fr