Parallel Capabilities in Maple 11

It's generally advisable to have a multiple-CPU machine, as otherwise parallelization makes processing slower. The machine used for this demo was an Athlon X2 dual core.

kernelopts(numcpus);
<Text-field style="Heading 1" layout="Heading 1">Capabilities/Limitations and the Package</Text-field>

Current parallel capabilities are limited to thread-safe Maple (library) programs and the kernel itself, which means:
External calls (calls to external compiled libraries) run sequentially, as there is no guarantee that external programs are re-entrant.
Maple library code will only work if there is no conflicting access to 'globals' (these can be read, and changed in an atomic way, but can be changed by any thread). The definition of 'globals' excludes only procedure locals and environment variables; it includes globals, module locals, remember tables, etc. The exception is code that uses a Mutex when accessing globals.

In summary, the current capabilities are limited to thread-safe Maple code that you write yourself, or library routines that do not violate the limitations described above.
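As a sketch of that Mutex exception, the following procedure guards updates to a shared global. This is illustration only, not part of the worksheet; the names total, m, and SafeAdd are invented:

```maple
# Sketch only: a Mutex serializing access to a shared global.
# 'total', 'm' and 'SafeAdd' are invented names for illustration.
total := 0:
m := Threads:-Mutex:-Create():
SafeAdd := proc(x)
    global total;
    Threads:-Mutex:-Lock(m);    # only one thread may hold the lock
    total := total + x;         # safe: no conflicting concurrent write
    Threads:-Mutex:-Unlock(m);
end proc:
Threads:-Wait(seq(Threads:-Create(SafeAdd(i)), i=1..4));
total;
```

Without the Lock/Unlock pair, two threads could read the old value of total at the same time and one of the updates would be lost.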
There are no direct parallel capabilities built into the kernel: the design allows multiple operations to be performed simultaneously, but the kernel itself does not launch multiple threads for large operations. Much of the Maple library code uses external libraries to perform certain operations (LinearAlgebra including Modular, numeric integration and ODE solving, string operations, etc.), so none of this code will run in parallel.

In order to use these capabilities, you need to be running the parallel kernel, which can be enabled via the options dialog (Tools → Options → Enable SMP Support, then restart), or by the -M option for TTY (there is no Classic interface support at this time).

The functionality is provided by the Threads package, which has two subpackages: Mutex (mutual-exclusion access to global data) and ConditionVariable (thread synchronization).

with(Threads);

Note that three of the common sequence-type operations (Add, Mul, and Seq) have parallel implementations in this package.

One final note: the threading implementation in the Windows API is missing significant functionality compared with the standard threading API. Those operations are emulated on Windows, so performance gains will be much smaller there.
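For instance, the parallel Seq can be used as a drop-in replacement for the usual seq. A minimal comparison (ithprime is just an arbitrary thread-safe example function, not taken from the worksheet):

```maple
with(Threads):
# The capitalized Seq distributes the evaluations across threads;
# the result is the same as the sequential kernel seq.
seq(ithprime(i), i=1..10);   # sequential kernel version
Seq(ithprime(i), i=1..10);   # parallel Threads version
```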
<Text-field style="Heading 1" layout="Heading 1">Simple Examples</Text-field>

Let's see how much of a timing improvement we can get with a large 'add' operation.

s1 := [seq(rand(),i=1..10000)]:
s1 := map(op,[s1$1000]):
nops(s1);

tt := time():
add(i,i=s1);
t1 := time()-tt;

tt := time():
Add(i,i=s1);
t2 := time()-tt;

t1/t2;

So the speed-up is fairly close to a factor of 2.
Timings are far less impressive for smaller problems:

s1 := [seq(rand(),i=1..1000)]:
s1 := map(op,[s1$1000]):
nops(s1);

tt := time():
add(i,i=s1);
t1 := time()-tt;

tt := time():
Add(i,i=s1);
t2 := time()-tt;

t1/t2;
<Text-field style="Heading 1" layout="Heading 1">Example of Thread Programming</Text-field>

For this example, consider that we are attempting to factor an integer by the trial division method.
The plan is to split the task across multiple threads, sending each a batch of primes to test.

Note that this is a somewhat naive implementation. A proper implementation would launch a new batch as soon as a thread finishes, instead of waiting for a whole round of batches to complete.

TrialDivisionFactor := module()
local ModuleApply,Trial, primes,batch,n,res;
# Primes up to 1000000
primes := [seq(ithprime(i),i=1..78498)]:
Trial := proc(low,thread) local i;
for i from low to min(nops(primes),low+batch-1) do
if irem(n,primes[i])=0 then
res[thread] := primes[i]; return;
end if;
end do;
res[thread] := 0;
end proc:
ModuleApply := proc(num,{numthreads::posint:=1},{batchsize::posint:=1000})
uses Threads;
local ids,i,j;
# Init
n := num; batch := batchsize;
res := Vector(numthreads);
for i from 1 by numthreads*batchsize to nops(primes) do
ids := seq(Create(Trial(i+(j-1)*batchsize,j)), j=1..numthreads);
Wait(ids);
if {seq(res[j],j=1..numthreads)}<>{0} then
return ({seq(res[j],j=1..numthreads)} minus {0})[1];
end if;
end do;
FAIL;
end proc:
end module:

Check timing for a single thread - one controller thread and batches of 1000 primes at a time:

tt := time():
TrialDivisionFactor(nextprime(1000000),numthreads=1);
time()-tt;

tt := time():
TrialDivisionFactor(nextprime(1000000),numthreads=2);
time()-tt;

tt := time():
TrialDivisionFactor(nextprime(1000000),numthreads=3);
time()-tt;

tt := time():
TrialDivisionFactor(nextprime(1000000),numthreads=4);
time()-tt;
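The "proper implementation" mentioned above, where a thread grabs a new batch as soon as it finishes rather than waiting for the whole round, could be sketched with a shared, mutex-guarded batch counter. All names below are invented; this is an untested illustration, not worksheet code:

```maple
# Sketch: each worker pulls its next batch from a shared counter, so no
# thread idles waiting for the slowest member of a round of batches.
# state[1] is the next unclaimed prime index, state[2] any factor found.
TrialWorker := proc(n, primes, batch, m, state)
    local low, i;
    do
        Threads:-Mutex:-Lock(m);
        if state[2] <> 0 or state[1] > nops(primes) then
            Threads:-Mutex:-Unlock(m);
            return;                    # done: factor found or primes exhausted
        end if;
        low := state[1];
        state[1] := low + batch;       # claim the next batch
        Threads:-Mutex:-Unlock(m);
        for i from low to min(nops(primes), low+batch-1) do
            if irem(n, primes[i]) = 0 then
                state[2] := primes[i]; # single atomic write is safe here
                return;
            end if;
        end do;
    end do;
end proc:
```

The controller would Create numthreads of these workers once, Wait on all of them, and read any factor out of state[2]; no thread ever blocks on another thread's batch.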
<Text-field style="Heading 1" layout="Heading 1">What is and isn't parallel?</Text-field>

We will use a simple program to run two simultaneous threads, in an attempt to determine which of the basic underlying operations can run in parallel.

RunTest := proc(func::procedure, arg::list)
uses Threads;
local t1,t2:
lprint(ssystem("date"));
t1 := time( [func(op(arg)),func(op(arg))] ):
lprint(ssystem("date"));
t2 := time( Wait(Create(func(op(arg))),
Create(func(op(arg)))) ):
lprint(ssystem("date"));
printf("seq=%6.3f, para=%6.3f, ratio=%5.3f\n",
t1,t2,t1/t2);
NULL;
end proc:

Large integer multiplication (GMP):

rr := rand(2^128):
nm := [seq(rr(),i=1..2000)]:

testmul := proc(nm) local i,j;
for i in nm do for j in nm do i*j end do end do:
end proc:

RunTest(testmul,[nm]);

Similar, using the kernel add:

testmul2 := proc() local i,j;
add(add(i*j,j=nm),i=nm);
end proc:

RunTest(testmul2,[]);

Polynomial multiplication:

nm := [seq(randpoly(x,degree=9,dense),i=1..2000)]:

RunTest(testmul,[nm]);

modp1 polynomial multiplication:

p := 23:
nm := [seq(modp1(ConvertIn(randpoly(x,degree=9,dense),x),p),i=1..2000)]:

testmul3 := proc(nm,p) local i,j;
for i in nm do for j in nm do
modp1(Multiply(i,j),p);
end do end do:
end proc:

RunTest(testmul3,[nm,p]);
<Text-field style="Heading 1" layout="Heading 1">Other Parallelism in Maple</Text-field>

At this time, the ATLAS BLAS used by Maple have multiprocessing for 2 processors built in. Consider:

M := LinearAlgebra:-RandomMatrix(8000,outputoptions=[datatype=float[8]]):

ssystem("date");
tt := time():
M.M:
time()-tt;
ssystem("date");

This entire operation runs in parallel, using 100% of both CPUs the whole time. Note that the elapsed duration is actually 2:18 = 138 sec, about half the CPU time.
The total CPU time for the non-parallel version is 265.04 sec, 2.6 sec shorter, but there the elapsed duration is the same as the CPU time.