Contents¶
MDCS using a local MATLAB client¶
To submit jobs to Apolo II or Cronos through a local MATLAB client using SLURM, follow these steps to set up the integration:
Integration scripts¶
Add the MATLAB integration scripts (matlab-apolo.zip) to your MATLAB PATH by placing them into the
$HOME/matlab-integration
directory.

Linux

mkdir $HOME/matlab-integration
mv path-to-file/matlab-apolo.zip $HOME/matlab-integration/
cd $HOME/matlab-integration
unzip matlab-apolo.zip
rm matlab-apolo.zip
Windows
To-Do
Open your MATLAB client to configure it.
(If the MATLAB client is installed in a system directory, we strongly suggest opening it with admin privileges; this is only necessary the first time, to configure it.)
Add the integration scripts to the MATLAB PATH
Press the “Set Path” button
Press the “Add with Subfolders” button, choose the directories where you unzipped the integration scripts (Apolo II and Cronos), and finally press the “Save” button:

/home/$USER/matlab-integration/apolo
/home/$USER/matlab-integration/cronos
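Alternatively, the integration scripts can be added to the MATLAB PATH from the command window instead of the GUI. This is a minimal sketch, assuming the scripts were unzipped into $HOME/matlab-integration on Linux as shown above:

>> % Add the integration scripts and their subfolders to the MATLAB PATH
>> addpath(genpath(fullfile(getenv('HOME'), 'matlab-integration')));
>> % Persist the change for future MATLAB sessions
>> savepath;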
Configuring cluster profiles¶
Open your MATLAB client again (without admin privileges).
Configure MATLAB to run parallel jobs on your cluster by calling
configCluster.

>> configCluster
Cluster FQDN (e.g. apolo.eafit.edu.co): cronos.eafit.edu.co
Username on Apolo (e.g. mgomezz): mgomezzul

>> % Must set TimeLimit before submitting jobs to Apolo II or
>> % Cronos cluster

>> % e.g. to set the TimeLimit and Partition
>> c = parcluster('apolo remote R2018a');
>> c.AdditionalProperties.TimeLimit = '1:00:00';
>> c.AdditionalProperties.Partition = 'longjobs';
>> c.saveProfile

>> % e.g. to set the NumGpus, TimeLimit and Partition
>> c = parcluster('apolo remote R2018a');
>> c.AdditionalProperties.TimeLimit = '1:00:00';
>> c.AdditionalProperties.Partition = 'accel';
>> c.AdditionalProperties.NumGpus = 2;
>> c.saveProfile
Custom options

- TimeLimit → Sets a limit on the total run time of the job allocation (more info).
  - e.g. c.AdditionalProperties.TimeLimit = '3-10:00:00';
- AccountName → Changes the default user account on SLURM.
  - e.g. c.AdditionalProperties.AccountName = 'apolo';
- ClusterHost → Another way to change the cluster hostname to submit jobs.
  - e.g. c.AdditionalProperties.ClusterHost = 'apolo.eafit.edu.co';
- EmailAddress → Gets all job notifications by e-mail.
  - e.g. c.AdditionalProperties.EmailAddress = 'apolo@eafit.edu.co';
- EmailType → Gets only the desired notifications, based on sbatch options.
  - e.g. c.AdditionalProperties.EmailType = 'END,TIME_LIMIT_50';
- MemUsage → Total amount of memory per machine (more info).
  - e.g. c.AdditionalProperties.MemUsage = '5G';
- NumGpus → Number of GPUs (double) to use in a job.
  - e.g. c.AdditionalProperties.NumGpus = 2;

Note

The maximum value for NumGpus is two; also, if you select this option you should use the 'accel' partition on Apolo II.

- Partition → Selects the desired partition to submit jobs (by default the longjobs partition is used).
  - e.g. c.AdditionalProperties.Partition = 'bigmem';
- Reservation → Submits a job into a reservation (more info).
  - e.g. c.AdditionalProperties.Reservation = 'reservationName';
- AdditionalSubmitArgs → Any valid sbatch parameter (raw) (more info).
  - e.g. c.AdditionalProperties.AdditionalSubmitArgs = '--no-requeue';
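Several of these options can be combined on a single cluster object before saving the profile. The following is a minimal sketch; the partition, memory and e-mail values are placeholders to adapt to your own case:

>> c = parcluster('apolo remote R2018a');
>> % Request 4 hours on the bigmem partition with 16 GB per machine
>> c.AdditionalProperties.TimeLimit = '4:00:00';
>> c.AdditionalProperties.Partition = 'bigmem';
>> c.AdditionalProperties.MemUsage = '16G';
>> % Placeholder address; notify only when the job ends
>> c.AdditionalProperties.EmailAddress = 'username@eafit.edu.co';
>> c.AdditionalProperties.EmailType = 'END';
>> c.saveProfile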
Submitting jobs¶
General steps¶
Load the 'apolo remote R2018a' cluster profile and set the desired properties to submit a job.
>> % Run cluster configuration
>> configCluster
Cluster FQDN (e.g. apolo.eafit.edu.co): cronos.eafit.edu.co
Username on Apolo (e.g. mgomezz): mgomezzul

>> c = parcluster('apolo remote R2018a');
>> c.AdditionalProperties.TimeLimit = '1:00:00';
>> c.AdditionalProperties.Partition = 'longjobs';
>> c.saveProfile
To see the values of the current configuration options, display the
AdditionalProperties
property.

>> % To view current properties
>> c.AdditionalProperties
To clear a value, assign the property an empty value ('', [], or false).

>> % Turn off email notifications
>> c.AdditionalProperties.EmailAddress = '';
If you have to cancel a job (queued or running), type:
>> j.cancel
Delete a job after results are no longer needed.
>> j.delete
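As a hedged sketch of routine cleanup (not part of the integration scripts), you can also list every job stored in the profile and delete all the finished ones in one call:

>> c = parcluster('apolo remote R2018a');
>> jobs = c.Jobs;
>> % Select the jobs whose State is 'finished' and delete them
>> finished = jobs(strcmp({jobs.State}, 'finished'));
>> delete(finished);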
Serial jobs¶
Use the batch command to submit asynchronous jobs to the cluster. The batch command will return a job object which is used to access the output of the submitted job.
(See the MATLAB documentation for more help on batch.)
function t = serial_example(n)
  t0 = tic;
  A = 500;
  a = zeros(n);
  for i = 1:n
    a(i) = max(abs(eig(rand(A))));
  end
  t = toc(t0);
end
>> % Get a handle to the cluster
>> c = parcluster('apolo remote R2018a');
>> % Submit the serial_example function as a job to the cluster
>> j = c.batch(@serial_example, 1, {1000});
>> % Query job for state
>> j.State
>> % Load results
>> j.fetchOutputs{:}
>> % Delete the job after results are no longer needed
>> j.delete
To retrieve a list of currently running or completed jobs, call
parcluster
to retrieve the cluster object. The cluster object stores an array of jobs that were run, are running, or are queued to run. This allows us to fetch the results of completed jobs. Retrieve and view the list of jobs as shown below.

>> c = parcluster('apolo remote R2018a');
>> jobs = c.Jobs
Once we have identified the job we want, we can retrieve the results as we have done previously.
fetchOutputs
is used to retrieve function output arguments; if using batch with a script, use
load
instead. Data that has been written to files on the cluster needs to be retrieved directly from the file system. To view the results of a previously completed job:
>> % Get a handle on the job with ID 2
>> j2 = c.Jobs(2);
>> j2.fetchOutputs{:}
Note
You can view a list of your jobs, as well as their IDs, using the above
c.Jobs
command.

Another example using a MATLAB script.
t0 = tic;
A = 500;
a = zeros(100);
fileID = fopen('/home/mgomezzul/time.txt','wt');
for i = 1:100
  a(i) = max(abs(eig(rand(A))));
end
t = toc(t0);
fprintf(fileID, '%6.4f\n', t);
fclose(fileID);
- Job submission
>> % Get a handle to the cluster
>> c = parcluster('apolo remote R2018a');
>> % Submit the script as a job to the cluster
>> j = c.batch('serial_example_script');
>> % Query job for state
>> j.State
>> % Load results into the client workspace
>> j.load
>> % Delete the job after results are no longer needed
>> j.delete
Another example using a MATLAB script that supports GPU.
maxIterations = 500;
gridSize = 1000;
xlim = [-0.748766713922161, -0.748766707771757];
ylim = [ 0.123640844894862,  0.123640851045266];

% Setup
t = tic();
x = gpuArray.linspace( xlim(1), xlim(2), gridSize );
y = gpuArray.linspace( ylim(1), ylim(2), gridSize );
[xGrid,yGrid] = meshgrid( x, y );
z0 = complex( xGrid, yGrid );
count = ones( size(z0), 'gpuArray' );

% Calculate
z = z0;
for n = 0:maxIterations
  z = z.*z + z0;
  inside = abs( z ) <= 2;
  count = count + inside;
end
count = log( count );

% Show
count = gather( count ); % Fetch the data back from the GPU
naiveGPUTime = toc( t );
fig = figure('visible', 'off');
fig.Position = [200 200 600 600];
imagesc( x, y, count );
colormap( [jet();flipud( jet() );0 0 0] );
axis off;
title( sprintf( '%1.2fsecs (GPU)', naiveGPUTime ) );
saveas(fig,'/home/mgomezzul/GPU.png');
- Job submission
>> % Get a handle to the cluster
>> c = parcluster('apolo remote R2018a');
>> % Submit the GPU script as a job to the cluster
>> j = c.batch('gpu_script');
>> % Query job for state
>> j.State
Another example using Simulink via MATLAB.
% Example running a Simulink model.
% The Simulink model is called |parsim_test.slx| and it *must be* in
% the cluster.

% Number of simulations
numSims = 10;
W = zeros(1,numSims);

% Changing to the |parsim_test.slx| path
cd /home/mgomezzul/tests/matlab/slurm

% Create an array of |SimulationInput| objects and specify the argument value
% for each simulation. The variable |x| is the input variable in the Simulink
% model.
for x = 1:numSims
  simIn(x) = Simulink.SimulationInput('parsim_test');
  simIn(x) = setBlockParameter(simIn(x), 'parsim_test/Transfer Fcn', ...
      'Denominator', num2str(x));
end

% Running the simulations.
simOut = parsim(simIn);

% The variable |y| is the output variable in the Simulink model.
for x = 1:numSims
  W(1,x) = max(simOut(x).get('y'));
end
save('/home/mgomezzul/output_file.mat','W');
parsim_test.slx (Simulink model)
- Job submission
>> % Get a handle to the cluster
>> c = parcluster('apolo remote R2018a');
>> % Submit the Simulink script as a job to the cluster
>> j = c.batch('parsim_test_script');
>> % Query job for state
>> j.State
>> % Load data into the client workspace
>> j.load
Parallel or distributed jobs¶
Users can also submit parallel or distributed workflows with the batch command. Let’s use the following example for a parallel job.
function t = parallel_example(n)
  t0 = tic;
  A = 500;
  a = zeros(n);
  parfor i = 1:n
    a(i) = max(abs(eig(rand(A))));
  end
  t = toc(t0);
end
We will use the batch command again, but since we are running a parallel job, we will also specify a MATLAB pool.
>> % Get a handle to the cluster
>> c = parcluster('apolo remote R2018a');
>> % Submit a batch pool job using 4 workers
>> j = c.batch(@parallel_example, 1, {1000}, 'Pool', 4);
>> % View current job status
>> j.State
>> % Fetch the results after a finished state is retrieved
>> j.fetchOutputs{:}

ans =

   41.7692
- The job ran in 41.7692 seconds using 4 workers.
Note
Note that these jobs will always request N+1 CPU cores, since one worker is required to manage the batch job and pool of workers. For example, a job that needs eight workers will consume nine CPU cores.
Note
For some applications, there will be a diminishing return when allocating too many workers, as the communication overhead may exceed the computation time.
We will run the same simulation, but increase the pool size. This time, to retrieve the results later, we will keep track of the job ID.
>> % Get a handle to the cluster
>> c = parcluster('apolo remote R2018a');
>> % Submit a batch pool job using 8 workers
>> j = c.batch(@parallel_example, 1, {1000}, 'Pool', 8);
>> % Get the job ID
>> id = j.ID

id =

     4

>> % Clear workspace, as though we quit MATLAB
>> clear
Once we have a handle to the cluster, we will call the
findJob
method to search for the job with the specified job ID.

>> % Get a handle to the cluster
>> c = parcluster('apolo remote R2018a');
>> % Find the old job
>> j = c.findJob('ID', 4);
>> % Retrieve the state of the job
>> j.State

ans =

finished

>> % Fetch the results
>> j.fetchOutputs{:}

ans =

   22.2082
The job now ran in 22.2082 seconds using 8 workers.
Run the code with a different number of workers to determine the ideal number to use, as sketched below.
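A minimal sketch of such a scaling test, assuming the parallel_example function shown above is available and the cluster profile is already configured:

>> c = parcluster('apolo remote R2018a');
>> poolSizes = [2 4 8 16];
>> times = zeros(size(poolSizes));
>> for k = 1:numel(poolSizes)
       % One batch pool job per pool size; wait for each to finish
       j = c.batch(@parallel_example, 1, {1000}, 'Pool', poolSizes(k));
       j.wait;
       out = j.fetchOutputs;
       times(k) = out{1};
       j.delete;
   end
>> [poolSizes; times]  % Compare run times against worker counts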
Another example using a parallel script.
n = 1000;
t0 = tic;
A = 500;
a = zeros(n);
fileID = fopen('/home/mgomezzul/time.txt','wt');
parfor i = 1:n
  a(i) = max(abs(eig(rand(A))));
end
t = toc(t0);
fprintf(fileID, '%6.4f\n', t);
fclose(fileID);
>> % Get a handle to the cluster
>> c = parcluster('apolo remote R2018a');
>> % Submit the parallel script as a batch pool job using 8 workers
>> j = c.batch('parallel_example_script', 'Pool', 8);
>> % Query job for state
>> j.State
>> % Load results
>> j.load
>> % Delete the job after results are no longer needed
>> j.delete
Debugging¶
If a serial job produces an error, we can call the
getDebugLog
method to view the error log file, passing it j.Tasks(1). Additionally, when submitting independent jobs with multiple tasks, you will have to specify the number of the task whose log you want.

>> % If necessary, retrieve output/error log file
>> j.Parent.getDebugLog(j.Tasks(1))
For pool jobs, do not dereference into the job object; pass the job object itself.
>> % If necessary, retrieve output/error log file
>> j.Parent.getDebugLog(j)
>> % or
>> c.getDebugLog(j)
To get information about the job in SLURM, we can consult the scheduler ID by calling
schedID
.

>> schedID(j)

ans =

       25539
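As a hedged sketch, when the MATLAB session runs on the cluster itself (so that the SLURM commands are on the PATH), the scheduler ID can be passed straight to squeue through a system call; this is an assumption about your environment, not part of the integration scripts:

>> % Ask SLURM about the job using its scheduler ID
>> [status, out] = system(sprintf('squeue -j %d', schedID(j)));
>> disp(out)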
MDCS using cluster’s MATLAB client¶
Submitting jobs from within MATLAB client on the cluster¶
General steps¶
Connect to Apolo II or Cronos via SSH.
# Without graphical user interface
ssh username@apolo.eafit.edu.co   # or username@cronos.eafit.edu.co

# Or with graphical user interface
ssh -X username@apolo.eafit.edu.co
Load the MATLAB modulefile.
module load matlab/r2018a
Run the MATLAB client.
matlab
The first time, you have to define the cluster profile by running the following command.
configCluster
Load the 'apolo' or 'cronos' cluster profile and set the desired properties to submit a job (MATLAB GUI or command line).
>> % Must set TimeLimit before submitting jobs to Apolo II or
>> % Cronos cluster

>> % e.g. to set the TimeLimit and Partition
>> c = parcluster('apolo/cronos');
>> c.AdditionalProperties.TimeLimit = '1:00:00';
>> c.AdditionalProperties.Partition = 'longjobs';
>> c.saveProfile

>> % or

>> % e.g. to set the NumGpus, TimeLimit and Partition
>> c = parcluster('apolo');
>> c.AdditionalProperties.TimeLimit = '1:00:00';
>> c.AdditionalProperties.Partition = 'accel';
>> c.AdditionalProperties.NumGpus = 2;
>> c.saveProfile
To see the values of the current configuration options, display the
AdditionalProperties
property.

>> % To view current properties
>> c.AdditionalProperties
To clear a value, assign the property an empty value ('', [], or false).

>> % Turn off email notifications
>> c.AdditionalProperties.EmailAddress = '';
Submitting jobs¶
Note
Users can submit serial, parallel or distributed jobs with the batch command, as in the previous examples and as sketched below.
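This is a minimal sketch, reusing the parallel_example function from the earlier section and assuming the 'apolo' profile was already created with configCluster:

>> c = parcluster('apolo');
>> % Submit a batch pool job using 4 workers
>> j = c.batch(@parallel_example, 1, {1000}, 'Pool', 4);
>> j.State
>> j.fetchOutputs{:}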
Submitting jobs directly through SLURM¶
MDCS jobs can be submitted directly from the Unix command line through SLURM.
For this, in addition to the MATLAB source, one needs to prepare a MATLAB submission script with the job specifications.
An example is shown below:
%==========================================================
% MATLAB job submission script: matlab_batch.m
%==========================================================
workers = str2num(getenv('SLURM_NTASKS'));
c = parcluster('apolo');
c.AdditionalProperties.TimeLimit = '1:00:00';
c.AdditionalProperties.Partition = 'longjobs';
j = c.batch(@parallel_example_slurm, 1, {1000}, 'Pool', workers);
exit;
function t = parallel_example_slurm(n)
  t0 = tic;
  A = 500;
  a = zeros(n);
  parfor i = 1:n
    a(i) = max(abs(eig(rand(A))));
  end
  t = toc(t0);
  save prueba.txt t -ascii
end
It is submitted to the queue with the help of the following SLURM batch-job submission script:
#!/bin/bash

#SBATCH -J test_matlab
#SBATCH -o test_matlab-%j.out
#SBATCH -e test_matlab-%j.err
#SBATCH -p longjobs
#SBATCH -n 8
#SBATCH -t 20:00

module load matlab/r2018a

matlab -nosplash -nodesktop -r "matlab_batch"
The job is submitted as usual with:
sbatch matlab.slurm
Note
This scheme dispatches two jobs: a serial job that spawns the actual MDCS parallel job, and the parallel job itself.
Once submitted, the job can be monitored and managed directly through the SLURM
squeue
command output.
After the job completes, one can fetch the results and delete the job object from within the MATLAB client on the cluster. If the program writes its results directly to disk, fetching is not necessary.
>> c = parcluster('apolo');
>> jobs = c.Jobs
>> j = c.Jobs(7);
>> j.fetchOutputs{:};
>> j.delete;
MATLAB directly on the cluster¶
The next steps describe how to use MATLAB and its toolboxes without the MDCS (workers) toolbox; this approach has the following pros and cons.
- Pros
- No workers limitations
- Cons
- No distributed jobs (Only parallel or serial jobs)
Unattended job¶
To run unattended jobs on the cluster, follow these steps:
Connect to Apolo II or Cronos via SSH.
ssh username@cronos.eafit.edu.co
Enter the MATLAB project directory.
cd ~/test/matlab/slurm
Create a SLURM batch-job submission script (e.g. slurm.sh).
#!/bin/bash

#SBATCH -J test_matlab
#SBATCH -o test_matlab-%j.out
#SBATCH -e test_matlab-%j.err
#SBATCH -p bigmem
#SBATCH -n 8
#SBATCH -t 20:00

module load matlab/r2018a

matlab -nosplash -nodesktop < parallel_example_unattended.m
p = parpool(str2num(getenv('SLURM_NTASKS')));

t0 = tic;
A = 500;
a = zeros(1000);
parfor i = 1:1000
  a(i) = max(abs(eig(rand(A))));
end
t = toc(t0)

exit
Submit the job.
sbatch slurm.sh
Check the
stdout
file (test_matlab-<jobID>.out).

MATLAB is selecting SOFTWARE OPENGL rendering.

                      < M A T L A B (R) >
            Copyright 1984-2018 The MathWorks, Inc.
             R2018a (9.4.0.813654) 64-bit (glnxa64)
                       February 23, 2018


To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.

>> Starting parallel pool (parpool) using the 'local' profile ... connected to 8 workers.
>> >> >> >> >> >>
t =

   22.5327
Interactive job (No GUI)¶
If necessary, the user can run interactive jobs by following these steps:
Connect to Apolo II or Cronos via SSH.
ssh username@apolo.eafit.edu.co
Submit an interactive request to the resource manager.
srun -N 1 --ntasks-per-node=2 -t 20:00 -p debug --pty bash

# If resources are available you immediately get a shell on a slave node,
# e.g. compute-0-6

module load matlab/r2018a
matlab
MATLAB is selecting SOFTWARE OPENGL rendering.

                      < M A T L A B (R) >
            Copyright 1984-2018 The MathWorks, Inc.
             R2018a (9.4.0.813654) 64-bit (glnxa64)
                       February 23, 2018


To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.

>> p = parpool(str2num(getenv('SLURM_NTASKS')));
Starting parallel pool (parpool) using the 'local' profile ...
>> p.NumWorkers

ans =

     2
Note
At this point you have an interactive MATLAB session through the resource manager (SLURM), giving you the possibility to test and check different MATLAB features.
To finish this job, close the MATLAB session and then the bash session granted on the slave node, as sketched below.
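A minimal sketch of a clean shutdown; the pool deletion line is optional, since exiting MATLAB also shuts the pool down. After MATLAB closes, type exit in the slave-node shell to release the allocation:

>> % Shut down the parallel pool, if one is still open
>> delete(gcp('nocreate'));
>> exit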
References¶
- Parallel Computing Toolbox
- MATLAB Distributed Computing Server
- “Portions of our documentation contain content originally created by Harvard FAS Research Computing and adapted by us under the Creative Commons Attribution-NonCommercial 4.0 International License. More information: https://rc.fas.harvard.edu/about/attribution/”