UMBC High Performance Computing Facility : Monitoring and Controlling Jobs on HPC
This page last changed on Jan 18, 2009 by straha1.
Now that you've learned to compile programs and submit jobs, you need to know how to monitor and delete them. (Make sure you read the QDel section of this page.) The PBS queuing system includes a number of programs for examining the PBS queue and for monitoring or controlling your jobs. This page discusses canceling a job with qdel, checking the status of your own job with qstat, and viewing all jobs in the queue with qstat.
For information on submitting jobs using QSub, see the second part of this tutorial or our QSub page, Using QSub. The qsub, qstat and qdel commands all have manual pages, which you can read with the UNIX man program, for example by typing man qstat.
For detailed information on QSub, QStat and QDel, see Running Jobs on HPC.
Occasionally you might realize that you mistyped an input parameter, gave the wrong executable name or made some other mistake. Rather than letting an incorrectly configured job run, you can cancel it by running qdel with the job number, for example qdel 3172.hpc.cl.rs.umbc.edu.
Here "3172.hpc.cl.rs.umbc.edu" is the job number returned from qsub. (If you forgot your job number, you can use qstat to find it.) QDel can cancel your job even after it has started running. It may take a minute or two for your job to be removed from the queue; you can use qstat to monitor the progress of the deletion.
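The cancel-then-monitor routine described above could be wrapped in a small shell function. This is only a sketch: the name cancel_and_wait and the 10-second default polling interval are made up for illustration, and it assumes that qstat exits with a non-zero status once the server no longer knows the job.

```shell
# cancel_and_wait JOBID [INTERVAL]
# Hypothetical helper: cancels a job with qdel, then polls qstat until
# the job is no longer listed. Assumes qstat exits non-zero once the
# job has left the queue.
cancel_and_wait() {
    jobid=$1
    interval=${2:-10}   # seconds between polls (arbitrary default)

    # Ask PBS to delete the job; give up if qdel itself fails.
    qdel "$jobid" || return 1

    # qstat succeeds for as long as the server still lists the job.
    while qstat "$jobid" >/dev/null 2>&1; do
        sleep "$interval"
    done
    echo "Job $jobid is no longer in the queue."
}
```

After defining the function you would call it as cancel_and_wait 3172.hpc.cl.rs.umbc.edu, substituting your own job number. If the server keeps finished jobs listed for a while, the loop simply waits until they disappear.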
Your job might sit in the queue for a while before it runs, depending on how many people are using the cluster. You can check the status of your job by running qstat with the job number, for example qstat 3172.hpc.cl.rs.umbc.edu.
Replace 3172.hpc.cl.rs.umbc.edu with whatever job number qsub returned. If your job is in the queue or running, qstat prints a short status listing for it that includes your user name and a one-letter state code.
An R indicates that your job is running. A Q indicates that your job is in the queue waiting to run. If qstat instead reports that it does not recognize the job number, then your job has either aborted, completed normally or been deleted.
You can get much more detailed information about your job using the -f option to qstat, for example qstat -f 3172.hpc.cl.rs.umbc.edu.
This prints extensive information, including the number of nodes used, the number of processors per node, which nodes were allocated, the queue, and much more.
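If you only want the state letter rather than the full listing, the detailed output can be filtered. The sketch below defines a hypothetical helper named job_state. It assumes that the qstat -f output contains a line of the form "job_state = R" and that qstat exits with a non-zero status when the job is no longer known; check the output on your own system before relying on it.

```shell
# job_state JOBID
# Hypothetical helper: prints the one-letter state (for example R or Q)
# taken from the job_state line of "qstat -f", or "gone" if qstat no
# longer knows the job.
job_state() {
    if info=$(qstat -f "$1" 2>/dev/null); then
        # A matching line looks like "    job_state = R"; field 3 is the letter.
        printf '%s\n' "$info" | awk '$1 == "job_state" { print $3 }'
    else
        echo "gone"
    fi
}
```

For example, job_state 3172.hpc.cl.rs.umbc.edu would print R while the job is running, Q while it waits, and gone once it has aborted, completed or been deleted.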
You can list all jobs in the queue by typing qstat by itself, without any job number or options.
You can see details about other people's jobs using the same qstat -f command described in the previous section. If you notice that the cluster is especially busy, you may wish to wait before trying to debug a new MPI program; otherwise you might be waiting an hour or more every time you start the program.
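To judge quickly how busy the cluster is, you could count the running and waiting jobs in the qstat listing. The function below is a rough sketch with a made-up name, queue_summary. It assumes that the first two lines of qstat's output are headers and that the state letter is the second-to-last column of each job line, which may not match the layout on every system.

```shell
# queue_summary
# Hypothetical helper: counts running (R) and queued (Q) jobs in the
# output of plain "qstat". Assumes two header lines and that the state
# letter is the second-to-last column of each job line.
queue_summary() {
    qstat | awk '
        NR > 2 && NF >= 2 {
            state = $(NF - 1)
            if (state == "R") running++
            else if (state == "Q") queued++
        }
        END { printf "running=%d queued=%d\n", running, queued }
    '
}
```

Running queue_summary prints a single line with the two counts. A large queued count suggests that a new debugging run may wait a long time before it starts.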
Document generated by Confluence on Mar 31, 2011 15:37